Analyses can slow to a crawl when models need hours to run. In this
article you will find a few tricks to prevent this bottleneck when using
orsf().
control
The default control for orsf() is NULL because, if unspecified, orsf()
will pick the fastest possible control for you depending on the type
of forest being grown. The difference in run-time between the default
control and other approaches can be striking. For example:
time_fast <- system.time(
  expr = orsf(pbc_orsf,
              formula = time+status~. -id,
              n_tree = 5)
)
time_net <- system.time(
  expr = orsf(pbc_orsf,
              formula = time+status~. -id,
              control = orsf_control_survival(method = 'net'),
              n_tree = 5)
)
# unspecified control is much faster
time_net['elapsed'] / time_fast['elapsed']
#> elapsed
#> 107.1111
n_thread
The n_thread argument uses multi-threading to run aorsf functions in
parallel when possible. If you know how many threads you want, e.g. you
want exactly 5, set n_thread = 5. If you aren’t sure how many threads
you have available but want to use a feasible number, using
n_thread = 0 (the default) tells aorsf to do that for you.
# automatically pick number of threads based on amount available
orsf(pbc_orsf,
     formula = time+status~. -id,
     n_tree = 5,
     n_thread = 0)
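If you would rather leave a core free for other work, one option is to count the cores yourself and pass an explicit value. The following is a minimal sketch, assuming the parallel package that ships with R is available. The names n_avail and n_use are illustrative, and leaving one core free is a choice made here, not a recommendation from aorsf.

```r
library(aorsf)

# detectCores() can return NA on some platforms, so fall back to 1
n_avail <- parallel::detectCores()
if (is.na(n_avail)) n_avail <- 1L

# leave one core free, but never request fewer than one thread
n_use <- max(1L, n_avail - 1L)

fit <- orsf(pbc_orsf,
            formula = time + status ~ . - id,
            n_tree = 5,
            n_thread = n_use)
```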
Note: sometimes multi-threading is not possible. For example, because
R is a single-threaded language, multi-threading cannot be applied when
orsf() needs to call R functions from C++, which occurs when a
customized R function is used to find linear combinations of variables
or compute prediction accuracy.
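To make the note above concrete, here is a sketch of a forest that relies on a customized R function to find linear combinations. It assumes a version of aorsf in which orsf_control_survival() accepts a function for method, where that function takes inputs named x_node, y_node and w_node and returns a one-column matrix with one row per predictor. The function f_rando is a hypothetical example that draws random coefficients.

```r
library(aorsf)

# hypothetical user-supplied function: one random coefficient per predictor,
# returned as a matrix with a single column
f_rando <- function(x_node, y_node, w_node) {
  matrix(runif(ncol(x_node)), ncol = 1)
}

# the forest must call f_rando (an R function) from C++ at each split,
# so this is a case where multi-threading cannot be applied
fit_rando <- orsf(pbc_orsf,
                  formula = time + status ~ . - id,
                  control = orsf_control_survival(method = f_rando),
                  n_tree = 5)
```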
There are some inputs in orsf() that can be adjusted to make it run faster:
- set n_retry to 0
- set oobag_pred_type to 'none'
- set importance to 'none'
- increase split_min_events, split_min_obs, leaf_min_events, or leaf_min_obs to make trees stop growing sooner
- increase split_min_stat to enforce more strict requirements for growing deeper trees.
Applying these tips:
orsf(pbc_orsf,
     formula = time+status~. -id,
     n_thread = 0,
     n_tree = 5,
     n_retry = 0,
     oobag_pred_type = 'none',
     importance = 'none',
     split_min_events = 20,
     leaf_min_events = 10,
     split_min_stat = 10)
While modifying these inputs can make orsf() run faster,
they can also impact prediction accuracy.
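To see how much time these settings save on your own data, you can time a default fit against a leaner one. This is only a sketch: the choice of 50 trees and the names time_default and time_lean are illustrative, and the ratio will vary from run to run and machine to machine.

```r
library(aorsf)

# baseline: default settings
time_default <- system.time(
  orsf(pbc_orsf,
       formula = time + status ~ . - id,
       n_tree = 50)
)

# leaner fit: skip retries, out-of-bag predictions and importance,
# and stop growing trees sooner
time_lean <- system.time(
  orsf(pbc_orsf,
       formula = time + status ~ . - id,
       n_tree = 50,
       n_retry = 0,
       oobag_pred_type = 'none',
       importance = 'none',
       split_min_events = 20,
       leaf_min_events = 10,
       split_min_stat = 10)
)

# values above 1 mean the leaner fit was faster
time_default['elapsed'] / time_lean['elapsed']
```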
Setting verbose_progress = TRUE doesn’t make anything
run faster, but it can help make it feel like things are
running less slowly.
Instead of running a model and hoping it will be fast, you can
estimate how long a specification of that model will take by using
no_fit = TRUE in the call to orsf().
fit_spec <- orsf(pbc_orsf,
                 formula = time+status~. -id,
                 control = orsf_control_survival(method = 'net'),
                 n_tree = 2000,
                 no_fit = TRUE)
# how much time it takes to estimate training time:
system.time(
time_est <- orsf_time_to_train(fit_spec, n_tree_subset = 5)
)
#>    user  system elapsed
#>   0.287   0.000   0.287
# the estimated training time:
time_est
#> Time difference of 114.4179 secs
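One way to act on the estimate is to compare it with a time budget before committing to the full fit. The sketch below assumes that orsf_time_to_train() returns a difftime, as the printed output above suggests, and that orsf_train() is used to fit the specification. The 60 second budget and the names budget_secs and est_secs are hypothetical.

```r
library(aorsf)

# specify the forest without fitting it
fit_spec <- orsf(pbc_orsf,
                 formula = time + status ~ . - id,
                 n_tree = 500,
                 no_fit = TRUE)

# estimate the training time from a small subset of trees
time_est <- orsf_time_to_train(fit_spec, n_tree_subset = 5)

# convert the estimate to seconds, whatever units it was reported in
est_secs <- as.numeric(time_est, units = "secs")

# hypothetical budget: only train if the estimate is within 60 seconds
budget_secs <- 60

if (est_secs <= budget_secs) {
  fit <- orsf_train(fit_spec)
} else {
  message("Estimated training time (", round(est_secs),
          " secs) exceeds the budget; consider fewer trees or a faster control.")
}
```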