--- title: "PD and ICE curves with ORSF" description: > Learn how to interpret ORSF models using partial dependence. output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{PD and ICE curves with ORSF} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 5, fig.width = 7, warning = FALSE, message = FALSE ) ``` # Partial dependence (PD) `r aorsf:::roxy_pd_explain()` ```{r} library(aorsf) library(ggplot2) ``` You can compute PD and individual conditional expectation (ICE) in three ways: - using in-bag predictions for the training data. In-bag PD indicates relationships that the model has learned during training. This is helpful if your goal is to interpret the model. - using out-of-bag predictions for the training data. Out-of-bag PD indicates relationships that the model has learned during training but using the out-of-bag data simulates application of the model to new data. This is helpful if you want to test your model's reliability or fairness in new data but you don't have access to a large testing set. - using predictions for a new set of data. New data PD shows how the model predicts outcomes for observations it has not seen. This is helpful if you want to test your model's reliability or fairness. ## Classification Begin by fitting an oblique classification random forest: ```{r} set.seed(329) index_train <- sample(nrow(penguins_orsf), 150) penguins_orsf_train <- penguins_orsf[index_train, ] penguins_orsf_test <- penguins_orsf[-index_train, ] fit_clsf <- orsf(data = penguins_orsf_train, formula = species ~ .) ``` Compute PD using out-of-bag data for `flipper_length_mm = c(190, 210)`. ```{r} pred_spec <- list(flipper_length_mm = c(190, 210)) pd_oob <- orsf_pd_oob(fit_clsf, pred_spec = pred_spec) pd_oob ``` Note that predicted probabilities are returned for each class and probabilities in the `mean` column sum to 1 if you take the sum over each class at a specific value of the `pred_spec` variables. For example, ```{r} sum(pd_oob[flipper_length_mm == 190, mean]) ``` But this isn't the case for the median predicted probability! ```{r} sum(pd_oob[flipper_length_mm == 190, medn]) ``` ## Regression Begin by fitting an oblique regression random forest: ```{r} set.seed(329) index_train <- sample(nrow(penguins_orsf), 150) penguins_orsf_train <- penguins_orsf[index_train, ] penguins_orsf_test <- penguins_orsf[-index_train, ] fit_regr <- orsf(data = penguins_orsf_train, formula = bill_length_mm ~ .) ``` Compute PD using new data for `flipper_length_mm = c(190, 210)`. ```{r} pred_spec <- list(flipper_length_mm = c(190, 210)) pd_new <- orsf_pd_new(fit_regr, pred_spec = pred_spec, new_data = penguins_orsf_test) pd_new ``` You can also let `pred_spec_auto` pick reasonable values like so: ```{r} pred_spec = pred_spec_auto(species, island, body_mass_g) pd_new <- orsf_pd_new(fit_regr, pred_spec = pred_spec, new_data = penguins_orsf_test) pd_new ``` By default, all combinations of all variables are used. However, you can also look at the variables one by one, separately, like so: ```{r} pd_new <- orsf_pd_new(fit_regr, expand_grid = FALSE, pred_spec = pred_spec, new_data = penguins_orsf_test) pd_new ``` And you can also bypass all the bells and whistles by using your own `data.frame` for a `pred_spec`. (Just make sure you request values that exist in the training data.) ```{r} custom_pred_spec <- data.frame(species = 'Adelie', island = 'Biscoe') pd_new <- orsf_pd_new(fit_regr, pred_spec = custom_pred_spec, new_data = penguins_orsf_test) pd_new ``` ## Survival Begin by fitting an oblique survival random forest: ```{r} set.seed(329) index_train <- sample(nrow(pbc_orsf), 150) pbc_orsf_train <- pbc_orsf[index_train, ] pbc_orsf_test <- pbc_orsf[-index_train, ] fit_surv <- orsf(data = pbc_orsf_train, formula = Surv(time, status) ~ . - id, oobag_pred_horizon = 365.25 * 5) ``` Compute PD using in-bag data for `bili = c(1,2,3,4,5)`: ```{r} pd_train <- orsf_pd_inb(fit_surv, pred_spec = list(bili = 1:5)) pd_train ``` If you don't have specific values of a variable in mind, let `pred_spec_auto` pick for you: ```{r} pd_train <- orsf_pd_inb(fit_surv, pred_spec_auto(bili)) pd_train ``` Specify `pred_horizon` to get PD at each value: ```{r} pd_train <- orsf_pd_inb(fit_surv, pred_spec_auto(bili), pred_horizon = seq(500, 3000, by = 500)) pd_train ``` ## One variable, moving horizon For the next few sections, we update `orsf_fit` to include all the data in `pbc_orsf` instead of just the training sample: ```{r} # a rare case of modify_in_place = TRUE orsf_update(fit_surv, data = pbc_orsf, modify_in_place = TRUE) fit_surv ``` What if the effect of a predictor varies over time? Partial dependence can show this. ```{r} pd_sex_tv <- orsf_pd_oob(fit_surv, pred_spec = pred_spec_auto(sex), pred_horizon = seq(365, 365*5)) ggplot(pd_sex_tv) + aes(x = pred_horizon, y = mean, color = sex) + geom_line() + labs(x = 'Time since baseline', y = 'Expected risk') ``` From inspection, we can see that males have higher risk than females and the difference in that risk grows over time. This can also be seen by viewing the ratio of expected risk over time: ```{r} library(data.table) ratio_tv <- pd_sex_tv[ , .(ratio = mean[sex == 'm'] / mean[sex == 'f']), by = pred_horizon ] ggplot(ratio_tv, aes(x = pred_horizon, y = ratio)) + geom_line(color = 'grey') + geom_smooth(color = 'black', se = FALSE) + labs(x = 'time since baseline', y = 'ratio in expected risk for males versus females') ``` To get a view of PD for any number of variables in the training data, use `orsf_summarize_uni()`. This function computes out-of-bag PD for the most important `n_variables` and returns a nicely formatted view of the output: ```{r} pd_smry <- orsf_summarize_uni(fit_surv, n_variables = 4) pd_smry ``` This 'summary' object can be converted into a `data.table` for downstream plotting and tables. ```{r} head(as.data.table(pd_smry)) ``` ## Multiple variables, jointly Partial dependence can show the expected value of a model's predictions as a function of a specific predictor, or as a function of multiple predictors. For instance, we can estimate predicted risk as a joint function of `bili`, `edema`, and `trt`: ```{r} pred_spec = pred_spec_auto(bili, edema, trt) pd_bili_edema <- orsf_pd_oob(fit_surv, pred_spec) ggplot(pd_bili_edema) + aes(x = bili, y = medn, col = trt, linetype = edema) + geom_line() + labs(y = 'Expected predicted risk') ``` From inspection, - the model's predictions indicate slightly lower risk for the placebo group, and these do not seem to change much at different values of `bili` or `edema`. - There is a clear increase in predicted risk with higher levels of `edema` and with higher levels of `bili` - the slope of predicted risk as a function of `bili` appears highest among patients with `edema` of 0.5. Is the effect of `bili` modified by `edema` being 0.5? A quick sanity check with `coxph` suggests there is. ```{r} library(survival) pbc_orsf$edema_05 <- ifelse(pbc_orsf$edema == '0.5', 'yes', 'no') fit_cph <- coxph(Surv(time,status) ~ edema_05 * bili, data = pbc_orsf) anova(fit_cph) ``` ```{r, echo = FALSE} # drop so it won't interfere with downstream code pbc_orsf$edema_05 <- NULL ``` ## Find interactions using PD Random forests are good at using interactions, but less good at telling you about them. Use `orsf_vint()` to apply the method for variable interaction scoring with PD described by Greenwell et al (2018). This can take a little while if you have lots of predictors, and it seems to work best with continuous by continuous interactions. Interactions with categorical variables are sometimes over- or under- scored. ```{r} # use just the continuous variables preds <- names(fit_surv$get_means()) vint_scores <- orsf_vint(fit_surv, predictors = preds) vint_scores ``` The scores include partial dependence values that you can pull out and plot: ```{r} # top scoring interaction pd_top <- vint_scores$pd_values[[1]] # center pd values so it's easier to see the interaction effect pd_top[, mean := mean - mean[1], by = var_2_value] ggplot(pd_top) + aes(x = var_1_value, y = mean, color = factor(var_2_value), group = factor(var_2_value)) + geom_line() + labs(x = "albumin", y = "predicted mortality (centered)", color = "protime") ``` Again we use a sanity check with `coxph` to see if these interactions are detected using a standard test: ```{r} # test the top score (expect strong interaction) fit_cph <- coxph(Surv(time,status) ~ albumin * protime, data = pbc_orsf) anova(fit_cph) ``` *Note*: Caution is warranted when interpreting statistical hypotheses that are motivated by the same data they are tested with. Results like the p-values for interaction shown above should be interpreted as exploratory. # Individual conditional expectations (ICE) `r aorsf:::roxy_ice_explain()` ## Classification Compute ICE using out-of-bag data for `flipper_length_mm = c(190, 210)`. ```{r} pred_spec <- list(flipper_length_mm = c(190, 210)) ice_oob <- orsf_ice_oob(fit_clsf, pred_spec = pred_spec) ice_oob ``` There are two identifiers in the output: - `id_variable` is an identifier for the current value of the variable(s) that are in the data. It is redundant if you only have one variable, but helpful if there are multiple variables. - `id_row` is an identifier for the observation in the original data. Note that predicted probabilities are returned for each class and each observation in the data. Predicted probabilities for a given observation and given variable value sum to 1. For example, ```{r, eval=FALSE} ice_oob %>% .[flipper_length_mm == 190] %>% .[id_row == 1] %>% .[['pred']] %>% sum() ``` ```{r echo=FALSE} 1 ``` ## Regression Compute ICE using new data for `flipper_length_mm = c(190, 210)`. ```{r} pred_spec <- list(flipper_length_mm = c(190, 210)) ice_new <- orsf_ice_new(fit_regr, pred_spec = pred_spec, new_data = penguins_orsf_test) ice_new ``` You can also let `pred_spec_auto` pick reasonable values like so: ```{r} pred_spec = pred_spec_auto(species, island, body_mass_g) ice_new <- orsf_ice_new(fit_regr, pred_spec = pred_spec, new_data = penguins_orsf_test) ice_new ``` By default, all combinations of all variables are used. However, you can also look at the variables one by one, separately, like so: ```{r} ice_new <- orsf_ice_new(fit_regr, expand_grid = FALSE, pred_spec = pred_spec, new_data = penguins_orsf_test) ice_new ``` And you can also bypass all the bells and whistles by using your own `data.frame` for a `pred_spec`. (Just make sure you request values that exist in the training data.) ```{r} custom_pred_spec <- data.frame(species = 'Adelie', island = 'Biscoe') ice_new <- orsf_ice_new(fit_regr, pred_spec = custom_pred_spec, new_data = penguins_orsf_test) ice_new ``` ## Survival Compute ICE using in-bag data for `bili = c(1,2,3,4,5)`: ```{r} ice_train <- orsf_ice_inb(fit_surv, pred_spec = list(bili = 1:5)) ice_train ``` If you don't have specific values of a variable in mind, let `pred_spec_auto` pick for you: ```{r} ice_train <- orsf_ice_inb(fit_surv, pred_spec_auto(bili)) ice_train ``` Specify `pred_horizon` to get ICE at each value: ```{r} ice_train <- orsf_ice_inb(fit_surv, pred_spec_auto(bili), pred_horizon = seq(500, 3000, by = 500)) ice_train ``` Multi-prediction horizon ice comes with minimal extra computational cost. Use a fine grid of time values and assess whether predictors have time-varying effects. ## Visualizing ICE curves Inspecting the ICE curves for each observation can help identify whether there is heterogeneity in a model's predictions. I.e., does the effect of the variable follow the same pattern for all the data, or are there groups where the variable impacts risk differently? I am going to turn off boundary checking in `orsf_ice_oob` by setting `boundary_checks = FALSE`, and this will allow me to generate ICE curves that go beyond the 90th percentile of `bili`. ```{r} pred_spec <- list(bili = seq(1, 10, length.out = 25)) ice_oob <- orsf_ice_oob(fit_surv, pred_spec, boundary_checks = FALSE) ice_oob ``` For plots, it is helpful to scale the ICE data. I subtract the initial value of predicted risk (i.e., when `bili = 1`) from each observation's conditional expectation values. So, - Every curve start at 0 - The plot shows *change* in predicted risk as a function of `bili`. ```{r} ice_oob[, pred_subtract := rep(pred[id_variable==1], times=25)] ice_oob[, pred := pred - pred_subtract] ``` Now we can visualize the curves. ```{r, -orsf_ice} ggplot(ice_oob, aes(x = bili, y = pred, group = id_row)) + geom_line(alpha = 0.15) + labs(y = 'Change in predicted risk') + geom_smooth(se = FALSE, aes(group = 1)) ``` From inspection of the figure, - Most of the individual slopes cluster around the overall trend - Good! - A small number of individual slopes appear to be flat. It may be helpful to investigate this further. ## Limitations of PD `r aorsf:::roxy_pd_limitations()` ## References 1. `r aorsf:::cite("hooker_2021")`