Random Survival Forests

Aleksandr Savenkov

11/21/2017

Introduction


  • An ensemble method for right-censored data

  • Extension of random forests

  • Based on H. Ishwaran, et al., Random Survival Forests, The Annals of Applied Statistics, 841-860, 2008

  • Cited in 2017 by articles in “Nature Genetics”, “Clinical Cancer Research”, “Nature Medicine”, “PLoS ONE”, etc.

Random forests



  • Two forms of randomization:

    • random bootstrapped sample
    • random subset of variables at each node split

  • Applications:

    • regression and classification problems
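
The two forms of randomization listed above can be illustrated with a minimal sketch. The helper names (bootstrap_indices, candidate_features) and the value of mtry are assumptions made for illustration, not part of any package.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def candidate_features(p, mtry, rng):
    """Pick mtry of the p variables, without replacement, for one node split."""
    return sorted(rng.sample(range(p), mtry))

rng = random.Random(0)       # fixed seed so the sketch is reproducible
n, p = 10, 9                 # hypothetical data size: 10 rows, 9 variables
mtry = 3                     # assumed here; about sqrt(p) is a common choice

in_bag = bootstrap_indices(n, rng)           # rows used to grow one tree
features = candidate_features(p, mtry, rng)  # variables tried at one node

print("in-bag rows:", in_bag)
print("candidate variables at this node:", features)
```

A new variable subset is drawn at every node, while the bootstrap sample is drawn once per tree.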

OOB samples



  • Probability that a data point is left out of a bootstrap sample, i.e. is out-of-bag (OOB): \[ \left(1 - \frac{1}{n}\right)^n \approx \exp(-1) \approx 0.368 \]

  • Each bootstrap (training) sample therefore contains approximately \(63.2\%\) of the distinct original observations
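
The OOB probability above can be checked numerically. This is a small sketch using only the standard library; the function names are illustrative assumptions.

```python
import math
import random

def oob_probability(n):
    """Exact probability that one given point is missed by n draws with replacement."""
    return (1.0 - 1.0 / n) ** n

def simulated_oob_fraction(n, rng):
    """Fraction of the n points that never appear in one simulated bootstrap sample."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n

for n in (10, 100, 1000, 10000):
    print(n, round(oob_probability(n), 4))
print("limit exp(-1):", round(math.exp(-1), 4))

rng = random.Random(1)
print("simulated OOB fraction:", round(simulated_oob_fraction(100000, rng), 3))
```

The exact value approaches exp(-1) as n grows, and a single simulated bootstrap sample of a large data set should land close to the same fraction.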

Survival data


  • common methods rely on restrictive assumptions, e.g. proportional hazards

  • nonlinear effects are detected with ad hoc methods

  • identifying interaction terms is problematic
    • brute force for all two-way or three-way interactions
    • subject knowledge to narrow the search

Forest approaches


  • Hothorn et al., survival ensembles
    • R packages: party and partykit

  • Zhu and Kosorok, recursively imputed survival tree

  • Ishwaran et al., random survival forests
    • R package: randomForestSRC

RSF algorithm

  • Draw B bootstrap samples from the original data

  • Grow a survival tree for each bootstrap sample

  • Calculate a cumulative hazard function (CHF) for each tree

  • Average to obtain the ensemble CHF

  • Calculate prediction error for the ensemble CHF, using OOB data
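
The averaging steps above can be sketched for a single case at a single time point. The per-tree CHF values and in-bag flags below are made-up numbers, and the function names are assumptions; the OOB version averages only over trees whose bootstrap sample did not contain the case.

```python
def ensemble_chf(tree_chfs):
    """Average the per-tree CHF values over all trees."""
    return sum(tree_chfs) / len(tree_chfs)

def oob_ensemble_chf(tree_chfs, in_bag):
    """Average only over trees for which the case was out-of-bag.

    in_bag[b] is True if the case was in the bootstrap sample of tree b.
    Returns None if the case was in-bag for every tree.
    """
    oob_values = [h for h, used in zip(tree_chfs, in_bag) if not used]
    if not oob_values:
        return None
    return sum(oob_values) / len(oob_values)

# Hypothetical CHF values from B = 4 trees for one case at one time point
tree_chfs = [0.20, 0.40, 0.30, 0.50]
in_bag = [True, False, True, False]

print("ensemble CHF:", ensemble_chf(tree_chfs))
print("OOB ensemble CHF:", oob_ensemble_chf(tree_chfs, in_bag))
```

Because the OOB ensemble never uses a tree that saw the case during training, the prediction error computed from it does not require a separate test set.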

Survival tree

  • Survival trees are binary trees

  • Start at the root node

  • Split each node into two daughter nodes using a survival splitting criterion
    • log-rank: splits by maximization of the log-rank test statistic
    • log-rank score: standardized log-rank statistic
    • custom (requires some effort)
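
As a rough illustration of the log-rank criterion above, the sketch below computes the absolute two-sample log-rank statistic for one candidate split and compares two splits on a toy data set. The function name, the toy data, and the quadratic-time loops are assumptions chosen for readability, not how any package implements it.

```python
import math

def log_rank_statistic(times, events, in_left):
    """Absolute log-rank statistic for a split into left/right daughter nodes.

    times: observed times; events: 1 = event, 0 = censored;
    in_left: True if the case is sent to the left daughter.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    numerator = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        at_risk_left = sum(1 for u, l in zip(times, in_left) if l and u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e == 1 and u == t)
        deaths_left = sum(1 for u, e, l in zip(times, events, in_left)
                          if l and e == 1 and u == t)
        # observed minus expected events in the left daughter
        numerator += deaths_left - at_risk_left * deaths / at_risk
        if at_risk > 1:
            frac = at_risk_left / at_risk
            variance += frac * (1 - frac) * (at_risk - deaths) / (at_risk - 1) * deaths
    if variance == 0:
        return 0.0
    return abs(numerator) / math.sqrt(variance)

# Toy node: six cases, all events, observed at times 1..6
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
split_a = [True, True, True, False, False, False]   # early vs late events
split_b = [True, False, True, False, True, False]   # alternating cases

stat_a = log_rank_statistic(times, events, split_a)
stat_b = log_rank_statistic(times, events, split_b)
print("split A:", round(stat_a, 3), "split B:", round(stat_b, 3))
```

The split that separates early from late events gives the larger statistic, so it is the one a log-rank splitting rule would prefer among these two candidates.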

Survival tree



  • each terminal node should contain at least \(d_0 > 0\) unique events

  • CHF is estimated by a Nelson-Aalen estimator
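
For reference, the Nelson-Aalen estimate within a terminal node is the running sum of (number of events) / (number at risk) over the distinct event times. A minimal sketch, with a made-up node of five cases and an assumed function name:

```python
def nelson_aalen(times, events):
    """Return [(t, H(t))] at each distinct event time t in the node.

    times: observed times; events: 1 = event, 0 = censored.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    chf = []
    cumulative = 0.0
    for t in event_times:
        deaths = sum(1 for u, e in zip(times, events) if e == 1 and u == t)
        at_risk = sum(1 for u in times if u >= t)
        cumulative += deaths / at_risk   # hazard increment at time t
        chf.append((t, cumulative))
    return chf

# Hypothetical terminal node: 3 events, 2 censored cases
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

for t, h in nelson_aalen(times, events):
    print(t, round(h, 2))
```

Every case that falls into the same terminal node receives the same CHF.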

Prediction error



  • Harrell’s concordance index, C-index (Harrell et al., 1982)

  • C-index specifically accounts for censoring

Concordance index

  • Form all possible pairs of cases over the data

  • Omit a pair if the shorter survival time is censored

  • Omit \((i, j)\) if \(T_i = T_j\) unless at least one is an event

  • For a permissible pair:
    • for \(T_i \neq T_j\): count 1 if the shorter survival time has the worse predicted outcome, 0.5 if the predicted outcomes are tied
    • for \(T_i = T_j\), both events: count 1 if the predicted outcomes are tied, 0.5 otherwise
    • for \(T_i = T_j\), only one an event: count 1 if the event has the worse predicted outcome, 0.5 otherwise
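
The counting rules above can be turned into a short sketch. It assumes that a larger risk value means a worse predicted outcome and that C is the total count divided by the number of permissible pairs, with prediction error taken as 1 - C as in Ishwaran et al. (2008). The function name and toy data are illustrative only.

```python
def concordance_index(times, events, risk):
    """Harrell's C using the pair-counting rules listed above.

    times: observed times; events: 1 = event, 0 = censored;
    risk: predicted outcome, larger = worse (assumption of this sketch).
    """
    concordance = 0.0
    permissible = 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] != times[j]:
                # s has the shorter time, l the longer one
                s, l = (i, j) if times[i] < times[j] else (j, i)
                if events[s] == 0:
                    continue  # shorter time is censored: omit the pair
                permissible += 1
                if risk[s] > risk[l]:
                    concordance += 1.0
                elif risk[s] == risk[l]:
                    concordance += 0.5
            else:
                if events[i] == 0 and events[j] == 0:
                    continue  # tied times with no event: omit the pair
                permissible += 1
                if events[i] == 1 and events[j] == 1:
                    concordance += 1.0 if risk[i] == risk[j] else 0.5
                else:
                    e, c = (i, j) if events[i] == 1 else (j, i)
                    concordance += 1.0 if risk[e] > risk[c] else 0.5
    return concordance / permissible if permissible else float("nan")

# Toy data: the third case is censored
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
good_risk = [4, 3, 2, 1]   # earlier events get higher predicted risk

c = concordance_index(times, events, good_risk)
print("C-index:", c, "prediction error:", 1 - c)
```

A value of C near 0.5 corresponds to predictions no better than chance, and a value near 1 to near-perfect ranking.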

Data description


  • breast - Wisconsin Prognostic Breast Cancer Data

    • 198 observations with 32 covariates
    • outcome: R = recurrent, N = non-recurrent
    • the first 30 covariates are computed from a digitized image
  • breast10 - breast plus 10 uniform iid random variables

  • breast50 - breast plus 50 uniform iid random variables

Simulations (from Ishwaran, 2008)



References


  1. H. Ishwaran, et al., Random Survival Forests, The Annals of Applied Statistics, 841-860, 2008
  2. L. Breiman, Random Forests, Machine Learning, 5-32, 2001
  3. H. Ishwaran and U. B. Kogalur, randomForestSRC: Random Forests for Survival, Regression, and Classification, R package version 2.5.0, 2017
  4. T. Hothorn, et al., Survival Ensembles, Biostatistics, 355-373, 2006
  5. R. Zhu and M. Kosorok, Recursively Imputed Survival Trees, JASA, 331-340, 2012