Ensemble methods

Alex Savenkov

05/23/2017

Intro

Consider a standard supervised learning problem: given training data \(\{(x_i, y_i)\}_{i=1}^n\), learn a predictor \(h(x)\) of the response \(y\). An ensemble combines several such predictors into a single, usually stronger, model.

Ensemble

pros and cons



  • benefits
    • good generalization ability, often better than any single base learner

  • challenges
    • computationally expensive to train and tune

Ensemble learning models



  • boosting

  • bagging

  • random forests

Methods


  • averaging

  • voting

  • stacking

Averaging

  • averaging is the most popular and fundamental combination method for numeric outputs

  • suppose we have a set of learners \(\{h_1, h_2,...,h_N\}\); the combined output \(H(x)\) is then \[ H(x) = \frac{1}{N}\sum_{i=1}^Nh_i(x) \]
  • the weighted average is given by \[ H(x) =\sum_{i=1}^N w_ih_i(x) \quad \text{with} \ w_i \geq 0 \ \text{and} \ \sum_{i=1}^Nw_i = 1 \]
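
A minimal R sketch of both rules; the matrix of predictions is illustrative toy data:

# toy predictions from N = 3 base learners on 5 test points
P <- rbind(h1 = c(2.1, 3.0, 4.2, 5.1, 6.3),
           h2 = c(1.9, 3.2, 3.8, 5.4, 6.0),
           h3 = c(2.3, 2.9, 4.0, 4.9, 6.1))

# simple average: H(x) = (1/N) * sum_i h_i(x)
H_simple <- colMeans(P)

# weighted average with w_i >= 0 and sum(w_i) = 1
w <- c(0.5, 0.3, 0.2)
H_weighted <- colSums(w * P)   # w_i multiplies row i of P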

Voting (example)


  • consider a binary classification problem with 10 samples, all with response 1: \(Y = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)\)

  • three learners: M1, M2, M3

  • each classifier has accuracy 0.7, and we assume their errors are independent

Voting (cont.)



  • P(all three correct) = \(0.7\times0.7\times0.7 = 0.343\)

  • P(exactly two correct) = \(3\times0.3\times0.7\times0.7 = 0.441\)

  • P(at least two correct) = \(0.343 + 0.441 = 0.784\), so the majority vote beats each individual 0.7-accuracy classifier
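
The same arithmetic in R, with a quick Monte Carlo check (the simulation is illustrative and assumes independent classifiers):

p <- 0.7
p_all   <- p^3                            # 0.343
p_two   <- choose(3, 2) * p^2 * (1 - p)   # 0.441
p_major <- p_all + p_two                  # 0.784

# simulate 100,000 majority votes of three independent classifiers
set.seed(1)
votes <- matrix(rbinom(3 * 1e5, 1, p), nrow = 3)
mean(colSums(votes) >= 2)                 # approximately 0.784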

Correlation


  • highly correlated

    • M1: \((1, 1, 1, 1, 1, 1, 1, 0, 0, 0)\)
    • M2: \((1, 1, 1, 1, 0, 1, 1, 0, 1, 0)\)
    • M3: \((1, 1, 1, 1, 1, 1, 0, 0, 0, 1)\)
    • E (majority vote): \((1, 1, 1, 1, 1, 1, 1, 0, 0, 0)\)
    • ensemble accuracy: 0.7

Correlation (cont.)


  • less correlated

    • M1: \((1, 0, 0, 1, 1, 1, 1, 1, 0, 1)\)
    • M2: \((1, 1, 1, 1, 0, 0, 1, 1, 1, 0)\)
    • M3: \((1, 1, 1, 0, 1, 1, 0, 1, 0, 1)\)
    • E (majority vote): \((1, 1, 1, 1, 1, 1, 1, 1, 0, 1)\)
    • ensemble accuracy: 0.9
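
A short sketch that reproduces both ensembles via majority vote (1 = correct prediction):

majority_vote <- function(M) as.integer(colSums(M) >= 2)

# highly correlated learners
M_high <- rbind(c(1,1,1,1,1,1,1,0,0,0),
                c(1,1,1,1,0,1,1,0,1,0),
                c(1,1,1,1,1,1,0,0,0,1))
mean(majority_vote(M_high))  # ensemble accuracy 0.7

# less correlated learners
M_low <- rbind(c(1,0,0,1,1,1,1,1,0,1),
               c(1,1,1,1,0,0,1,1,1,0),
               c(1,1,1,0,1,1,0,1,0,1))
mean(majority_vote(M_low))   # ensemble accuracy 0.9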

Stacking



  • stacking is a general procedure in which a learner is trained to combine the individual learners

  • the individual learners are called first-level learners

  • the model that combines the first-level learners is called the meta-learner (or second-level learner)

SuperLearner

  • let \(Y\) be the vector of responses and \(X\) the matrix of covariates

  • let \(h_1,...,h_L\) be a set of base learners

  • create a matrix \(Z\) whose columns are the cross-validated predictions of the base learners

    • for each validation fold \(v \in \{1,..., V\}\), train learner \(h_l\) on folds \(\{1,..., V\}\setminus \{v\}\) and predict on fold \(v\)
  • estimate a meta-learner \(H\) from \(Z\) and \(Y\)
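
A minimal sketch with the SuperLearner package on simulated data (the data and the choice of base learners are illustrative):

library(SuperLearner)

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1 - X$x2))

# Z is built internally from V-fold cross-validated predictions;
# the meta-learner (non-negative least squares by default) is fit on Z and Y
sl <- SuperLearner(Y = Y, X = X, family = binomial(),
                   SL.library = c("SL.glm", "SL.mean", "SL.rpart"),
                   cvControl = list(V = 5))
sl$coef  # weights assigned to each base learner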

Packages



  • caret

  • SuperLearner

  • caretEnsemble

caret



  • data splitting

  • pre-processing

  • feature selection

  • model tuning using resampling

  • variable importance estimation
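
A minimal caret::train() call exercising the tuning, resampling, and importance pieces above (the model, fold count, and tuneLength are arbitrary choices here):

library(caret)
library(mlbench)
data(Sonar)

set.seed(107)
fit <- train(Class ~ ., data = Sonar,
             method = "rpart",
             tuneLength = 5,
             trControl = trainControl(method = "cv", number = 5))
fit$bestTune   # selected tuning parameter
varImp(fit)    # variable importance estimates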

example with caretEnsemble

Data: Sonar from mlbench:

  • 60 covariates, 208 observations, binary response with classes M and R
##        V1     V2     V3     V4     V5     V6
## 1  0.0200 0.0371 0.0428 0.0207 0.0954 0.0986
## 2  0.0453 0.0523 0.0843 0.0689 0.1183 0.2583
## 3  0.0262 0.0582 0.1099 0.1083 0.0974 0.2280
## 4  0.0100 0.0171 0.0623 0.0205 0.0205 0.0368
## 5  0.0762 0.0666 0.0481 0.0394 0.0590 0.0649
## 6  0.0286 0.0453 0.0277 0.0174 0.0384 0.0990
## 7  0.0317 0.0956 0.1321 0.1408 0.1674 0.1710
## 8  0.0519 0.0548 0.0842 0.0319 0.1158 0.0922
## 9  0.0223 0.0375 0.0484 0.0475 0.0647 0.0591
## 10 0.0164 0.0173 0.0347 0.0070 0.0187 0.0671
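
The code on the next slides references training, testing, and my_control, which are not shown; a plausible setup, modeled on the caretEnsemble vignette (split ratio and resampling settings are assumptions):

library(caret)
library(caretEnsemble)
library(mlbench)
data(Sonar)

set.seed(107)
in_train <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[in_train, ]
testing  <- Sonar[-in_train, ]

my_control <- trainControl(
  method = "boot", number = 25,
  savePredictions = "final",
  classProbs = TRUE,
  index = createResample(training$Class, 25),
  summaryFunction = twoClassSummary
)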

example (cont.)

set.seed(107)
model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  methodList=c("rf", "gbm", "rpart")
)

Correlation between models

##              rf       gbm     rpart
## rf    1.0000000 0.6447882 0.3250142
## gbm   0.6447882 1.0000000 0.1658579
## rpart 0.3250142 0.1658579 1.0000000
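
This matrix is presumably caret's modelCor() applied to the collected resampling results. The slides below also use model_preds, a data frame of held-out class probabilities for each base model; a sketch following the caretEnsemble vignette:

modelCor(resamples(model_list))

# per-model test-set probabilities of class "M"
model_preds <- lapply(model_list, predict, newdata = testing, type = "prob")
model_preds <- lapply(model_preds, function(x) x[, "M"])
model_preds <- data.frame(model_preds)

The "M vs. R" rows that follow are then test-set AUCs, presumably computed via caTools::colAUC(model_preds, testing$Class).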

example (cont.)

glm_ensemble <- caretStack(
  model_list,
  method="glm",
  metric="ROC",
  trControl=trainControl(
    method="boot",
    number=10,
    savePredictions="final",
    classProbs=TRUE,
    summaryFunction=twoClassSummary
  )
)
model_preds$ensemble <- predict(glm_ensemble,
                                newdata=testing, type="prob")
##               rf       gbm     rpart  ensemble
## M vs. R 0.945216 0.8919753 0.6566358 0.9552469

example (cont.)

set.seed(107)
model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  methodList=c("rf", "gbm")
)
glm_ensemble <- caretStack(
  model_list,
  method="glm",
  metric="ROC",
  trControl=trainControl(
    method="boot",
    number=10,
    savePredictions="final",
    classProbs=TRUE,
    summaryFunction=twoClassSummary
  )
)
model_preds$ensemble <- predict(glm_ensemble,
                                newdata=testing, type="prob")
##               rf       gbm  ensemble
## M vs. R 0.945216 0.8919753 0.9429012

example (cont.)

set.seed(107)
model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  methodList=c("rf", "rpart")
)
glm_ensemble <- caretStack(
  model_list,
  method="glm",
  metric="ROC",
  trControl=trainControl(
    method="boot",
    number=10,
    savePredictions="final",
    classProbs=TRUE,
    summaryFunction=twoClassSummary
  )
)
model_preds$ensemble <- predict(glm_ensemble,
                                newdata=testing, type="prob")
##               rf     rpart  ensemble
## M vs. R 0.945216 0.6566358 0.9506173

References

  1. L. Breiman. Stacked regressions. Machine Learning, 24(1):49-64, 1996.
  2. D. H. Wolpert. The supervised learning no-free-lunch theorems. In Soft Computing and Industry, pages 25-42, 2002.
  3. M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6, 2007.
  4. caret package: http://topepo.github.io/caret/index.html
  5. caretEnsemble package: https://cran.r-project.org/web/packages/caretEnsemble/