Ensemble methods

Alex Savenkov

05/23/2017

Intro

Consider a standard supervised learning problem: given training data \(\{(x_i, y_i)\}_{i=1}^n\), learn a predictor \(h(x)\) of the response \(y\). An ensemble combines several such predictors into a single, usually stronger, model.

Ensemble

pros and cons



  • benefits
    • good generalization ability, often better than any single base learner

  • challenges
    • computationally expensive to train and tune

Ensemble learning models



  • boosting

  • bagging

  • random forests

Methods


  • averaging

  • voting

  • stacking

Averaging

  • averaging is the most popular and fundamental combination method for numeric outputs

  • suppose we have a set of learners \(\{h_1, h_2,...,h_N\}\); the combined output \(H(x)\) is then \[ H(x) = \frac{1}{N}\sum_{i=1}^Nh_i(x) \]
  • the weighted average is given by \[ H(x) =\sum_{i=1}^N w_ih_i(x) \quad \text{with} \ w_i \geq 0 \ \text{and} \ \sum_{i=1}^Nw_i = 1 \]
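
A minimal R sketch of both rules; the matrix of predictions is illustrative toy data:

# toy predictions from N = 3 base learners on 5 test points
P <- rbind(h1 = c(2.1, 3.0, 4.2, 5.1, 6.3),
           h2 = c(1.9, 3.2, 3.8, 5.4, 6.0),
           h3 = c(2.3, 2.9, 4.0, 4.9, 6.1))

# simple average: H(x) = (1/N) * sum_i h_i(x)
H_simple <- colMeans(P)

# weighted average with w_i >= 0 and sum(w_i) = 1
w <- c(0.5, 0.3, 0.2)
H_weighted <- colSums(w * P)   # w_i multiplies row i of P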

Voting (example)


  • consider a binary classification problem with 10 samples, all with response 1: \(Y = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)\)

  • three learners: M1, M2, M3

  • each classifier has accuracy 0.7, and we assume their errors are independent

Voting (cont.)



  • P(all three correct) = \(0.7\times0.7\times0.7 = 0.343\)

  • P(exactly two correct) = \(3\times0.3\times0.7\times0.7 = 0.441\)

  • P(at least two correct) = \(0.343 + 0.441 = 0.784\), so the majority vote beats each individual 0.7-accuracy classifier
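
The same arithmetic in R, with a quick Monte Carlo check (the simulation is illustrative and assumes independent classifiers):

p <- 0.7
p_all   <- p^3                            # 0.343
p_two   <- choose(3, 2) * p^2 * (1 - p)   # 0.441
p_major <- p_all + p_two                  # 0.784

# simulate 100,000 majority votes of three independent classifiers
set.seed(1)
votes <- matrix(rbinom(3 * 1e5, 1, p), nrow = 3)
mean(colSums(votes) >= 2)                 # approximately 0.784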

Correlation


  • highly correlated

    • M1: \((1, 1, 1, 1, 1, 1, 1, 0, 0, 0)\)
    • M2: \((1, 1, 1, 1, 0, 1, 1, 0, 1, 0)\)
    • M3: \((1, 1, 1, 1, 1, 1, 0, 0, 0, 1)\)
    • E (majority vote): \((1, 1, 1, 1, 1, 1, 1, 0, 0, 0)\)
    • ensemble accuracy: 0.7

Correlation (cont.)


  • less correlated

    • M1: \((1, 0, 0, 1, 1, 1, 1, 1, 0, 1)\)
    • M2: \((1, 1, 1, 1, 0, 0, 1, 1, 1, 0)\)
    • M3: \((1, 1, 1, 0, 1, 1, 0, 1, 0, 1)\)
    • E (majority vote): \((1, 1, 1, 1, 1, 1, 1, 1, 0, 1)\)
    • ensemble accuracy: 0.9
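
A short sketch that reproduces both ensembles via majority vote (1 = correct prediction):

majority_vote <- function(M) as.integer(colSums(M) >= 2)

# highly correlated learners
M_high <- rbind(c(1,1,1,1,1,1,1,0,0,0),
                c(1,1,1,1,0,1,1,0,1,0),
                c(1,1,1,1,1,1,0,0,0,1))
mean(majority_vote(M_high))  # ensemble accuracy 0.7

# less correlated learners
M_low <- rbind(c(1,0,0,1,1,1,1,1,0,1),
               c(1,1,1,1,0,0,1,1,1,0),
               c(1,1,1,0,1,1,0,1,0,1))
mean(majority_vote(M_low))   # ensemble accuracy 0.9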

Stacking



  • stacking is a general procedure in which a learner is trained to combine the individual learners

  • the individual learners are called first-level learners

  • the model that combines the first-level learners is called the meta-learner (or second-level learner)

SuperLearner

  • let \(Y\) be the vector of responses and \(X\) the matrix of covariates

  • let \(h_1,...,h_L\) be a set of base learners

  • create a matrix \(Z\) whose columns are the cross-validated predictions of the base learners

    • for each validation fold \(v \in \{1,..., V\}\), train learner \(h_l\) on folds \(\{1,..., V\}\setminus \{v\}\) and predict on fold \(v\)
  • estimate a meta-learner \(H\) from \(Z\) and \(Y\)
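
A minimal sketch with the SuperLearner package on simulated data (the data and the choice of base learners are illustrative):

library(SuperLearner)

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1 - X$x2))

# Z is built internally from V-fold cross-validated predictions;
# the meta-learner (non-negative least squares by default) is fit on Z and Y
sl <- SuperLearner(Y = Y, X = X, family = binomial(),
                   SL.library = c("SL.glm", "SL.mean", "SL.rpart"),
                   cvControl = list(V = 5))
sl$coef  # weights assigned to each base learner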

Packages



  • caret

  • SuperLearner

  • caretEnsemble

caret



  • data splitting

  • pre-processing

  • feature selection

  • model tuning using resampling

  • variable importance estimation
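
A minimal caret::train() call exercising the tuning, resampling, and importance pieces above (the model, fold count, and tuneLength are arbitrary choices here):

library(caret)
library(mlbench)
data(Sonar)

set.seed(107)
fit <- train(Class ~ ., data = Sonar,
             method = "rpart",
             tuneLength = 5,
             trControl = trainControl(method = "cv", number = 5))
fit$bestTune   # selected tuning parameter
varImp(fit)    # variable importance estimates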

example with caretEnsemble

Data: Sonar from mlbench:

  • 60 covariates, 208 observations, binary response with classes M and R
##        V1     V2     V3     V4     V5     V6
## 1  0.0200 0.0371 0.0428 0.0207 0.0954 0.0986
## 2  0.0453 0.0523 0.0843 0.0689 0.1183 0.2583
## 3  0.0262 0.0582 0.1099 0.1083 0.0974 0.2280
## 4  0.0100 0.0171 0.0623 0.0205 0.0205 0.0368
## 5  0.0762 0.0666 0.0481 0.0394 0.0590 0.0649
## 6  0.0286 0.0453 0.0277 0.0174 0.0384 0.0990
## 7  0.0317 0.0956 0.1321 0.1408 0.1674 0.1710
## 8  0.0519 0.0548 0.0842 0.0319 0.1158 0.0922
## 9  0.0223 0.0375 0.0484 0.0475 0.0647 0.0591
## 10 0.0164 0.0173 0.0347 0.0070 0.0187 0.0671
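
The code on the next slides references training, testing, and my_control, which are not shown; a plausible setup, modeled on the caretEnsemble vignette (split ratio and resampling settings are assumptions):

library(caret)
library(caretEnsemble)
library(mlbench)
data(Sonar)

set.seed(107)
in_train <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[in_train, ]
testing  <- Sonar[-in_train, ]

my_control <- trainControl(
  method = "boot", number = 25,
  savePredictions = "final",
  classProbs = TRUE,
  index = createResample(training$Class, 25),
  summaryFunction = twoClassSummary
)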

example (cont.)

set.seed(107)
model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  methodList=c("rf", "gbm", "rpart")
)

Correlation between models

##              rf       gbm     rpart
## rf    1.0000000 0.6447882 0.3250142
## gbm   0.6447882 1.0000000 0.1658579
## rpart 0.3250142 0.1658579 1.0000000
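
This matrix is presumably caret's modelCor() applied to the collected resampling results. The slides below also use model_preds, a data frame of held-out class probabilities for each base model; a sketch following the caretEnsemble vignette:

modelCor(resamples(model_list))

# per-model test-set probabilities of class "M"
model_preds <- lapply(model_list, predict, newdata = testing, type = "prob")
model_preds <- lapply(model_preds, function(x) x[, "M"])
model_preds <- data.frame(model_preds)

The "M vs. R" rows that follow are then test-set AUCs, presumably computed via caTools::colAUC(model_preds, testing$Class).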

example (cont.)

glm_ensemble <- caretStack(
  model_list,
  method="glm",
  metric="ROC",
  trControl=trainControl(
    method="boot",
    number=10,
    savePredictions="final",
    classProbs=TRUE,
    summaryFunction=twoClassSummary
  )
)
model_preds$ensemble <- predict(glm_ensemble,
                                newdata=testing, type="prob")
##               rf       gbm     rpart  ensemble
## M vs. R 0.945216 0.8919753 0.6566358 0.9552469

example (cont.)

set.seed(107)
model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  methodList=c("rf", "gbm")
)
glm_ensemble <- caretStack(
  model_list,
  method="glm",
  metric="ROC",
  trControl=trainControl(
    method="boot",
    number=10,
    savePredictions="final",
    classProbs=TRUE,
    summaryFunction=twoClassSummary
  )
)
model_preds$ensemble <- predict(glm_ensemble,
                                newdata=testing, type="prob")
##               rf       gbm  ensemble
## M vs. R 0.945216 0.8919753 0.9429012

example (cont.)

set.seed(107)
model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  methodList=c("rf", "rpart")
)
glm_ensemble <- caretStack(
  model_list,
  method="glm",
  metric="ROC",
  trControl=trainControl(
    method="boot",
    number=10,
    savePredictions="final",
    classProbs=TRUE,
    summaryFunction=twoClassSummary
  )
)
model_preds$ensemble <- predict(glm_ensemble,
                                newdata=testing, type="prob")
##               rf     rpart  ensemble
## M vs. R 0.945216 0.6566358 0.9506173

References

  1. L. Breiman. Stacked regressions. Machine Learning, 24(1):49-64, 1996.
  2. D. H. Wolpert. The supervised learning no-free-lunch theorems. In Soft Computing and Industry, pages 25-42, 2002.
  3. M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6, 2007.
  4. caret package: http://topepo.github.io/caret/index.html
  5. caretEnsemble package: https://cran.r-project.org/web/packages/caretEnsemble/