Feb. 28, 2017

The fundamental problem

What is causal inference?

The typical problem of causal inference can be presented as follows.

Let

  • \(Y\) denote an outcome of interest,
  • \(A\) a binary treatment variable, and
  • \(W\) a vector of observed covariates.

We are interested in learning the effect of \(A\) on \(Y\).

Setup

Frameworks for causal inference:

  • the potential outcomes framework,
  • non-parametric structural equation models (NPSEM), and
  • a few more

Let's focus on NPSEMs

NPSEM

Assume we observe \(n\) copies of \(O=(W,A,Y)\). An NPSEM is a generative model in which the data are assumed to be generated by a set of structural equations

\[ \begin{eqnarray} W & = & f_W(U_W)\\ A & = & f_A(W,U_A)\\ Y & = & f_Y(W,A,U_Y)\\ \end{eqnarray} \]

  • \(U_W\), \(U_A\), and \(U_Y\) are unobserved, exogenous variables
  • \(f_W\), \(f_A\), and \(f_Y\) are unknown, but fixed functions
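As a concrete toy example, we can simulate from an NPSEM; the particular functions \(f_W\), \(f_A\), and \(f_Y\) below are arbitrary hypothetical choices for illustration, not part of the framework:

```r
# Draw n observations O = (W, A, Y) from a hypothetical NPSEM.
set.seed(1)
n <- 1e5
u_w <- runif(n); u_a <- runif(n); u_y <- runif(n)  # exogenous variables
w <- qnorm(u_w)                                    # W = f_W(U_W)
a <- as.numeric(u_a < plogis(0.5 * w))             # A = f_A(W, U_A)
y <- as.numeric(u_y < plogis(-1 + a + 0.8 * w))    # Y = f_Y(W, A, U_Y)
head(data.frame(w, a, y))
```

Note that \(A\) depends on \(W\), so \(W\) is a common cause of treatment and outcome in this toy model.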

Counterfactuals

Causal effects are defined in terms of an intervention on the data generating mechanism. For example, what would be the effect of setting \(A=1\)?

In an NPSEM, variables under an intervention are called counterfactuals. In our example, we have two counterfactuals:

\[ \begin{eqnarray} Y_1 & = & f_Y(W,1,U_Y)\\ Y_0 & = & f_Y(W,0,U_Y) \end{eqnarray} \]

In the potential outcomes framework, \(Y_a:a=0,1\) is the primitive.

Causal effects

Causal effects are often defined as contrasts between counterfactuals. For example:

  • Average treatment effect:

\[E(Y_1) - E(Y_0)\]

  • Risk ratio

\[\frac{E(Y_1)}{E(Y_0)}\]
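In a simulation we can evaluate both contrasts directly, because the counterfactuals are available there; the functional form below is a hypothetical choice for illustration:

```r
# Causal contrasts computed from counterfactuals in a toy NPSEM
# (hypothetical functional forms, chosen only for illustration).
set.seed(1)
n <- 1e5
w <- rnorm(n)
u_y <- runif(n)
y1 <- as.numeric(u_y < plogis(-1 + 1 + 0.8 * w))  # Y_1 = f_Y(W, 1, U_Y)
y0 <- as.numeric(u_y < plogis(-1 + 0 + 0.8 * w))  # Y_0 = f_Y(W, 0, U_Y)
c(ATE = mean(y1) - mean(y0), RR = mean(y1) / mean(y0))
```

Both counterfactuals reuse the same \(U_Y\); only the value plugged in for \(A\) changes.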

The fundamental problem of causal inference is that counterfactuals are not observed. So the question is: can we estimate the above quantities from data on \(O=(W,A,Y)\)? This is the problem of identification.

Identification

Under the assumption that

\[A \perp U_Y \mid W\]

we can show (math omitted, but happy to show it) that

\[E(Y_1) = E_W[E(Y | A=1, W)]\]

(Note: in a randomized trial \(A\) is independent of everything, that is why it is often considered the gold standard.)

Likewise

\[E(Y_0) = E_W[E(Y | A=0, W)]\]
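This identification result can be checked numerically in a simulation where the truth is known. The data-generating model below is hypothetical, and the outcome regression is correctly specified by construction:

```r
# Check E(Y_1) = E_W[ E(Y | A = 1, W) ] in a toy NPSEM with confounding by W.
set.seed(1)
n <- 2e5
w <- rnorm(n)
a <- rbinom(n, 1, plogis(0.5 * w))                # A depends on W
u_y <- runif(n)
y  <- as.numeric(u_y < plogis(-1 + a + 0.8 * w))  # observed outcome
y1 <- as.numeric(u_y < plogis(-1 + 1 + 0.8 * w))  # counterfactual (unobservable)
fit <- glm(y ~ a + w, family = binomial)
mu1 <- predict(fit, newdata = data.frame(a = 1, w = w), type = "response")
# The first two should agree; the naive treated-arm mean is biased upward.
c(truth = mean(y1), identified = mean(mu1), naive = mean(y[a == 1]))
```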

Identification

An NPSEM can be visually represented in a directed acyclic graph (DAG)

The DAG representing this NPSEM satisfies \(A \perp U_Y \mid W\).

This solves the causal inference part of the problem. All that remains is a statistical task: estimating

\[E(Y_1) = E_W[E(Y | A=1, W)]\]

Estimation

Let us first review traditional estimators.

When \(Y\) is binary, a common approach is to look at the coefficient of treatment in a logistic regression:

\[ \DeclareMathOperator{\logit}{logit} \logit P(Y=1\mid A=a, W=w) = \beta_0 + \beta_1 a + \beta_2^\top w \]

A very common mistake is to take \(\exp(\beta_1)\) and say "this is the odds ratio adjusted for covariates."
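Part of the issue is that the odds ratio is non-collapsible: even with a randomized treatment and no confounding, the conditional OR from a logistic regression differs from the marginal OR. A quick simulated illustration (hypothetical model):

```r
# Non-collapsibility of the OR: randomized A, strong covariate effect.
set.seed(1)
n <- 1e5
w <- rnorm(n)
a <- rbinom(n, 1, 0.5)                        # randomized: no confounding
y <- rbinom(n, 1, plogis(-1 + a + 2 * w))
fit <- glm(y ~ a + w, family = binomial)
cond_or <- exp(unname(coef(fit)["a"]))        # conditional OR (truth: e^1)
p1 <- mean(y[a == 1]); p0 <- mean(y[a == 0])
marg_or <- (p1 / (1 - p1)) / (p0 / (1 - p0))  # marginal OR
c(conditional = cond_or, marginal = marg_or)  # marginal is closer to 1
```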

R illustration

We will use the WCGS dataset in the R epitools package. We focus on the effect of smoking on CHD.

library(epitools)
library(dplyr)

data(wcgs)
y <- select(wcgs, chd69)[, 1]                       # CHD indicator
a <- as.numeric(select(wcgs, ncigs0)[, 1] > 0)      # smoker at baseline
w <- select(wcgs, age0, height0, weight0, dibpat0)  # covariates
fit <- glm(y ~ ., data = data.frame(a = a, w),
           family = binomial(link = logit))

R illustration

exp(coef(fit)['a'])
##        a 
## 1.981932
mu1 <- predict(fit, newdata = data.frame(a = 1, w), type = 'response')
mu0 <- predict(fit, newdata = data.frame(a = 0, w), type = 'response')
or <- (mean(mu1) / (1 - mean(mu1))) / (mean(mu0) / (1 - mean(mu0)))
or
## [1] 1.937736

(Note: inspection of the OR equation reveals that we are implicitly imputing the counterfactual outcomes)

Other parameters in logistic regression

We could also compute other contrasts, e.g.,

\[E[E(Y \mid A=1, W)] - E[E(Y\mid A=0, W)]\] by computing \(\hat E(Y \mid A=1, W_i)\) for all subjects \(i\), subtracting \(\hat E(Y \mid A=0, W_i)\), and taking a sample average.

mean(mu1) - mean(mu0)
## [1] 0.0489257

or the RR

mean(mu1) / mean(mu0)
## [1] 1.837045

Other problems with this approach

  • If the model is wrong (which it most certainly is), then \(\beta_1\) has no clear interpretation.
  • Assumes no \(A\times W\) interactions.
  • Such interactions are often very important, as they indicate treatment effect modification.
  • In high dimensions (\(W\) contains many variables), there is absolutely no reason to believe the model is a good enough approximation.
  • This can cause (potentially large) amounts of bias.

Solution

The machine and statistical learning literature contains a wealth of methods for estimating conditional probabilities and expectations. A few examples:

  • Regression trees (random forests, xgboost, cart, etc.)
  • Support vector machines
  • Multivariate adaptive regression splines
  • Lasso and least angle regression (\(L_1\) regularization)
  • Neural networks, etc.

All of these can be used to estimate

\[E(Y \mid A=a, W=w)\text{ for } a = 0,1\]

Solution

Once we train our learning algorithm, we can compute

\[ \begin{eqnarray} &\hat E(Y \mid A=1, W_i)\\ &\hat E(Y \mid A=0, W_i) \end{eqnarray} \]

for all subjects \(i\) in the sample, and the estimate of the ATE is

\[\frac{1}{n}\sum_{i=1}^n\hat E(Y \mid A=1, W_i) - \frac{1}{n}\sum_{i=1}^n\hat E(Y \mid A=0, W_i)\]

R illustration

library(SuperLearner)
lib <- c('SL.glm', 'SL.randomForest', 'SL.bayesglm',
         'SL.earth', 'SL.gam')
fitSL <- SuperLearner(Y = y , X = data.frame(a = a, w),
                      family = binomial(), SL.library = lib)
## ATE
mean(predict(fitSL, newdata = data.frame(a = 1, w))$pred[, 1] -
     predict(fitSL, newdata = data.frame(a = 0, w))$pred[, 1])
## [1] 0.04561105

Problems with the SL estimator

We have no statistical inference available (p-values, confidence intervals, exact or asymptotic distributions).

Propensity score

The propensity score is defined as:

\[P(A=1\mid W=w)\]

  • This is just a conditional probability so it can (should?) be estimated using data-adaptive methods.

  • Logistic regression can be used in an RCT: \[\operatorname{logit} P(A=1\mid W=w) = \beta_0 + \beta_1^\top w\text{ (we know $\beta_1=0$)}\]

Inverse probability weighting

You can do the math to check that

\[E[E(Y\mid A=1, W)] = E\left[\frac{A}{P(A=1\mid W)} Y\right]\]

This teaches us that another possible estimator is the inverse probability weighted estimator

\[\frac{1}{n}\sum_{i=1}^n\frac{A_i}{\hat P(A=1\mid W_i)}Y_i\]

The analogous estimator with weights \((1-A_i)/\{1-\hat P(A=1\mid W_i)\}\) recovers \(E[E(Y\mid A=0, W)]\), so the difference of the two estimates the ATE.

R illustration

fitSL <- SuperLearner(Y = a , X = w,
                      family = binomial(), SL.library = lib)

## Pscore
ps <- predict(fitSL)$pred[, 1]
## ATE
mean(a / ps * y) - mean((1 - a) / (1 - ps) * y)
## [1] 0.04858146

Problems with the IPW estimator

If the propensity score is estimated with machine learning, we have no statistical inference available.

Solution

There are (at least) two estimators that solve this problem:

  • Augmented inverse probability weighting (AIPW)
  • Targeted minimum loss based estimation (TMLE)

These estimators have nice properties (under some conditions):

  • Doubly robust: consistent if either the propensity score or the outcome regression is estimated consistently
  • Efficient (in the sense of achieving the Cramér-Rao lower bound) if both are estimated consistently
  • Normal asymptotic distribution (can obtain approximate p-values and CIs)
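As a rough sketch of AIPW on simulated data (hypothetical model; plain glm fits stand in for the nuisance estimators, and either could be replaced by SuperLearner):

```r
# AIPW estimate of the ATE: plug-in estimator plus an
# inverse-probability-weighted residual correction.
set.seed(1)
n <- 1e5
w <- rnorm(n)
a <- rbinom(n, 1, plogis(0.5 * w))
y <- rbinom(n, 1, plogis(-1 + a + 0.8 * w))
ps  <- predict(glm(a ~ w, family = binomial), type = "response")  # propensity
out <- glm(y ~ a + w, family = binomial)                          # outcome fit
mu1 <- predict(out, newdata = data.frame(a = 1, w = w), type = "response")
mu0 <- predict(out, newdata = data.frame(a = 0, w = w), type = "response")
eif <- (a / ps) * (y - mu1) - ((1 - a) / (1 - ps)) * (y - mu0) + mu1 - mu0
ate <- mean(eif)
se  <- sd(eif) / sqrt(n)  # asymptotic normality gives a Wald-type CI
c(ate = ate, lower = ate - 1.96 * se, upper = ate + 1.96 * se)
```

The estimator is the sample mean of an (estimated) efficient influence function, which is what yields the standard error and confidence interval directly.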

Further reading

  • J. Pearl. "Causal inference in statistics: An overview." Statistics Surveys 3 (2009): 96-146.
  • M. J. van der Laan and S. Rose. Targeted learning: causal inference for observational and experimental data. Springer, 2011.