Feb. 28, 2017
The typical problem of causal inference can be presented as follows.
Denote by \(W\) a vector of baseline covariates, \(A\) a binary treatment or exposure, and \(Y\) an outcome of interest.
We are interested in learning the effect of \(A\) on \(Y\).
Frameworks for causal inference include non-parametric structural equation models (NPSEMs) and the potential outcomes framework. Let's focus on NPSEMs.
Assume we observe \(n\) copies of \(O=(W,A,Y)\). An NPSEM is a generative model in which the data are assumed to be generated by a set of equations:
\[ \begin{eqnarray} W & = & f_W(U_W)\\ A & = & f_A(W,U_A)\\ Y & = & f_Y(W,A,U_Y)\\ \end{eqnarray} \]
Causal effects are defined in terms of an intervention on the data generating mechanism. For example, what would be the effect of setting \(A=1\)?
In an NPSEM, variables under an intervention are called counterfactuals. In our example, we have two counterfactuals:
\[ \begin{eqnarray} Y_1 & = & f_Y(W,1,U_Y)\\ Y_0 & = & f_Y(W,0,U_Y) \end{eqnarray} \]
In the potential outcomes framework, \(Y_a:a=0,1\) is the primitive.
Causal effects are often defined as contrasts between counterfactuals. For example, the average treatment effect (ATE)
\[E(Y_1) - E(Y_0)\]
or the causal relative risk (RR)
\[\frac{E(Y_1)}{E(Y_0)}\]
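To make this concrete, here is a small simulated NPSEM (entirely hypothetical; the functional forms and coefficients below are made up for illustration and are not part of the WCGS analysis later on). Because we generate the errors ourselves, we can also generate the counterfactuals and compute the "true" contrasts:

set.seed(1)
n <- 1e5
U_W <- runif(n); U_A <- runif(n); U_Y <- runif(n)  # exogenous errors
W <- as.numeric(U_W < 0.5)                         # W = f_W(U_W)
A <- as.numeric(U_A < plogis(-0.5 + W))            # A = f_A(W, U_A)
f_Y <- function(w, a, u) as.numeric(u < plogis(-1 + a + 0.5 * w))
Y  <- f_Y(W, A, U_Y)                               # Y   = f_Y(W, A, U_Y)
Y1 <- f_Y(W, 1, U_Y)                               # Y_1 = f_Y(W, 1, U_Y)
Y0 <- f_Y(W, 0, U_Y)                               # Y_0 = f_Y(W, 0, U_Y)
mean(Y1) - mean(Y0)                                # "true" ATE
mean(Y1) / mean(Y0)                                # "true" causal RR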
The fundamental problem of causal inference is that counterfactuals are not observed. So the question is: can we estimate the above quantities from data on \(O=(W,A,Y)\)? This is the problem of identification.
Under the assumption that
\[A \perp U_Y \mid W\]
we can show (a sketch of the argument is given below) that
\[E(Y_1) = E_W[E(Y | A=1, W)]\]
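A sketch of that argument (also using that \(P(A=1\mid W)>0\)):
\[ \begin{eqnarray} E(Y_1) & = & E[f_Y(W,1,U_Y)]\\ & = & E_W\{E[f_Y(W,1,U_Y)\mid W]\}\\ & = & E_W\{E[f_Y(W,1,U_Y)\mid A=1, W]\} \quad (\text{since } A \perp U_Y \mid W)\\ & = & E_W[E(Y\mid A=1, W)] \quad (\text{since } Y=f_Y(W,A,U_Y)\text{ when }A=1) \end{eqnarray} \]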
(Note: in a randomized trial \(A\) is independent of everything else, which is why the randomized trial is often considered the gold standard.)
Likewise
\[E(Y_0) = E_W[E(Y | A=0, W)]\]
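Continuing the hypothetical simulation above, we can check this identification result numerically; since the simulated \(W\) is binary, \(E(Y \mid A=1, W=w)\) is just a stratum-specific mean:

## check E_W[E(Y | A = 1, W)] against E(Y_1) in the simulated NPSEM
m1 <- tapply(Y[A == 1], W[A == 1], mean)  # E(Y | A = 1, W = w) for w = 0, 1
mean(m1[as.character(W)])                 # E_W[E(Y | A = 1, W)]
mean(Y1)                                  # E(Y_1) from the counterfactuals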
An NPSEM can be represented visually as a directed acyclic graph (DAG).
The above DAGs satisfy \(A \perp U_Y \mid W\).
This concludes the causal inference part of the problem. All we have to do now is statistics to estimate
\[E(Y_1) = E_W[E(Y | A=1, W)]\]
When \(Y\) is binary, a common approach is to look at the coefficient of treatment in a logistic regression:
\[ \DeclareMathOperator{\logit}{logit} \logit P(Y=1\mid A=a, W=w) = \beta_0 + \beta_1a + \beta_2w \]
A very common mistake is to take \(\exp(\beta_1)\) and say "this is the odds ratio adjusted for covariates." But \(\exp(\beta_1)\) is a conditional odds ratio, which in general differs from the marginal (covariate-standardized) odds ratio because the OR is non-collapsible.
We will use the WCGS dataset in the R epitools package, focusing on the effect of smoking on coronary heart disease (CHD).
library(epitools)
library(dplyr)
library(tidyr)

data(wcgs)
y <- select(wcgs, chd69)[, 1]                       # outcome: CHD event
a <- as.numeric(select(wcgs, ncigs0)[, 1] > 0)      # exposure: any smoking at baseline
w <- select(wcgs, age0, height0, weight0, dibpat0)  # baseline covariates
fit <- glm(y ~ ., data = data.frame(a = a, w), family = binomial(link = logit))
exp(coef(fit)['a'])
##        a 
## 1.981932
mu1 <- predict(fit, newdata = data.frame(a = 1, w), type = 'response')  # predicted P(Y = 1 | A = 1, W)
mu0 <- predict(fit, newdata = data.frame(a = 0, w), type = 'response')  # predicted P(Y = 1 | A = 0, W)
or <- (mean(mu1) / (1 - mean(mu1))) / (mean(mu0) / (1 - mean(mu0)))     # marginal (standardized) OR
or
## [1] 1.937736
(Note: inspection of the OR equation reveals that we are implicitly imputing the counterfactual outcomes)
We could also compute other contrasts, e.g.,
\[E[E(Y \mid A=1, W)] - E[E(Y\mid A=0, W)]\] by computing \(E(Y \mid A=1, W)\) for all subjects, subtracting \(E(Y \mid A=0, W)\), and taking a sample average.
mean(mu1) - mean(mu0)
## [1] 0.0489257
or the RR
mean(mu1) / mean(mu0)
## [1] 1.837045
The machine and statistical learning literature contains a wealth of methods to estimate conditional probabilities and expectations. A few examples: generalized linear models, random forests, Bayesian GLMs, multivariate adaptive regression splines, and generalized additive models (the learners used in the Super Learner fit below).
All of these can be used to estimate
\[E(Y \mid A=a, W=w)\text{ for } a = 0,1\]
Once we train our learning algorithm, we can compute
\[ \begin{eqnarray} &\hat E(Y \mid A=1, W_i)\\ &\hat E(Y \mid A=0, W_i) \end{eqnarray} \]
for all subjects \(i\) in the sample, and the estimate of the ATE is
\[\frac{1}{n}\sum_{i=1}^n\hat E(Y \mid A=1, W_i) - \frac{1}{n}\sum_{i=1}^n\hat E(Y \mid A=0, W_i)\]
library(SuperLearner)

lib <- c('SL.glm', 'SL.randomForest', 'SL.bayesglm', 'SL.earth', 'SL.gam')
fitSL <- SuperLearner(Y = y, X = data.frame(a = a, w), family = binomial(),
                      SL.library = lib)
## ATE
mean(predict(fitSL, newdata = data.frame(a = 1, w))$pred[, 1] -
       predict(fitSL, newdata = data.frame(a = 0, w))$pred[, 1])
## [1] 0.04561105
With this machine-learning plug-in estimator we have no statistical inference available (p-values, confidence intervals, exact or asymptotic distributions).
The propensity score is defined as:
\[P(A=1\mid W=w)\]
This is just a conditional probability so it can (should?) be estimated using data-adaptive methods.
Logistic regression can be used in an RCT: \[\logit P(A=1\mid W=w) = \beta_0 + \beta_1w\text{ (we know $\beta_1=0$)}\]
You can do the math to check that
\[E[E(Y\mid A=1, W)] = E\left[\frac{A}{P(A=1\mid W)} Y\right]\]
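A sketch of that check (assuming \(P(A=1\mid W) > 0\)):
\[ \begin{eqnarray} E\left[\frac{A}{P(A=1\mid W)}Y\right] & = & E_W\left[\frac{E(AY\mid W)}{P(A=1\mid W)}\right]\\ & = & E_W\left[\frac{P(A=1\mid W)\,E(Y\mid A=1, W)}{P(A=1\mid W)}\right]\\ & = & E_W[E(Y\mid A=1, W)] \end{eqnarray} \]
where the second equality uses \(E(AY\mid W) = P(A=1\mid W)\,E(Y\mid A=1, W)\). This identity is purely statistical; the causal interpretation still comes from the identification assumption above.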
This teaches us that another possible estimator is the inverse probability weighted estimator
\[\frac{1}{n}\sum_{i=1}^n\frac{A_i}{\hat P(A=1\mid W_i)}Y_i\]
fitSL <- SuperLearner(Y = a, X = w, family = binomial(), SL.library = lib)
## Pscore
ps <- predict(fitSL)$pred[, 1]
## ATE
mean(a / ps * y) - mean((1 - a) / (1 - ps) * y)
## [1] 0.04858146
If the propensity score is estimated with machine learning, we again have no statistical inference available.
There are (at least) two estimators that solve this problem: the augmented inverse probability weighted (AIPW, or double-robust) estimator and the targeted maximum likelihood estimator (TMLE).
These estimators have nice properties (under some conditions): they are doubly robust, asymptotically linear, and efficient, so that confidence intervals and p-values are available even when \(E(Y\mid A, W)\) and \(P(A=1\mid W)\) are estimated with machine learning.
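As an illustration only, here is a minimal sketch of the AIPW estimator of the ATE for the WCGS example, with an influence-curve-based confidence interval. It reuses mu1 and mu0 (the outcome regression predictions, here from the earlier glm) and ps (the Super Learner propensity score) from above; in practice both nuisance parameters would be estimated data-adaptively, and the confidence interval is valid only under regularity conditions.

dr1 <- a / ps * (y - mu1) + mu1              # augmented term for E(Y_1)
dr0 <- (1 - a) / (1 - ps) * (y - mu0) + mu0  # augmented term for E(Y_0)
ate <- mean(dr1) - mean(dr0)                 # AIPW estimate of the ATE
## influence-curve-based standard error and 95% CI (valid under conditions)
ic <- dr1 - dr0 - ate
se <- sqrt(var(ic) / length(y))
c(ate = ate, lower = ate - 1.96 * se, upper = ate + 1.96 * se)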