7. Estimation
https://www.bradyneal.com/causal-inference-course#course-textbook
Once we identify some causal estimand by reducing it to a statistical estimand, we still have more work to do. We need to get a corresponding estimate. In this chapter, we'll cover a variety of estimators that we can use to do this.
7.1. Preliminaries
We denote the individual treatment effect (ITE) with τ_i and the average treatment effect (ATE) with τ:

τ_i ≜ Y_i(1) − Y_i(0)    (7.1)

τ ≜ E[Y_i(1) − Y_i(0)]    (7.2)
ITEs are the most specific kind of causal effect, but they are hard to estimate without strong assumptions. Still, we often want to estimate causal effects that are a bit more individualized than the ATE.
For example, say we've observed an individual's covariates x; we might like to use those to estimate a more specific effect for that individual (and anyone else with covariates x). This brings us to the conditional average treatment effect (CATE) τ(x):

τ(x) ≜ E[Y(1) − Y(0) | X = x]    (7.3)
The X that is conditioned on does not need to consist of all of the observed covariates, but this is often the case when people refer to CATEs. When X is all of the observed covariates, we call the resulting effects individualized average treatment effects (IATEs).
ITEs and "CATEs" (what we call IATEs) are sometimes conflated, but they are not the same. For example, two individuals could have the same covariates, but their potential outcomes could be different because of other unobserved differences between these individuals. If we encompass everything about an individual that is relevant to their potential outcomes in the vector I, then ITEs and "CATEs" are the same if X = I. In a causal graph, I corresponds to all of the exogenous variables in the magnified graph that have causal association flowing to Y.
Unconfoundedness
Throughout this chapter, whenever we are estimating an ATE, we will assume that W is a sufficient adjustment set, and whenever we are estimating a CATE, we will assume that W ∪ X is a sufficient adjustment set.
In other words, for ATE estimation, we assume that W satisfies the backdoor criterion (Definition 4.1); equivalently, we assume that we have conditional exchangeability given W (Assumption 2.2).
And similarly for CATE estimation, assuming W ∪ X is a sufficient adjustment set means that we are assuming that W ∪ X satisfies the backdoor criterion / gives us unconfoundedness.
This unconfoundedness assumption gives us parametric identification and allows us to focus on estimation in this chapter.
(By "parametric identification" we mean identification under the parametric assumptions of our statistical models.)
7.2. Conditional Outcome Modeling (COM)
We are interested in estimating the ATE τ. We'll start by recalling the adjustment formula (Theorem 2.1), which can be derived as a corollary of the backdoor adjustment (Theorem 4.2):

τ ≜ E[Y(1) − Y(0)] = E_W[E[Y | T = 1, W] − E[Y | T = 0, W]]    (7.4)
On the left-hand side of Equation 7.4, we have a causal estimand, and on the right-hand side, we have a statistical estimand (i.e. we have identified this causal quantity).
Then, the next step in the Identification-Estimation Flowchart is to get an estimate of this (statistical) estimand.
The most straightforward thing to do is to just fit a statistical model (machine learning model) to the conditional expectation E[Y | T, W] and then approximate E_W with an empirical mean over the n data points ((1/n) Σ_i).
To make this more clear, we introduce μ in place of this conditional expectation:

μ(t, w) ≜ E[Y | T = t, W = w]    (7.5)
Then, we can fit a statistical model to μ. We will denote that these fitted models are approximations of μ with a hat: μ^. We will refer to a model μ^ as a conditional outcome model.
Now, we can cleanly write the model-assisted estimator (for the ATE) that we've described:

τ^ = (1/n) Σ_i [μ^(1, w_i) − μ^(0, w_i)]    (7.6)
We will refer to estimators that take this form as conditional outcome model (COM) estimators.
Because minimizing the mean-squared error (MSE) of predicting Y from (T, W) pairs is equivalent to modeling this conditional expectation, there are many different models we can use for μ^ in Equation 7.6 to get a COM estimator (see, e.g., scikit-learn).
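To make this concrete, here is a minimal sketch of a COM estimator in Python using scikit-learn. The gradient boosting model and the array layout (NumPy arrays W, T, Y) are illustrative choices, not prescribed by the text:

```python
# A minimal sketch of a COM estimator (Equation 7.6).
# Assumptions: W is an (n, d) array of adjustment-set covariates,
# T is an (n,) binary treatment array, and Y is an (n,) outcome array.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def com_ate(W, T, Y):
    # Fit a single conditional outcome model mu_hat(t, w) = E[Y | t, w]
    # by minimizing MSE on (T, W) -> Y.
    mu_hat = GradientBoostingRegressor()
    mu_hat.fit(np.column_stack([T, W]), Y)

    # Empirical mean of mu_hat(1, w_i) - mu_hat(0, w_i) over all n units.
    W1 = np.column_stack([np.ones(len(W)), W])
    W0 = np.column_stack([np.zeros(len(W)), W])
    return np.mean(mu_hat.predict(W1) - mu_hat.predict(W0))
```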
For CATE estimation, because we assumed that W ∪ X is a sufficient adjustment set, rather than just W, we must additionally add X as an input to our conditional outcome model. More precisely, for CATE estimation, we define μ as follows:

μ(t, w, x) ≜ E[Y | T = t, W = w, X = x]    (7.7)
Then, we train a statistical model μ^ to predict Y from (T, W, X). This gives us the following COM estimator for the CATE τ(x):

τ^(x) = (1/n_x) Σ_{i : x_i = x} [μ^(1, w_i, x) − μ^(0, w_i, x)]    (7.8)
where n_x is the number of data points that have x_i = x. When we are interested in the IATE (the CATE where X is all of the observed covariates), n_x is often 1, which simplifies our estimator to a simple difference between predictions:

τ^(x_i) = μ^(1, w_i, x_i) − μ^(0, w_i, x_i)    (7.9)
Even though IATEs are different from ITEs (τ(x_i) ≠ τ_i), if we really want to give estimates for ITEs, it is relatively common to take this estimator as our estimator of the ITE τ_i as well:

τ^_i = μ^(1, w_i, x_i) − μ^(0, w_i, x_i)    (7.10)

However, this will likely be unreliable due to severe positivity violations.
7.3. Grouped Conditional Outcome Modeling (GCOM)
In order to get the estimate in Equation 7.6, we must train a model that predicts Y from (T, W). However, T is often one-dimensional, whereas W can be high-dimensional, and the input for t is the only thing that changes between the two terms inside the sum μ^(1, w_i) − μ^(0, w_i). Imagine concatenating T to a 100-dimensional vector W and then feeding that through a neural network that we're using for μ^. It seems reasonable that the network could ignore T while focusing on the other 100 dimensions of its input. This would result in an ATE estimate of zero. And, indeed, there is some evidence of COM estimators being biased toward zero.
So how can we ensure that the model μ^ doesn't ignore T? We can just train two different models μ1^(w) and μ0^(w) that model μ1(w) and μ0(w), respectively, where

μ_t(w) ≜ E[Y | T = t, W = w]    (7.11)
Using two separate models for the values of treatment ensures that T cannot be ignored. To train these statistical models, we first group the data into a group where T = 1 and a group where T = 0. Then, we train μ1^(w) to predict Y from W in the group where T = 1. And, similarly, we train μ0^(w) to predict Y from W in the group where T = 0. This gives us a natural derivative of COM estimators (Equation 7.6), grouped conditional outcome model (GCOM) estimators:

τ^ = (1/n) Σ_i [μ1^(w_i) − μ0^(w_i)]    (7.12)
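Here is a minimal sketch of a GCOM estimator, with the same illustrative data layout and model choice as the COM sketch above:

```python
# A minimal sketch of a GCOM estimator (Equation 7.12).
# Assumptions: same array layout as before (W, T, Y as NumPy arrays).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def gcom_ate(W, T, Y):
    # Fit one outcome model per treatment group so that T cannot be ignored.
    mu1_hat = GradientBoostingRegressor().fit(W[T == 1], Y[T == 1])
    mu0_hat = GradientBoostingRegressor().fit(W[T == 0], Y[T == 0])
    # Average the difference in predictions over all n units.
    return np.mean(mu1_hat.predict(W) - mu0_hat.predict(W))
```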
While GCOM estimation seems to fix the problem that COM estimation can have regarding bias toward zero treatment effect, it does have an important downside. In COM estimation, we were able to make use of all the data when we estimate the single model μ^. However, in grouped conditional outcome model estimation, we only use the T = 1 group to estimate μ1^, and we only use the T = 0 group to estimate μ0^. Importantly, we are missing out on making the most of our data by not using all of the data to estimate μ1^ and all of the data to estimate μ0^.
7.4. Increasing Data Efficiency
In this section, we'll cover two ways to address the problem of data efficiency that we mentioned is present in GCOM estimation: TARNet and X-Learner.
7.4.1. TARNet
Consider that we're using neural networks for our statistical models; starting with that, we'll contrast vanilla COM estimation, GCOM estimation, and TARNet. In vanilla COM estimation, the neural network is used to predict Y from (T, W). This has the problem of potentially yielding ATE estimates that are biased toward zero, as the network might ignore the scalar T, especially when W is high-dimensional.
We ensure that T can't be ignored in GCOM estimation by using two separate neural networks for the two treatment groups. However, this is inefficient as we only use the treatment group data for training one network and the control group data for training the other network.
We can achieve a middle ground between vanilla COM estimation and GCOM estimation using TARNet. With TARNet, we use a single network that takes only W as input but then branches off into two separate heads (sub-networks) for each treatment group. We then use this model for μ(t, w) to get a COM estimator. This has the advantage of learning a treatment-agnostic representation (TAR) of W using all of the data while still forcing the model to not ignore T by branching into two heads for the different values of T.
In other words, TARNet uses the knowledge we have about T (as a uniquely important variable) in its architecture. Still, the sub-networks for each of these heads are only trained with the data for the corresponding treatment group, rather than all of the data.
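As a concrete illustration, here is a minimal architectural sketch of TARNet in PyTorch. This is not the authors' reference implementation; the layer sizes are arbitrary. Training minimizes MSE, and the routing in the forward pass means each head only receives gradients from its own treatment group's data:

```python
# A minimal TARNet sketch: a shared representation of W feeds two heads,
# one per treatment value, so T cannot be ignored.
import torch
import torch.nn as nn

class TARNet(nn.Module):
    def __init__(self, w_dim, h_dim=64):
        super().__init__()
        # Treatment-agnostic representation, trained on all of the data.
        self.rep = nn.Sequential(nn.Linear(w_dim, h_dim), nn.ReLU())
        # Separate heads for T = 0 and T = 1; each head is effectively
        # trained only on its treatment group's data.
        self.head0 = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 1))
        self.head1 = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 1))

    def forward(self, w, t):
        z = self.rep(w)
        y0, y1 = self.head0(z), self.head1(z)
        # Route each unit through the head matching its treatment value.
        return torch.where(t.bool().unsqueeze(-1), y1, y0)
```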
7.4.2. X-Learner
We just saw that one way to increase data efficiency relative to GCOM estimation is to use TARNet, a COM estimator that shares some qualities with GCOM estimators. However, TARNet still doesn't use all of the data for the full model (neural network).
In this section, we will start with GCOM estimation and build on it to create a class of estimators that use all of the data for both models that are part of the estimators. An estimator in this class is known as an X-learner. Unlike TARNet, X-learners are neither COM estimators nor GCOM estimators.
There are three steps to X-learning, and the first step is the exact same as what's used in GCOM estimation: estimate μ1^(x) using the treatment group data and estimate μ0^(x) using the control group data. As before, this can be done with any models that minimize MSE. For simplicity, we'll be considering IATEs (X is all of the observed variables) where X satisfies the backdoor criterion (X contains W and no descendants of T).
The second step is the most important part, as it is both where we end up using all of the data for both models and where the "X" comes from. We specify τ^1,i for the treatment group ITE estimates and τ^0,i for the control group ITE estimates:

τ^1,i ≜ Y_i(1) − μ0^(x_i)    (7.13)

τ^0,i ≜ μ1^(x_i) − Y_i(0)    (7.14)
Here, τ^1,i is estimated using the treatment group outcomes and the imputed counterfactual that we get from μ0^. Similarly, τ^0,i is estimated using the control group outcomes and the imputed counterfactual that we get from μ1^.
If you draw a line between the observed potential outcomes and a line between the imputed potential outcomes, you can see the "X" shape. Importantly, this "X" tells us that each treatment group ITE estimate τ^1,i uses both treatment group data (its observed potential outcome under treatment), and control group data (in μ0^). Similarly, τ^0,i is estimated with data from both treatment groups.
However, each ITE estimate only uses a single data point from its corresponding treatment group. We can fix this by fitting a model τ1^(x) to predict τ^1,i from the corresponding treatment group x_i's. This gives us a model τ1^(x) that was fit using all of the data (treatment group data just now and control group data when μ0^ was fit in step 1). Similarly, we can fit a model τ0^(x) to predict τ^0,i from the corresponding control group x_i's.
The output of step 2 is two different estimators for the IATE: τ1^(x) and τ0^(x).
Finally, in step 3, we combine τ1^(x) and τ0^(x) together to get our IATE estimator:

τ^(x) = g(x) τ0^(x) + (1 − g(x)) τ1^(x)    (7.17)
where g(x) is some weighting function that produces values between 0 and 1. Kunzel et al. report that an estimate of the propensity score works well for g(x), but that choosing the constant function 0 or 1 can also make sense if the treatment groups are very different sizes. They also note that choosing g(x) to minimize the variance of τ^(x) could be attractive.
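Putting the three steps together, here is a minimal sketch in Python (scikit-learn). The model choices are illustrative, and g(x) is taken to be an estimated propensity score, per the suggestion above:

```python
# A minimal X-learner sketch. Assumptions: X is an (n, d) array of all
# observed covariates (a sufficient adjustment set), T is binary, Y is (n,).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def x_learner_cate(X, T, Y):
    # Step 1: group-specific outcome models, exactly as in GCOM estimation.
    mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])
    mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])

    # Step 2: impute each group's counterfactuals with the *other* group's
    # model (the "X"), then fit models to these imputed ITE estimates.
    d1 = Y[T == 1] - mu0.predict(X[T == 1])   # tau^1,i (Equation 7.13)
    d0 = mu1.predict(X[T == 0]) - Y[T == 0]   # tau^0,i (Equation 7.14)
    tau1 = GradientBoostingRegressor().fit(X[T == 1], d1)
    tau0 = GradientBoostingRegressor().fit(X[T == 0], d0)

    # Step 3: combine with a weighting function g(x) in [0, 1]; here, an
    # estimated propensity score.
    g = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
    return g * tau0.predict(X) + (1 - g) * tau1.predict(X)
```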
7.5. Propensity Scores
Given that the vector of variables W satisfies the backdoor criterion (or, equivalently, that (Y(1), Y(0)) ⊥⊥ T | W), we might wonder if it is really necessary to condition on that whole vector to isolate causal association, especially when W is high-dimensional. It turns out that it isn't.
If W satisfies unconfoundedness and positivity, then we can actually get away with only conditioning on the scalar P(T = 1 | W). We'll let e(w) denote P(T = 1 | W = w) and refer to e(w) as the propensity score, since it is the propensity for (probability of) receiving treatment given that W is w.
The magic of being able to condition on the scalar e(W) in place of the vector W is due to Rosenbaum and Rubin's propensity score theorem:

Theorem 7.1 (Propensity Score Theorem) Given positivity, unconfoundedness given W implies unconfoundedness given the propensity score e(W). In other words, (Y(1), Y(0)) ⊥⊥ T | W implies (Y(1), Y(0)) ⊥⊥ T | e(W).
We provide a graphical proof here. Consider the graph in Figure 7.3. Because the edge from W to T is a symbol for the mechanism P(T | W) and because the propensity score completely describes that distribution (P(T=1 | W) = e(W)), we can think of the propensity score as a full mediator of the effect of W on T.
This means that we can redraw this graph with e(W) situated between W and T. And in this redrawn graph in Figure 7.4, we can see that e(W) blocks all backdoor paths that W blocks, so e(W) must be a sufficient adjustment set if W is. Therefore, we have a graphical proof of the propensity score theorem using the backdoor adjustment (Theorem 4.2).
Importantly, this theorem means that we can swap in e(W) in place of W wherever we are adjusting for W in a given estimator in this chapter. This seems very useful when W is high-dimensional.
Recall The Positivity-Unconfoundedness Tradeoff: as we condition on more non-collider-bias-inducing variables, we decrease confounding, but this comes at the cost of decreasing overlap because the W in P(T = 1 | W) becomes higher and higher dimensional. The propensity score seems to magically fix that issue, since e(W) remains a scalar even as W grows in dimension.
Unfortunately, we usually don't have access to e(W). Rather, the best we can do is model it. We do this by training a model to predict T from W. For example, logistic regression is very commonly used to do this. And because this model is fit to the high-dimensional W, in some sense, we have just shifted the positivity problem to our model for e(W).
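For example, a minimal sketch with logistic regression in scikit-learn (W and T laid out as in the earlier sketches):

```python
# A minimal propensity score model: predict T from W, then take the
# predicted probability of treatment as e_hat(w_i) for each unit.
from sklearn.linear_model import LogisticRegression

e_hat = LogisticRegression().fit(W, T).predict_proba(W)[:, 1]
```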
7.6. Inverse Probability Weighting (IPW)
What if we could resample the data in a way that makes it so that association is causation? This is the motivation behind creating "pseudo-populations" that are made up of reweighted versions of the observed population. To get to this, let's recall why association is not causation in general.
Association is not causation in the graph in Figure 7.5 because W is a common cause of T and Y. In other words, the mechanism that generates T depends on W, and the mechanism that generates Y depends on W.
Focusing on the mechanism that generates T, we can write this mathematically as P(T | W) ≠ P(T). It turns out that we can reweight the data to get a pseudo-population where P(T | W) = P(T) or P(T | W) equals some constant; the important part is that we make T independent of W. The corresponding graph for such a pseudo-population has no edge from W to T because T does not depend on W; we depict this in Figure 7.6.
It turns out that the propensity score is key to this reweighting. All we have to do is reweight each data point with treatment T and confounders W by its inverse probability of receiving its value of treatment given that it has its value of W.
This is why this technique is called inverse probability weighting (IPW). For individuals that received treatment 1, this weight is 1/e(W), and for individuals that received treatment 0, this weight is 1/(1 − e(W)).
If the treatment were continuous, the weight would be 1 / P(T|W), which happens to also be the reciprocal of the generalization of the propensity score to continuous treatment.
Why does what we described in the above paragraph work? Recall that our goal is to undo confounding by "removing" the edge that goes from W to T (i.e. move from Figure 7.5 to Figure 7.6). And the mechanism that edge describes is P(T | W). By weighting the data points by 1 / P(T | W), we are effectively canceling it out. That's the intuition. Formally, we have the following identification equation:

E[Y(t)] = E[ 1(T = t) Y / P(t | W) ]    (7.18)
where 1(T=t) is an indicator random variable that takes on the value 1 if T=t and 0 otherwise. We provide a proof of Equation 7.18 using the familiar adjustment formula E[Y(t)] = E[E[Y | t, W]] (Theorem 2.1) in Appendix A.3.
Assuming binary treatment, the following identification equation for the ATE follows from Equation 7.18:

τ = E[ 1(T = 1) Y / e(W) − 1(T = 0) Y / (1 − e(W)) ]    (7.19)
Now that we have a statistical estimand in the form of IPW, we can get an IPW estimator. Replacing expectations by empirical means and e(W) by a propensity score model e^(W), we get the following equivalent formulations of the basic IPW estimator for the ATE:

τ^ = (1/n) Σ_{i=1}^n ( 1(t_i = 1) y_i / e^(w_i) − 1(t_i = 0) y_i / (1 − e^(w_i)) )    (7.20)

τ^ = (1/n) ( Σ_{i : t_i = 1} y_i / e^(w_i) − Σ_{i : t_i = 0} y_i / (1 − e^(w_i)) )    (7.21)
where the two sums in Equation 7.21 run over the n1 treatment group units and the n0 control group units, respectively.
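Here is a minimal sketch of the estimator in Equation 7.20, reusing the estimated propensity scores e_hat from the sketch in Section 7.5 (array conventions as before):

```python
# A minimal sketch of the basic IPW estimator for the ATE (Equation 7.20).
import numpy as np

def ipw_ate(T, Y, e_hat):
    # Weight each unit's outcome by the inverse probability of its
    # observed treatment value (T assumed binary in {0, 1}).
    return np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))
```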
Weight Trimming
As you can see in Equations 7.20 and 7.21, if the propensity scores are very close to 0 or 1, the estimates will blow up. In order to prevent this, it is not uncommon to trim the propensity scores that are less than ε up to ε and those that are greater than 1 − ε down to 1 − ε (effectively trimming the weights to be no larger than 1/ε), though this introduces its own problems such as bias.
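As a sketch, with an illustrative choice of ε:

```python
# Weight trimming: clip e_hat away from 0 and 1 so that no weight
# exceeds 1/eps. The value of eps here is purely illustrative, and
# trimming can introduce bias, as noted above.
eps = 0.01
e_hat_trimmed = np.clip(e_hat, eps, 1 - eps)
```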
CATE Estimation
We can extend the ATE estimator in Equation 7.20 to get an IPW estimator for the CATE τ(x) by just restricting to the data points where x_i = x:

τ^(x) = (1/n_x) Σ_{i : x_i = x} ( 1(t_i = 1) y_i / e^(w_i) − 1(t_i = 0) y_i / (1 − e^(w_i)) )    (7.22)
where n_x is the number of data points with xi = x. However, the estimator in Equation 7.22 may quickly run into the problem of using very small amounts of data, leading to high variance. More general CATE estimation with IPW estimators is more complex and outside the scope of this book.
7.7. Doubly Robust Methods
We've seen that we can estimate causal effects by modeling μ(t, w) = E[Y | t, w] or by modeling e(w) = P(T = 1 | w).
What if we modeled both μ(t, w) and e(w)? We can. And estimators that do this are sometimes doubly robust. A doubly robust estimator has the property that it is a consistent estimator of τ if either μ^ is a consistent estimator of μ or e^ is a consistent estimator of e.
(An estimator is consistent if it converges in probability to its estimand as the number of samples n grows).
In other words, only one of μ^ and e^ needs to be well-specified. Additionally, the rate at which a doubly robust estimator converges to τ is the product of the rate at which μ^ converges to μ and the rate at which e^ converges to e.
This makes double robustness very useful when we are using flexible machine learning models in high dimensions because, in this setting, each of our individual models (μ^ and e^) converges more slowly than the ideal rate of n^(−1/2).
However, there is some controversy over how well doubly robust methods work in practice when neither μ^ nor e^ is well-specified. This might change as we get better at using doubly robust estimators with flexible machine learning models. Meanwhile, the estimators that currently seem to do the best all flexibly model μ (unlike pure IPW estimators).
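The chapter doesn't commit to a specific doubly robust estimator, but as an illustration, here is a minimal sketch of one standard choice, AIPW (augmented IPW), combining fitted outcome models and propensity scores like those from the previous sections:

```python
# A minimal AIPW (doubly robust) sketch; this specific form is a standard
# estimator, not one given explicitly in the chapter. mu1_hat and mu0_hat
# are fitted outcome models (as in GCOM estimation) and e_hat holds the
# fitted propensity scores.
import numpy as np

def aipw_ate(W, T, Y, mu1_hat, mu0_hat, e_hat):
    m1, m0 = mu1_hat.predict(W), mu0_hat.predict(W)
    # Outcome-model predictions, corrected by inverse-probability-weighted
    # residuals; consistent if either the mu models or e_hat is consistent.
    y1 = m1 + T * (Y - m1) / e_hat
    y0 = m0 + (1 - T) * (Y - m0) / (1 - e_hat)
    return np.mean(y1 - y0)
```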
7.8. Other Methods
As this chapter is only an introduction to estimation in causal inference, there are some methods that we've entirely left out. We'll briefly describe some of the most popular ones in this section.
Matching
In matching methods, we try to match units in the treatment group with units in the control group and throw away the non-matches to create comparable groups. We can match in raw covariate space, coarsened covariate space, or propensity score space. There are different distance functions for deciding how close two units are. Furthermore, there are different criteria for deciding whether a given distance is close enough to count as a match (one criterion requires an exact match), how many matches each treatment group unit can have, how many matches each control group unit can have, etc.
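As one concrete variant, here is a minimal sketch of 1-nearest-neighbor matching in propensity score space. The setup is illustrative, and note that matching treated units to controls in this way targets the ATE among the treated (the ATT) rather than the ATE:

```python
# A minimal 1-nearest-neighbor propensity score matching sketch.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def psm_att(T, Y, e_hat):
    # For each treated unit, find the control unit with the closest
    # propensity score and take the difference in outcomes.
    nn = NearestNeighbors(n_neighbors=1).fit(e_hat[T == 0].reshape(-1, 1))
    _, idx = nn.kneighbors(e_hat[T == 1].reshape(-1, 1))
    matched_control_y = Y[T == 0][idx.ravel()]
    return np.mean(Y[T == 1] - matched_control_y)
```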
Double Machine Learning
In double machine learning, we fit three models in two stages: two in the first stage and a final model in the second stage.
First stage:
1. Fit a model to predict Y from W to get the predicted Y^.
2. Fit a model to predict T from W to get the predicted T^.
Second stage:
We "partial out" W by looking at Y - Y^ and T - T^. In a sense, we have deconfounded the effect of treatment on the outcome with this partialling out.
Then, we fit a model to predict Y - Y^ from T - T^. This gives us our causal effect estimates.
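Here is a minimal sketch of these two stages in Python. The model choices are illustrative, and cross-fitting (commonly used in practice to avoid overfitting bias) is omitted for brevity:

```python
# A minimal double machine learning sketch with a linear second stage.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

def double_ml_ate(W, T, Y):
    # First stage: predict Y from W, and T from W. (For binary T, a
    # classifier's predicted probabilities could be used instead.)
    y_res = Y - GradientBoostingRegressor().fit(W, Y).predict(W)
    t_res = T - GradientBoostingRegressor().fit(W, T).predict(W)
    # Second stage: regress the Y residuals on the T residuals after
    # partialling out W; the slope is the causal effect estimate.
    stage2 = LinearRegression().fit(t_res.reshape(-1, 1), y_res)
    return stage2.coef_[0]
```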
Causal Trees and Forests
Another popular estimation method is to recursively partition the data into subsets that have the same treatment effects. This forms a causal tree, where the leaves are subsets of the population with similar causal effects. Since random forests generally perform better than decision trees, it would be great if this kind of strategy could be extended to random forests. And it can. This extension is known as causal forests. Importantly, these methods were developed with the goal of yielding valid confidence intervals for the estimates.
7.9. Concluding Remarks
7.9.1. Confidence Intervals
So far, we have only discussed point estimates for causal effects. We haven't discussed how to gauge our uncertainty due to data sampling, i.e. how to calculate confidence intervals on these estimates. This is a machine learning perspective, after all, and because we are allowing for arbitrary machine learning models in all of the estimators we discuss, it is actually quite difficult to get valid confidence intervals.
Bootstrapping
One way to get confidence intervals is to use bootstrapping. With bootstrapping, we repeat the causal effect estimation process many times, each time with a different sample (with replacement) from our data. This allows us to build an empirical distribution for the estimate. We can then compute whatever confidence interval we like from that empirical distribution. Unfortunately, bootstrapped confidence intervals are not always valid. For example, if we take a bootstrapped 95% confidence interval, it might not contain the true value (estimand) 95% of the time.
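Here is a minimal bootstrap sketch; `estimator` can be any function of (W, T, Y), e.g. the com_ate sketch from earlier in this chapter, and the interval below is a simple percentile interval (which, per the caveat above, is not always valid):

```python
# A minimal bootstrap confidence interval sketch.
import numpy as np

def bootstrap_ci(W, T, Y, estimator, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(Y)
    estimates = []
    for _ in range(n_boot):
        # Resample the data with replacement and re-run the whole
        # estimation process (including model fitting) each time.
        idx = rng.integers(0, n, size=n)
        estimates.append(estimator(W[idx], T[idx], Y[idx]))
    # Percentile interval from the empirical distribution of estimates.
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```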
Specialized Models
Another way to get confidence intervals is to analyze very specific models, rather than allowing for arbitrary models. Linear models are the simplest example of this; it is easy to get confidence intervals in linear models. Similarly, if we use a linear model as the second-stage model in double machine learning, we can get confidence intervals. Notably, causal trees and causal forests were developed with the goal of getting confidence intervals.
7.9.2. Comparison to Randomized Experiments
You might read somewhere that some of these adjustment techniques ensure that we've addressed confounding and isolated a causal effect. Of course, this is not true when there is unobserved confounding. These methods only address observed confounding. If there are any unobserved confounders, these methods don't fix that like randomization does. These adjustment methods aren't magic. And it's hard to know when it is reasonable to assume we've observed all confounders. That's why it is important to run a sensitivity analysis where we gauge how robust our causal effect estimates are to unobserved confounding. This is the topic of the next chapter.