code example (8): weighting on the propensity score
https://mixtape.scunning.com/05-matching_and_subclassification#example-the-nsw-job-training-program
There are several ways researchers can estimate average treatment effects using an estimated propensity score. As there are different ways in which the weights are incorporated into a weighting design, I discuss a few canonical versions of the method of inverse probability weighting and associated methods for inference.
Assuming that CIA holds in our data, one way we can estimate treatment effects is to use a weighting procedure in which each individual’s propensity score is used to weight that individual’s outcome (Imbens 2000). When aggregated, this has the potential to identify some average treatment effect. The weight enters the expression differently depending on each unit’s treatment status, and it takes one of two forms depending on whether the target parameter is the ATE or the ATT (the ATU is analogous and not shown here).
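In the chapter’s notation, with outcome $Y$, treatment indicator $D$, and propensity score $p(X)$, these weighting estimands are typically written as

$$
\delta_{ATE} = E\left[\, Y \cdot \frac{D - p(X)}{p(X)\,\bigl(1 - p(X)\bigr)} \,\right],
$$

$$
\delta_{ATT} = \frac{1}{P(D=1)}\, E\left[\, Y \cdot \frac{D - p(X)}{1 - p(X)} \,\right].
$$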
The sample versions of both ATE and ATT are obtained by a two-step estimation procedure. In the first step, the researcher estimates the propensity score using logit or probit. In the second step, the researcher uses the estimated score to produce sample versions of one of the average treatment effect estimators shown above. Those sample versions can be written as follows:
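(Notation: $\hat{p}(X_i)$ is the estimated propensity score, $N$ the sample size, and $N_T$ the number of treated units.)

$$
\widehat{\delta}_{ATE} = \frac{1}{N} \sum_{i=1}^{N} Y_i \cdot \frac{D_i - \hat{p}(X_i)}{\hat{p}(X_i)\,\bigl(1 - \hat{p}(X_i)\bigr)},
$$

$$
\widehat{\delta}_{ATT} = \frac{1}{N_T} \sum_{i=1}^{N} Y_i \cdot \frac{D_i - \hat{p}(X_i)}{1 - \hat{p}(X_i)}.
$$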
We have a few options for estimating the variance of this estimator, but one is simply to use bootstrapping, a procedure for estimating the variance of an estimator by resampling. In the context of inverse probability weighting, we would repeatedly draw, with replacement, a random sample from our original data (conventionally of the same size as the original) and use that resampled data to calculate the sample analogs of the ATE or ATT. More specifically, within each bootstrapped sample we would first estimate the propensity score and then use that estimated score to calculate the sample analog of the ATE or ATT, over and over, to obtain a distribution of treatment effects corresponding to different cuts of the data itself. If we do this 1,000 or 10,000 times, we get a distribution of parameter estimates from which we can calculate the standard deviation. This standard deviation serves as a standard error and gives us a measure of the dispersion of the parameter estimate under uncertainty about the sample itself.
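As a minimal sketch of that loop in Python, assume the data sit in a pandas DataFrame df with a treatment dummy treat, an outcome re78, and a list of covariate names covs; these column names are illustrative assumptions, not taken from any particular script:

```python
import numpy as np
import statsmodels.api as sm

def att_ipw(df, covs, treat="treat", outcome="re78"):
    """Two-step IPW: estimate the propensity score by logit, then plug
    it into the (non-normalized) sample ATT formula."""
    X = sm.add_constant(df[covs])
    pscore = np.asarray(sm.Logit(df[treat], X).fit(disp=0).predict(X))
    d, y = df[treat].to_numpy(), df[outcome].to_numpy()
    return np.sum(y * (d - pscore) / (1 - pscore)) / d.sum()

def bootstrap_se(df, covs, reps=1000, seed=0):
    """Resample rows with replacement, re-estimate the ATT each time,
    and report the standard deviation of those estimates."""
    rng = np.random.default_rng(seed)
    atts = []
    for _ in range(reps):
        draw = df.sample(n=len(df), replace=True,
                         random_state=int(rng.integers(1_000_000)))
        atts.append(att_ipw(draw, covs))
    return np.std(atts)
```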
(Bootstrapping and randomization inference are mechanically similar. Each randomizes something over and over and, under each randomization, re-estimates treatment effects to obtain a distribution of treatment effects. But that is where the similarity ends. Bootstrapping is a method for computing the variance of an estimator in which we take the treatment assignment as given; the uncertainty stems from the sample, not the treatment assignment. Thus each bootstrapped sample contains only a subset of the distinct observations in our real sample, with some drawn more than once and others not at all. That is not the source of uncertainty in randomization inference. In randomization inference, as you recall from the earlier chapter, the uncertainty concerns the treatment assignment, not the sample, and so we repeatedly re-randomize the treatment in order to reject or fail to reject Fisher’s sharp null of no individual treatment effects.)
The sensitivity of inverse probability weighting to extreme values of the propensity score has led some researchers to propose an alternative that handles extremes a bit better. Rather than giving each observation an equal weight of 1/N, this normalized estimator divides each unit’s inverse-probability weight by the sum of those weights within its treatment or control group. Because the weights then sum to one within each group, the estimator tends to be more stable.
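For the ATE, for example, a standard way to write the normalized estimator (with $\hat{p}(X_i)$ again the estimated score) is

$$
\widehat{\delta}_{ATE}^{\,\text{norm}} =
\frac{\sum_{i=1}^{N} D_i Y_i / \hat{p}(X_i)}{\sum_{i=1}^{N} D_i / \hat{p}(X_i)}
\;-\;
\frac{\sum_{i=1}^{N} (1-D_i)\,Y_i / \bigl(1-\hat{p}(X_i)\bigr)}{\sum_{i=1}^{N} (1-D_i) / \bigl(1-\hat{p}(X_i)\bigr)}.
$$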
When we estimate the treatment effect with the non-normalized inverse probability weighting procedure described earlier, we find an estimated ATT of −$11,876. Using the normalized weights, we get −$7,238. Why is this so different from what we get using the experimental data?
Recall what inverse probability weighting is doing. It is weighting treatment and control units according to $\hat{p}(X)$, which causes units with very small values of the propensity score to receive very large weights and become unusually influential in the calculation of the ATT. Thus, we will need to trim the data. Here we will do a very small trim to eliminate the mass of values at the far-left tail. Crump et al. (2009) develop a principled method for addressing a lack of overlap. A good rule of thumb, they note, is to keep only observations with propensity scores on the interval [0.1, 0.9], which is what the trimming step below implements.
Now let’s repeat the analysis after trimming on the propensity score, keeping only observations whose scores lie between 0.1 and 0.9. We find $2,006 using the non-normalized weights and $1,806 using the normalized weights. Both are very close to the true causal effect we know from the experimental data, $1,794, and the normalized estimate is even closer. We would still need to calculate standard errors, for instance with the bootstrapping procedure described above.
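A rough Python sketch of this comparison, using the same illustrative df, covs, treat, and re78 names as above (the exact numbers depend on the data and specification, so treat this as a template rather than a reproduction of the figures quoted here):

```python
import numpy as np
import statsmodels.api as sm

def att_weighted(df, covs, treat="treat", outcome="re78",
                 normalized=False, trim=None):
    """IPW estimate of the ATT; optionally trim on the estimated score."""
    X = sm.add_constant(df[covs])
    p = np.asarray(sm.Logit(df[treat], X).fit(disp=0).predict(X))
    d, y = df[treat].to_numpy(), df[outcome].to_numpy()
    if trim is not None:                       # e.g. trim=(0.1, 0.9)
        keep = (p >= trim[0]) & (p <= trim[1])
        d, y, p = d[keep], y[keep], p[keep]
    if not normalized:
        # Non-normalized ATT: treated weighted by 1, controls by p/(1-p).
        return np.sum(y * (d - p) / (1 - p)) / d.sum()
    # Normalized ATT: control weights rescaled to sum to one.
    w0 = p * (1 - d) / (1 - p)
    return np.sum(y * d) / d.sum() - np.sum(y * w0) / w0.sum()

# Usage (hypothetical): untrimmed vs. trimmed, non-normalized vs. normalized.
# for trim in (None, (0.1, 0.9)):
#     for norm in (False, True):
#         print(trim, norm, att_weighted(df, covs, normalized=norm, trim=trim))
```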
Nearest-neighbor matching
An alternative, very popular approach to inverse probability weighting is matching on the propensity score. This is often done by finding one or more units from the control-group donor pool whose propensity scores lie within some ad hoc chosen radius of the treated unit’s own propensity score. The researcher then averages their outcomes and assigns that average to the original treated unit as an imputation of its potential outcome under counterfactual control. Effort is then made to enforce common support through trimming.
But this method has been criticized by King and Nielsen (2019). The King and Nielsen (2019) critique is not of the propensity score itself. For instance, it does not apply to stratification based on the propensity score (Rosenbaum and Rubin 1983), regression adjustment, or inverse probability weighting. The problem concerns only nearest-neighbor matching, and it stems from the forced balance achieved through trimming, as well as myriad other common research choices made in the course of a project, which together ultimately amplify bias. King and Nielsen (2019) write: “The more balanced the data, or the more balance it becomes by trimming some of the observations through matching, the more likely propensity score matching will degrade inferences” (p. 1).
Nevertheless, nearest-neighbor matching, along with inverse probability weighting, is perhaps the most common way an estimated propensity score is used to estimate treatment effects. Nearest-neighbor matching on the propensity score pairs each treated unit i with one or more comparable control-group units j, where comparability is measured by distance on the propensity score. The matched control units’ outcomes are then plugged into a matched sample, and once we have that matched sample, we can calculate the ATT.
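In standard notation, the matching estimator of the ATT is

$$
\widehat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i = 1} \bigl( Y_i - Y_{i(j)} \bigr),
$$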
where $Y_{i(j)}$ is the outcome of the control-group unit (or the average over the matched units) matched to treated unit $i$, and $N_T$ is the number of treated units. We will focus on the ATT because of the problems with overlap that we discussed earlier.
I chose to match using five nearest neighbors. In other words, for each treated unit we find the five nearest units in the control group, where “nearest” is measured as closest on the propensity score itself. Unlike covariate matching, distance here is straightforward because of the dimension reduction afforded by the propensity score. We then average the actual outcomes of those five control units and assign that average to the treated unit as its matched counterfactual. Once we have that, we subtract each treated unit’s matched control average from its own outcome, sum those differences, and divide by $N_T$, the number of treated units. When we do that in Stata, we get an ATT of $1,725 with p < 0.05. Thus, it is both relatively precise and similar to what we find with the experiment itself.
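The estimate above comes from Stata; as a rough Python sketch of the same idea, with the same illustrative column names as before and scikit-learn used only for the neighbor search:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.neighbors import NearestNeighbors

def att_nn_match(df, covs, treat="treat", outcome="re78", k=5):
    """ATT from k-nearest-neighbor matching (with replacement) on the
    estimated propensity score."""
    X = sm.add_constant(df[covs])
    p = np.asarray(sm.Logit(df[treat], X).fit(disp=0).predict(X))
    treated = df[treat].to_numpy() == 1
    y = df[outcome].to_numpy()
    # Search for each treated unit's k closest control-group scores.
    nn = NearestNeighbors(n_neighbors=k).fit(p[~treated].reshape(-1, 1))
    _, idx = nn.kneighbors(p[treated].reshape(-1, 1))
    # Impute the counterfactual as the mean outcome of the k matches,
    # then average the treated-minus-matched differences.
    y0_hat = y[~treated][idx].mean(axis=1)
    return np.mean(y[treated] - y0_hat)
```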
Matching methods are an important member of the causal inference arsenal. Propensity scores are an excellent tool for checking the balance and overlap of covariates; this is an under-appreciated diagnostic, and one you might miss if you only ran regressions. The propensity score can make groups comparable, but only on the variables used to estimate the propensity score in the first place. The area continues to advance, with work on covariate balancing (Imai and Ratkovic 2013; Zubizarreta 2015; Zhao 2019) and doubly robust estimators (Bang and Robins 2005). Consider this chapter as being more about the mechanics of matching in exact and approximate matching situations.
Learning about the propensity score is particularly valuable given that it appears to have a very long half-life. For instance, propensity scores make their way into other contemporary designs too, such as difference-in-differences (Sant’Anna and Zhao 2018). So investing in a basic understanding of these ideas and methods is likely worthwhile. You never know when the right project comes along for which these methods are the perfect solution, so there’s no intelligent reason to write them off.
But remember, every matching solution to a causality problem requires a credible belief that the backdoor criterion can be achieved by conditioning on some matrix X, or what we’ve called CIA. This explicitly requires that there be no unobservable variables opening backdoor paths as confounders, which to many researchers requires a leap of faith so great that they are unwilling to make it. In some respects, CIA is a demanding assumption, because it takes deep institutional knowledge to say with confidence that no such unobserved confounder exists; the estimation method itself is easy compared to acquiring that domain-specific knowledge. So if you have good reason to believe that there are important, unobservable variables, you will need another tool. But if you are willing to make such an assumption, then these methods and others could be useful for you in your projects.