18. [F-learner] Plug-and-Play Estimators
https://matheusfacure.github.io/python-causality-handbook/20-Plug-and-Play-Estimators.html
So far, we’ve seen how to debias our data in the case where the treatment is not randomly assigned, which results in confounding bias. That helps us with the identification problem in causal inference. In other words, once the units are exchangeable, or Y(0),Y(1)⊥T|X, it becomes possible to learn the treatment effect. But we are far from done.
Identification means that we can find the average treatment effect. In other words, we know how effective a treatment is on average. Of course this is useful, as it helps us to decide if we should roll out a treatment or not. But we want more than that. We want to know if there are subgroups of units that respond better or worse to the treatment. That should allow for a much better policy, one where we only treat the ones that will benefit from it.
Problem Setup
Let’s recall our setup of interest. Given the potential outcomes, we can define the individual treatment effect as the difference between the potential outcomes.
$$\tau_i = Y_i(1) - Y_i(0)$$
or, in the continuous treatment case, τi = ∂Yi(t)/∂t, where t is the treatment variable. Of course, we can never observe the individual treatment effect, because we only get to see one of the potential outcomes:
$$Y_i = T_i\,Y_i(1) + (1 - T_i)\,Y_i(0)$$
We can define the average treatment effect (ATE) as
$$ATE = E[Y_i(1) - Y_i(0)] = E[\tau_i]$$
and the conditional average treatment effect (CATE) as
$$CATE = \tau(x) = E[Y_i(1) - Y_i(0)\,|\,X_i = x]$$
In Part I of this book, we’ve focused mostly on the ATE. Now, we are interested in the CATE. The CATE is useful for personalising a decision-making process. For example, if you have a drug as the treatment t, you want to know which types of patients are more responsive to the drug (higher CATE) and if there are some types of patients with a negative response (CATE < 0).
We’ve seen how to estimate the CATE using a linear regression with interactions between the treatment and the features
$$y_i = \beta_0 + \beta_1 t_i + \beta_2 X_i + \beta_3 t_i X_i + e_i$$
If we estimate this model, we can get estimates for τ(x)
$$\hat{\tau}(x) = \hat{\beta}_1 + \hat{\beta}_3 x$$
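As a quick illustration, here is a minimal sketch of that interacted regression in Python, using statsmodels on simulated data (the data-generating process and the variable names are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated data: one feature x, a randomized binary treatment t,
# and an outcome whose true treatment effect is 1 + 2*x
np.random.seed(123)
n = 10_000
df = pd.DataFrame({"x": np.random.uniform(-1, 1, n),
                   "t": np.random.binomial(1, 0.5, n)})
df["y"] = 1 + 2 * df["x"] + df["t"] * (1 + 2 * df["x"]) + np.random.normal(0, 1, n)

# y ~ b0 + b1*t + b2*x + b3*t*x
m = smf.ols("y ~ t * x", data=df).fit()

# CATE estimate: tau_hat(x) = b1_hat + b3_hat * x
df["cate_hat"] = m.params["t"] + m.params["t:x"] * df["x"]
print(df[["x", "cate_hat"]].head())
```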
Still, linear models have some drawbacks, the main one being the linearity assumption on X. Notice that you don’t even care about β2 in this model, but if the features X don’t have a linear relationship with the outcome, your estimates of the causal parameters β1 and β3 will be off.
It would be great if we could replace the linear model with a more flexible machine learning model. We could even plug the treatment in as a feature of an ML model, like boosted trees or a neural network,
$$y_i = M(X_i, T_i)$$
but from there, it is not clear how we can get treatment effect estimates, since this model will output ŷ predictions, not τ̂(x) predictions. Ideally, we would use a machine learning regression model that, instead of minimising the outcome MSE
$$\frac{1}{N}\sum_{i=1}^{N} \big(y_i - M(X_i, T_i)\big)^2$$
would minimise the treatment effect MSE
$$\frac{1}{N}\sum_{i=1}^{N} \big(\tau(x)_i - M(X_i)\big)^2$$
However, this criterion is what we call infeasible. Again, the problem here is that τ(x)i is not observable, so we can’t optimize it directly. This puts us in a tough spot… Let’s try to simplify it a bit and maybe we can think of something.

Target Transformation
Suppose your treatment is binary. Let’s say you are an investment firm testing the effectiveness of sending a financial education email. You hope the email will make people invest more. Also, let’s say you did a randomized study where 50% of the customers got the email and the other 50% didn’t.
Here is a crazy idea: let’s transform the outcome variable by multiplying it with the treatment.
$$Y^*_i = 2\,Y_i\,T_i - 2\,Y_i\,(1 - T_i)$$
So, if the unit was treated, you would take the outcome and multiply it by 2. If it wasn’t treated, you would take the outcome and multiply it by -2. For example, if one of your customers invested BRL 2000.00 and got the email, the transformed target would be 4000. However, if he or she didn’t get the email, it would be -4000.
This seems very odd, because you are saying that the effect of the email can be a negative number, but bear with me. If we do a little bit of math, we can see that, on average or in expectation, this transformed target will be the treatment effect. This is nothing short of amazing. What I’m saying is that by applying this somewhat wacky transformation, I get to estimate something that I can’t even observe.
To understand that, we need a bit of math. Because of random assignment, we have that T⊥Y(0),Y(1), which is our old unconfoundedness friend. That implies that E[T·Y(t)] = E[T]·E[Y(t)], a direct consequence of independence.
Also, we know that
$$T_i\,Y_i = T_i\,Y_i(1) \quad\text{and}\quad (1 - T_i)\,Y_i = (1 - T_i)\,Y_i(0)$$
because the treatment is what materializes one or the other potential outcome. With that in mind, let’s take the expected value of Yi∗ and see what we end up with.
$$\begin{aligned}
E[Y^*_i] &= E[2\,Y_i\,T_i - 2\,Y_i\,(1 - T_i)] \\
&= 2\,E[Y_i(1)\,T_i] - 2\,E[Y_i(0)\,(1 - T_i)] \\
&= 2\,E[Y_i(1)]\,E[T_i] - 2\,E[Y_i(0)]\,E[1 - T_i] \\
&= 2\,E[Y_i(1)] \cdot 0.5 - 2\,E[Y_i(0)] \cdot 0.5 \\
&= E[Y_i(1)] - E[Y_i(0)] = ATE
\end{aligned}$$

The third line uses independence between the treatment and the potential outcomes, and the fourth uses the 50% treatment probability.
So, this apparently crazy idea gives us a transformed target whose expectation is the treatment effect, and whose expectation conditional on X is the CATE τ(x). Now, we can replace our infeasible optimization criterion with
$$\frac{1}{N}\sum_{i=1}^{N} \big(Y^*_i - M(X_i)\big)^2$$
In simpler terms, all we have to do is use any regression machine learning model to predict Yi∗ and this model will output treatment effect predictions.
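To see the math in action, here is a tiny simulation (entirely made-up numbers) checking that the average of the transformed target recovers the true ATE under a 50/50 random assignment:

```python
import numpy as np

np.random.seed(42)
n = 100_000

# simulated potential outcomes with a true ATE of 2
y0 = np.random.normal(10, 3, n)
y1 = y0 + 2
t = np.random.binomial(1, 0.5, n)   # 50/50 random assignment
y = np.where(t == 1, y1, y0)        # observed outcome

# target transformation for the 50/50 case: Y* = 2*Y*T - 2*Y*(1 - T)
y_star = 2 * y * t - 2 * y * (1 - t)

print(y_star.mean())  # close to the true ATE of 2
```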
Now that we’ve solved the simple case, what about the more complicated case, where treatment is not 50% 50%, or not even randomly assigned? As it turns out, the answer is a bit more complicated, but not much. First, if we don’t have random assignment, we need at least conditional independence T⊥Y(1),Y(0)|X. That is, controlling for X, T is as good as random. With that, we can generalize the transformed target to
$$Y^*_i = Y_i\,\frac{T_i - e(X_i)}{e(X_i)\,\big(1 - e(X_i)\big)}$$
where e(Xi) is the propensity score. So, if the treatment is not 50% 50%, but randomized with a different probability p, all you have to do is replace the propensity score in the above formula with p. If the treatment is not random, then you have to use the propensity score, either stored or estimated.
If you take the expectation of this, you will see that it also matches the treatment effect. The proof is left as an exercise to the reader. Just kidding, here it is. It’s a bit cumbersome, so feel free to skip it.
$$\begin{aligned}
E[Y^*_i\,|\,X_i] &= E\left[\big(T_i\,Y_i(1) + (1 - T_i)\,Y_i(0)\big)\,\frac{T_i - e(X_i)}{e(X_i)(1 - e(X_i))}\,\middle|\,X_i\right] \\
&= E\left[\frac{T_i\,Y_i(1)\,(1 - e(X_i)) - (1 - T_i)\,Y_i(0)\,e(X_i)}{e(X_i)(1 - e(X_i))}\,\middle|\,X_i\right] \\
&= E\left[\frac{T_i\,Y_i(1)}{e(X_i)} - \frac{(1 - T_i)\,Y_i(0)}{1 - e(X_i)}\,\middle|\,X_i\right] \\
&= \frac{E[T_i\,|\,X_i]\,E[Y_i(1)\,|\,X_i]}{e(X_i)} - \frac{E[1 - T_i\,|\,X_i]\,E[Y_i(0)\,|\,X_i]}{1 - e(X_i)} \\
&= E[Y_i(1)\,|\,X_i] - E[Y_i(0)\,|\,X_i] \\
&= \tau(x)
\end{aligned}$$

The second line uses the fact that, for a binary treatment, Ti(Ti - e(Xi)) = Ti(1 - e(Xi)) and (1 - Ti)(Ti - e(Xi)) = -(1 - Ti)e(Xi); the fourth uses the conditional independence T⊥Y(1),Y(0)|X; and the fifth uses E[Ti|Xi] = e(Xi).
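In code, the general transformation is a one-liner. Here is a minimal sketch; the helper name transform_target is mine, and the propensity score can be a constant (randomized experiments) or an estimate from a model:

```python
import pandas as pd

def transform_target(y: pd.Series, t: pd.Series, ps) -> pd.Series:
    """Transformed target Y* = Y * (T - e(X)) / (e(X) * (1 - e(X))).

    ps can be a scalar (e.g. the treatment probability in a randomized
    experiment) or a Series of estimated propensity scores."""
    return y * (t - ps) / (ps * (1 - ps))
```

With ps = 0.5 this reduces to 2YT - 2Y(1 - T), the transformation we used above.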
As always, I think this will become much more concrete with an example. Again, consider the investment emails we’ve sent trying to make people invest more. The outcome variable is the binary converted (invested vs didn’t invest).

Our goal here is one of personalization. Let’s focus on email-1. We wish to send it only to those customers who will respond better to it. In other words, we wish to estimate the conditional average treatment effect of email-1
$$\tau(x) = E[\text{Converted}_i(1) - \text{Converted}_i(0)\,|\,X_i]$$
so that we can target those customers who will have the best response to the email (higher CATE).
But first, let’s break our dataset into a training and a validation set. We will estimate τ(x)i on one set and evaluate the estimates on the other.
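A minimal sketch of that split. The file name and the column names used from here on (em1 for the email-1 treatment, converted for the outcome, and a couple of customer features) are my assumptions; adapt them to the actual dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical file name; the data has one row per customer
data = pd.read_csv("invest_email_rnd.csv")

train, test = train_test_split(data, test_size=0.4, random_state=123)
print(train.shape, test.shape)
```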

Now, we will apply the target transformation we’ve just learned. Since the emails were randomly assigned (although not on a 50% 50% basis), we don’t need to worry about the propensity score. Rather, it is constant and equal to the treatment probability.
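Continuing the sketch, the propensity score is just the share of treated customers in the training set, and the transformation is one line (column names still assumed):

```python
# constant propensity score: the empirical probability of getting email-1
p = train["em1"].mean()

# transformed target: Y* = Y * (T - p) / (p * (1 - p))
y_star_train = train["converted"] * (train["em1"] - p) / (p * (1 - p))
```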

With the transformed target, we can pick any ML regression algorithm to predict it. Let’s use boosted trees here.
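One way to do that is with scikit-learn’s gradient boosting; the feature list X_cols is a placeholder for whatever covariates the dataset actually has:

```python
from sklearn.ensemble import GradientBoostingRegressor

X_cols = ["age", "income"]  # placeholder feature names

cate_model = GradientBoostingRegressor(n_estimators=200,
                                       max_depth=3,
                                       learning_rate=0.05)
cate_model.fit(train[X_cols], y_star_train)
```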

This model can now estimate τ(x)i. In other words, what it outputs is τ̂(x)i. If we make predictions on the test set, we will see that some units have a higher CATE than others. For example, customer 6958 has a CATE of 0.1, meaning the probability he or she will buy our investment product is predicted to increase by 0.1 if we send the email to this customer. In contrast, for customer 3903, the probability of buying the product is predicted to increase by just 0.04.
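Prediction works like with any regression model, except that the output is now a CATE estimate per customer, which we can use to rank them (continuing the sketch above):

```python
# predicted CATE for every customer in the test set
test = test.assign(cate_hat=cate_model.predict(test[X_cols]))

# rank customers by their predicted response to email-1
print(test.sort_values("cate_hat", ascending=False).head())
```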

You do get a lot of simplicity, since you can just transform the target and use any ML estimator to predict heterogeneous treatment effects. The cost is a lot of variance. That’s because the transformed target is a very noisy estimate of the individual treatment effect, and that variance gets transferred to your estimation. This is a huge problem if you don’t have a lot of data, but it should be less of a problem in big data applications, where you are dealing with more than 1MM samples.
The Continuous Treatment Case

Another obvious downside of the target transformation method is that it only works for discrete or binary treatments. This is something you see a lot in the causal inference literature. Most of the research is done for the binary treatment case, and you don’t find much about continuous treatments. That bothered me a lot, because in the industry, continuous treatments are everywhere, mostly in the form of prices you need to optimize. So, even though I couldn’t find anything regarding target transformations for continuous treatments, I came up with something that works in practice. Just keep in mind that there isn’t super solid econometric research behind it.
To motivate it, let’s go back to the ice cream sales example. There, we were tasked with estimating the price elasticity of demand so that we can better set ice cream prices to optimize our revenue. Recall that each sample in the dataset is a day, and we wish to know when people are less sensitive to price increases. Also, recall that prices are randomly assigned in this dataset, which means we don’t need to worry about confounding bias.

Now is where we need a little bit of creativity. For the discrete case, the conditional average treatment effect is given by how much the outcome changes when we go from untreated to treated, conditioned on unit characteristics X.
$$\tau(x) = E[Y_i(1) - Y_i(0)\,|\,X]$$
In plain English, this is estimating the impact of the treatment on different unit profiles, where profiles are defined using the features X. For the continuous case, we don’t have that on-off switch. Units are not treated or untreated. Rather, they are all treated, but with different intensities. Therefore, we can’t talk about the effect of giving the treatment. Rather, we need to speak in terms of increasing the treatment. In other words, we wish to know how the outcome would change if we increased the treatment by some amount. This is like estimating the partial derivative of the outcome function Y with respect to the treatment t. And because we wish to know that for each group (the CATE, not the ATE), we condition on the features X
$$\tau(x) = E\left[\frac{\partial Y_i(t)}{\partial t}\,\middle|\,X\right]$$
How can we estimate that? First, let’s consider the easy case, where the outcome is linear in the treatment. Suppose you have two types of days: hot days (yellow) and cold days (blue). On cold days, people are more sensitive to price increases. Also, as price increases, demand falls linearly.

In this case, the CATE will be the slope of each demand line. These slopes tell us how much demand will fall if we increase the price by any amount. If this relationship is indeed linear, we can estimate those elasticities with the slope coefficient of a simple linear regression, fit on hot days and on cold days separately.
$$\hat{\beta}_1 = \frac{\sum_i (t_i - \bar{t})(y_i - \bar{y})}{\sum_i (t_i - \bar{t})^2}$$
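In code, that amounts to two simple regressions of sales on price, one per group. A sketch, assuming the ice cream data has price, sales and temp columns (the file name, the column names and the temperature threshold are my assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical file name; one row per day
prices = pd.read_csv("ice_cream_sales_rnd.csv")

hot = prices[prices["temp"] > 25]    # assumed threshold for a "hot" day
cold = prices[prices["temp"] <= 25]

# the slope on price is the (constant) elasticity of each group
elasticity_hot = smf.ols("sales ~ price", data=hot).fit().params["price"]
elasticity_cold = smf.ols("sales ~ price", data=cold).fit().params["price"]
print(elasticity_hot, elasticity_cold)
```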
We can be inspired by this estimator and think about what it would be like for an individual unit. In other words, what if we had that same thing up there, defined for each day? In my head, it would be something like this:
$$Y^*_i = \frac{(Y_i - \bar{Y})(T_i - \bar{T})}{\sigma^2_T}$$
In plain English, we would transform the original target by subtracting the mean from it, then multiply it by the treatment, from which we’ve also subtracted the mean. Finally, we would divide it by the treatment variance. There we have it: a target transformation for the continuous case.

The question now is: does it work? As a matter of fact, it does, and we can go over a similar proof for why it works, just like we did in the binary case. First, let’s call Ȳ = E[Yi|X] and T̄ = E[Ti|X]. Taking the expectation of the transformed target, conditional on X,

$$E[Y^*_i\,|\,X] = \frac{E[(Y_i - \bar{Y})(T_i - \bar{T})\,|\,X]}{\sigma^2_T} = \frac{Cov(Y_i, T_i\,|\,X)}{Var(T_i\,|\,X)}$$

which is exactly the slope coefficient of a regression of the outcome on the treatment among units with features X, that is, the elasticity we are trying to estimate.
If all that math seems tiresome, don’t worry. The code is actually very simple. Once again, we transform our training target with the formulas seen above. Here, we have random treatment assignments, so we don’t need to build a model that predicts prices. I’m also omitting the denominator, because here I only care about ordering the treatment effect.
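Here is a sketch of that transformation, reusing the assumed ice cream columns. The means are taken on the training set, and the denominator is dropped because it only rescales the predictions without changing their order:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# hypothetical file and column names
prices = pd.read_csv("ice_cream_sales_rnd.csv")
train, test = train_test_split(prices, test_size=0.3, random_state=123)

# continuous target transformation: (Y - Y_bar) * (T - T_bar)
# (the Var(T) denominator is omitted: it only rescales the CATE)
y_star_cont = ((train["sales"] - train["sales"].mean())
               * (train["price"] - train["price"].mean()))

X_cols = ["temp", "weekday", "cost"]  # assumed feature names
cate_model = GradientBoostingRegressor().fit(train[X_cols], y_star_cont)
test = test.assign(cate_hat=cate_model.predict(test[X_cols]))
```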

This time, the CATE’s interpretation is less intuitive. Since we’ve removed the denominator from the target transformation, the CATE we are seeing is scaled by the treatment variance Var(t). However, the predictions should still order the treatment effect pretty well.
Non Linear Treatment Effects
Having talked about the continuous case, there is still an elephant in the room we need to address. We’ve assumed that the treatment effect is linear in the treatment. However, that is very rarely a reasonable assumption. Usually, treatment effects saturate in one form or another. In our example, it’s reasonable to think that demand will fall quickly for the first units of price increase, and then fall more slowly.

The problem here is that the elasticity, or treatment effect, changes with the treatment itself. In our example, the treatment effect is more intense at the beginning of the curve and smaller as prices get higher. Again, suppose you have two types of days: hot days (yellow) and cold days (blue), and we want to distinguish between the two with a causal model. The thing is that causal models should predict elasticity, but in the nonlinear case, the elasticity for hot and cold days could be the same if we look at different price points on the curve.
There is no easy way out of this problem, and I confess I’m still investigating what works best. For now, what I do is try to think about the functional form of the treatment effect and somehow linearize it. For example, demand usually has the following functional form, where a higher α means that demand falls faster with each price increase
$$Y_i = e^{\beta X_i}\, t_i^{-\alpha_i}$$
So, if I apply the log transformation to both the demand Y and prices T, I should get something that is linear.
$$\log(Y_i) = \beta X_i - \alpha_i \log(t_i)$$
Linearization is not so easy to do, as it involves some thinking. But you can also try stuff out and see what works best. Often, things like logs and square roots help.
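Under that constant-elasticity assumption, the linearization is just a log transform of both sides before applying the same recipe. A sketch, continuing the ice cream example above (and assuming sales and prices are strictly positive):

```python
import numpy as np

# log-linearize: with Y = e^(b*X) * t^(-a), log(Y) is linear in log(t)
train = train.assign(log_sales=np.log(train["sales"]),
                     log_price=np.log(train["price"]))

# same continuous target transformation, now on the logged variables
y_star_log = ((train["log_sales"] - train["log_sales"].mean())
              * (train["log_price"] - train["log_price"].mean()))
```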
Key Ideas
We are now moving in the direction of estimating conditional average treatment effects using machine learning models. The biggest challenge when doing so is adapting a predictive model to one that estimates causal effects. Another way of thinking about it is that predictive models focus on estimating the outcome Y as a function of features X and possibly the treatment T, that is, Y = M(X, T), while causal models need to estimate the partial derivative of this output function with respect to the treatment, ∂Y/∂T = ∂M(X, T)/∂T. This is far from trivial, because while we do observe the outcome Y, we can’t observe ∂Y/∂T, at least not on an individual level. As a consequence, we need to be creative when designing an objective function for our models.
Here, we saw a very simple technique of target transformation. The idea is to combine the original target Y with the treatment T to form a transformed target which is, in expectation, equal to the CATE. With that new target, we can plug any predictive ML model to estimate it and then the model predictions will be CATE estimates. As a side note, besides target transformation, this method also goes by the name of F-Learner.
With all that simplicity, there is also a price to pay. The transformed target is a very noisy estimate of the individual treatment effect and that noise will be transferred to the model estimates in the form of variance. This makes target transformation better suited for big data applications, where variance is less of a problem due to sheer sample sizes. Another downside of the target transformation method is that it is only defined for binary or categorical treatments. We did our best to come up with a continuous version of the approach and even ended up with something that seemed to work, but up until now, there is no solid theoretical framework to back it up.
Finally, we’ve ended with a discussion on non linear treatment effects and the challenges that come with it. Namely, when the treatment effect changes with the treatment itself, we might mistakenly think units have the same treatment response curve because they have the same responsiveness to the treatment, but actually they are just receiving different treatment amounts.
References:
Susan Athey and Guido W. Imbens, Machine Learning Methods for Estimating Heterogeneous Causal Effects. https://gsb-faculty.stanford.edu/guido-w-imbens/files/2022/04/3350.pdf
Künzel et al., Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning, 2019.