Double Machine Learning Explained: A Comprehensive Guide

In the realm of data science, predicting outcomes based on variables is a common task. However, understanding the underlying causal relationships and how to influence them is often more valuable. This is where causal machine learning (causal ML) comes into play, and double machine learning (DML) is a powerful tool within this field. This article aims to provide a comprehensive explanation of double machine learning, its benefits, and its applications, drawing upon various resources and examples.

The Need for Causal Inference

Traditional machine learning excels at prediction, but businesses often need to understand why things happen and how to intervene. For example, predicting customer churn is useful, but knowing why customers churn and how to prevent it is crucial.

Causal ML focuses on estimating the causal effect of a treatment (T) on an outcome (Y). While A/B testing is a form of causal inference, it's not always feasible. For instance, understanding how delivery speed affects customer lifetime value is challenging to test directly, as intentionally slowing deliveries could harm customer satisfaction. In such cases, causal ML provides an alternative.

Limitations of Linear Regression

One might consider using linear regression to estimate treatment effects. For example, to see how personalized ads (T) affect user engagement (Y), a simple regression equation could be:

Y = βT + γX + ϵ


Where:

  • Y is user engagement
  • β is the coefficient representing the effect of personalized ads on engagement
  • X is a vector of control variables (user age, browsing history, etc.)
  • ϵ is the error term

However, this approach has limitations:

  1. Linearity Assumption: Real-world data is often more complex than a linear relationship can capture, leading to biased estimates of β.
  2. Confounding Variables: Omitting key confounders in X means β may capture the effect of those missing variables, rather than the true causal effect of T.
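A small simulation makes the linearity problem concrete. In this hypothetical setup, a confounder X affects both the treatment and the outcome through a sine function; controlling for X linearly still leaves the true effect (here 1.0) badly overestimated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)
# Treatment depends non-linearly on the confounder X
T = np.sin(2 * X) + 0.5 * rng.normal(size=n)
# Outcome: true causal effect of T is 1.0, plus a non-linear effect of X
Y = 1.0 * T + np.sin(2 * X) + 0.5 * rng.normal(size=n)

# Controlling for X *linearly* cannot absorb the non-linear confounding,
# so the coefficient on T picks up part of the confounder's effect
ols = LinearRegression().fit(np.column_stack([T, X]), Y)
beta_hat = ols.coef_[0]
print(f"estimated effect: {beta_hat:.2f} (true effect: 1.0)")
```

The linear control leaves the non-linear part of sin(2X) in the error term, and since T is built from the same function, the estimate of β absorbs it.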

Introducing Double Machine Learning

Double Machine Learning (DML), also known as de-biased machine learning, addresses the limitations of traditional methods. Proposed by MIT econometrician Victor Chernozhukov and co-authors, DML aims to:

  • Develop a causal estimator that leverages the flexibility of non-parametric machine learning models.
  • Reduce bias.
  • Provide valid confidence intervals.
  • Achieve a "root-n-consistent" estimator, where the estimation error approaches zero at a rate of 1/√n as the sample size (n) increases.

The Frisch-Waugh-Lovell (FWL) Theorem: A Foundation for DML

One of the key foundations of DML is the Frisch-Waugh-Lovell (FWL) theorem. This theorem isolates the effect of T through these steps:

  1. Regress T on X (control variables) and obtain the residuals η.
  2. Regress Y on X and obtain the residuals ν.
  3. Regress ν on η.

At the end of this procedure, the slope coefficient from the regression of ν on η equals the true causal effect β.
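The theorem is easy to verify numerically. The sketch below (synthetic data, with sklearn's LinearRegression as the workhorse) runs the three steps and confirms that the residual-on-residual slope matches the coefficient on T from the full multivariate regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 3))
beta_true = 2.0
T = X @ np.array([0.5, -1.0, 0.3]) + rng.normal(size=n)
Y = beta_true * T + X @ np.array([1.0, 0.5, -0.7]) + rng.normal(size=n)

# Step 1: regress T on X; keep the residuals eta
eta = T - LinearRegression().fit(X, T).predict(X)
# Step 2: regress Y on X; keep the residuals nu
nu = Y - LinearRegression().fit(X, Y).predict(X)
# Step 3: regress nu on eta; the slope is beta
beta_fwl = LinearRegression().fit(eta.reshape(-1, 1), nu).coef_[0]

# FWL: identical to the coefficient on T in the full regression of Y on [T, X]
beta_full = LinearRegression().fit(np.column_stack([T, X]), Y).coef_[0]
print(beta_fwl, beta_full)
```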


While this procedure still uses linear regression to model the causal effect, steps 1 and 2 extract the variation in T and Y that is independent of X, meaning the residuals are free from the influence of the confounders.

Orthogonalization: Enhancing the FWL Procedure

Double ML enhances the FWL process by replacing the linear regression in steps 1 and 2 with any machine learning model (e.g., XGBoost, LightGBM). This modification is called orthogonalization. The strength of orthogonalization lies in its ability to capture complex, non-linear relationships in the data. The general equation can be written as:

Y = βT + f(X) + ϵ

T = g(X) + ϵ’

where f(.) and g(.) can be any non-linear functions of X.


By using machine learning instead of traditional linear regression, we can more effectively separate the estimation of the causal parameter β from other factors that might confuse our results. In causal inference, these confusing factors are called “nuisance parameters.”
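As an illustrative sketch of orthogonalization, the snippet below swaps the first-stage linear regressions for gradient boosting on simulated data with non-linear f(X) and g(X). Note that the nuisance models here are fit and predicted in-sample, so a small overfitting bias remains; that is exactly what CrossFitting, discussed below, corrects:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 4000
X = rng.uniform(-2, 2, size=(n, 2))
beta_true = 1.5
# T = g(X) + noise, with a non-linear g
T = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.5 * rng.normal(size=n)
# Y = beta*T + f(X) + noise, with a non-linear f
Y = beta_true * T + X[:, 0] ** 2 - X[:, 1] + 0.5 * rng.normal(size=n)

# Replace the linear first-stage regressions with flexible ML models
eta = T - GradientBoostingRegressor().fit(X, T).predict(X)
nu = Y - GradientBoostingRegressor().fit(X, Y).predict(X)
# Final stage: linear regression of the outcome residuals on the treatment residuals
beta_hat = LinearRegression().fit(eta.reshape(-1, 1), nu).coef_[0]
print(beta_hat)
```

A linear first stage would miss sin, cos, and the quadratic term entirely; the boosted trees capture them without any parametric assumption.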

Addressing Biases in DML

DML helps reduce two main types of biases:

  1. Regularization Bias: Regularization methods (e.g., lasso, ridge) can introduce bias into the treatment effect estimates by shrinking coefficients, which may distort the treatment-outcome relationship.
  2. Overfitting Bias: Overfitting occurs when a model captures spurious correlations, resulting in high variance and biased causal effect estimates.

To address these biases, DML applies orthogonalization for regularization bias and CrossFitting for overfitting bias.

CrossFitting: Correcting for Overfitting

CrossFitting follows this procedure:

  1. Split the data into K folds.
  2. For each fold k, train the machine learning models on the remaining K-1 folds.
  3. Use the trained models to predict the outcome and treatment for the observations in fold k.
  4. Estimate the treatment effect using the predictions from each fold.
  5. Average the treatment effect estimates across all folds.

CrossFitting helps correct bias, while cross-validation is used to evaluate how well a model performs.
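Putting the pieces together, here is a minimal sketch of the CrossFitting procedure on synthetic data. It pools the out-of-fold residuals into a single final regression (a common variant of steps 4-5), with gradient boosting as the assumed nuisance learner:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 4000
X = rng.uniform(-2, 2, size=(n, 2))
beta_true = 1.5
T = np.sin(X[:, 0]) + 0.5 * rng.normal(size=n)
Y = beta_true * T + X[:, 1] ** 2 + 0.5 * rng.normal(size=n)

eta = np.zeros(n)
nu = np.zeros(n)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Nuisance models are trained on K-1 folds and predict only on the held-out fold,
    # so the residuals are free of in-sample overfitting
    model_t = GradientBoostingRegressor().fit(X[train_idx], T[train_idx])
    model_y = GradientBoostingRegressor().fit(X[train_idx], Y[train_idx])
    eta[test_idx] = T[test_idx] - model_t.predict(X[test_idx])
    nu[test_idx] = Y[test_idx] - model_y.predict(X[test_idx])

# Final stage on the pooled out-of-fold residuals
beta_hat = LinearRegression().fit(eta.reshape(-1, 1), nu).coef_[0]
print(beta_hat)
```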

With Double Machine Learning, we overcome many common problems, achieving an estimator with low variance and bias and providing valid confidence intervals.

DML in Practice: Estimating the Effect of a Training Program on Earnings

Let’s consider a classic example: estimating the causal effect of a training program on earnings using the Lalonde dataset. This dataset includes information on individuals who participated in a job training program and those who did not, with covariates such as age, education, race, and marital status. Our goal is to apply Double Machine Learning (DML) to estimate the ATE of the training program on earnings.

The Average Treatment Effect (ATE) can be written as:

ATE = E[Yi(1) - Yi(0)]

Where:

  • Yi(1) is the potential outcome under treatment
  • Yi(0) is the potential outcome under no treatment

Since we can’t see both outcomes for the same person, we estimate them.

  1. Data Visualization:

    We want to compare outcomes between the control group and the treatment group. It's important to examine the characteristics of each group to see whether they are similar or noticeably different. At first glance, the control group might seem to have higher earnings than the treatment group, which could suggest a negative effect of the job training. However, to accurately estimate the causal effect, we need to ensure that the two groups are similar in terms of other factors; this is known as checking for covariate balance. Covariate balance ensures that the groups are comparable and that any observed differences are due to the training itself. In this dataset the groups are not balanced, so the differences we see might be driven by age or other covariates rather than the treatment. To fix this, we use matching techniques to make the groups similar.

  2. Data Preprocessing:

    We need to ensure that our treatment and control groups are comparable so that any differences in outcomes are genuinely due to the treatment itself, not other factors. One way to solve the covariate balance issue is by using matching techniques. These methods help us make the groups more similar by controlling for factors like age, education, or prior income that could otherwise bias our estimate of the treatment effect.

    • Propensity Score Matching: This technique estimates the probability of receiving the treatment based on covariates, known as the propensity score. In our example, studying the effect of job training on earnings, the propensity score might estimate the likelihood of receiving the training program based on factors like age and education. Individuals with similar propensity scores are then matched between the treatment and control groups. This helps create a balanced dataset where these covariates are similar in both groups. To estimate these propensity scores, we can use logistic regression, but, as always, we need to be sure that our matching is relevant. Here's how you can check that:

      • Variance Ratio: compare the variance of a covariate in both groups. We want this ratio to be close to 1, meaning the groups are now similar.
      • A/A Testing: We also compare the outcomes of the matched groups. If the groups are well-matched, any differences in outcomes should be small and not statistically significant, showing that the groups are comparable.
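To make these diagnostics concrete, here is an illustrative sketch using synthetic stand-ins for the Lalonde covariates (age and education are simulated, not the real data): a logistic regression estimates the propensity score, and a small helper computes the variance ratio for a covariate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic covariates standing in for the Lalonde data (assumption, not real data)
rng = np.random.default_rng(3)
n = 1000
age = rng.integers(18, 55, size=n)
educ = rng.integers(6, 16, size=n)
X = np.column_stack([age, educ])
# Treatment assignment depends mildly on age and education
p_assign = 1 / (1 + np.exp(-(0.05 * (age - 35) - 0.1 * (educ - 10))))
treat = rng.binomial(1, p_assign)

# Propensity score: estimated probability of treatment given covariates
ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]

def variance_ratio(covariate, treat):
    # Variance of a covariate in the treated group over the control group;
    # values close to 1 indicate balance on that covariate
    return covariate[treat == 1].var() / covariate[treat == 0].var()

print(variance_ratio(age.astype(float), treat))
```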
  3. Modeling with DoWhy and EconML:

    Now, let’s set up our causal model! In the code below, we’re defining our treatment, outcome, and the factors we think might influence both. In a real-world scenario, you should always design your causal graph (if you’re unfamiliar with causal graphs, check out my article on causal inference for a deep dive):

    from dowhy import CausalModel

    estimand = CausalModel(
        data=df,
        treatment='treat',
        outcome='re78',
        common_causes=['nodegree', 'black', 'hispan', 'age', 'educ', 'married']
    )
    identified_estimand = estimand.identify_effect()
    print(identified_estimand)

    The output of the estimand might look a little scary, but trust me, it’s straightforward once we break it down. Let’s analyze it step by step:

    • Estimand name (Backdoor): This is the identification method used to estimate the treatment effect. In a nutshell, backdoor adjustment estimates the treatment effect while controlling for specific variables, assuming no other unobserved confounders affect both the treatment and the outcome. If you're interested in the details of this and other identification methods, check my earlier article.
    • Estimand expression: This represents the Average Treatment Effect, showing how job training affects earnings after controlling for the common causes.
    • Estimand Assumption 1, Unconfoundedness: This assumption means that, after we control for all the covariates (like race, age, education, etc.), there are no hidden factors influencing both the treatment and the outcome. In simpler terms, after adjusting for these variables, any remaining difference in earnings between treated and untreated individuals is because of the treatment itself.

    Now that we’ve set up our estimand, it’s time for EconML to take over!

    3.1 EconML with DML: Estimating Treatment Effect

    Next, we implement the Double Machine Learning estimator using EconML and apply it to our data. For this, we’ll use the LinearDML estimator.

    Notice that we pass 'backdoor.econml.dml.LinearDML' to the method_name parameter. Here's where we add the extra argument 'discrete_treatment' and set it to True. This is crucial because our treatment is binary (either you got the training or you didn't), and by default, EconML assumes treatments are continuous.

    In this example, we’ll use Logistic Regression to model the propensity score (the probability of receiving the treatment). For simplicity, we won’t dive into checking the quality of the matching here, but keep in mind that in a real-world scenario, it’s super important to evaluate it!

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LogisticRegression

    estimate = estimand.estimate_effect(
        identified_estimand,
        method_name='backdoor.econml.dml.LinearDML',
        target_units='ate',
        method_params={
            'init_params': {
                'model_y': RandomForestRegressor(),
                'model_t': LogisticRegression(max_iter=1000),
                'discrete_treatment': True
            },
            'fit_params': {}
        }
    )
    print(estimate)

    If you print the estimate, you’ll see the output includes the effect estimate and a few other key metrics. Let’s break it down:

    • Realized estimand: This is the actual equation that the model has identified as being relevant for estimating the causal effect. You might be wondering: “Why is the equation linear if we’re using fancy models like Random Forest?” Great question! Yes, the final equation is linear, but don’t forget how DML works. In the first stage, DML uses non-linear models like Random Forest to predict both the outcome and the treatment based on covariates. This flexible, machine-learning-powered stage helps capture complex relationships. Then, in the second stage, a linear regression is applied to the residuals from those models to estimate the treatment effect.
    • Estimate: it’s the actual treatment effect estimate, which is the result of all the heavy lifting our model has done behind the scenes.

    The job training program had a solid impact, with participants earning an average of $1,348.95 more than those who didn’t participate.

DML: A Deeper Dive into the Methodology

DML builds upon the principles of the FWL theorem and orthogonalization to provide robust causal estimates. The method involves several key steps:

  1. Model Specification: Define the outcome (Y), treatment (T), and control variables (X).
  2. Nuisance Parameter Estimation: Use machine learning models to estimate the relationships between the control variables and both the outcome (f(X)) and the treatment (g(X)).
  3. Residual Calculation: Calculate the residuals for both the outcome (Y - f(X)) and the treatment (T - g(X)).
  4. Treatment Effect Estimation: Estimate the treatment effect (β) by regressing the outcome residuals on the treatment residuals.
  5. Cross-Fitting: Use cross-fitting to mitigate overfitting bias by splitting the data into K folds and iteratively estimating the nuisance parameters and treatment effect on different subsets of the data.

Advantages of Double Machine Learning

DML offers several advantages over traditional methods for causal inference:

  • Flexibility: DML can handle complex, non-linear relationships between variables, making it suitable for a wide range of real-world scenarios.
  • Robustness: DML is less sensitive to model misspecification and confounding variables than traditional methods, providing more reliable causal estimates.
  • Efficiency: DML can leverage the power of machine learning to efficiently estimate causal effects in high-dimensional data settings.
  • CATE Estimation: DML is particularly well-suited for estimating Conditional Average Treatment Effects (CATEs), allowing researchers to examine how treatment effects vary across different subgroups of the population.

Applications of Double Machine Learning

DML has been applied in a variety of fields to estimate causal effects, including:

  • Economics: Estimating the impact of job training programs on earnings, the effect of education on wages, and the causal effect of air pollution on housing prices.
  • Marketing: Measuring the impact of advertising spend on sales, the effect of personalized ads on user engagement, and the causal effect of marketing campaigns on customer behavior.
  • Healthcare: Estimating the effectiveness of medical treatments, the impact of public health interventions, and the causal effect of lifestyle factors on health outcomes.

Limitations and Considerations

While DML is a powerful tool for causal inference, it is important to be aware of its limitations and potential pitfalls:

  • Assumptions: DML relies on certain assumptions about the causal structure of the data, such as the absence of unmeasured confounders. Violations of these assumptions can lead to biased causal estimates.
  • Model Selection: The choice of machine learning models for estimating the nuisance parameters can impact the accuracy of the causal estimates. Careful consideration should be given to model selection and validation.
  • Computational Complexity: DML can be computationally intensive, especially when dealing with high-dimensional data or complex machine learning models.

DML and Meta-Learners

Double/Debiased ML can be seen as the Frisch-Waugh-Lovell theorem on steroids. ML models are extremely flexible, so they can capture interactions and non-linearities when estimating the Y and T residuals while still maintaining an FWL-style orthogonalization. This means we don't have to make any parametric assumptions about the relationship between the covariates (X) and the outcome (Y), nor between the covariates and the treatment, in order to get the correct treatment effect.

The power you gain with ML is flexibility. ML is so powerful that it can capture complicated functional forms in the nuisance relationships.

This generalization comes from realizing that the Double/Debiased ML procedure defines a new loss function that we can minimize however we want. The nice thing about Double ML is that it frees us from the hassle of learning the nuisance parameters in a causal model, letting us focus all our attention on learning the causal parameter of interest, be it the ATE or the CATE.
