Instrumental Variables

You should have read the Getting Started page and have PyFixest installed.

```python
import pyfixest as pf
```
Introduction
Estimation of a linear model via Ordinary Least Squares (OLS) yields biased and inconsistent estimates when a regressor is correlated with the error term — a problem known as endogeneity, which arises, for example, in the presence of unobserved confounders. Instrumental Variable (IV) estimation addresses this by finding a variable \(Z\) that satisfies three conditions:
- Relevance: \(Z\) has a causal effect on \(T\).
- Exclusion Restriction: \(Z\)’s causal effect on \(Y\) is fully mediated by \(T\).
- Instrumental Unconfoundedness: \(Z\) has no unobserved common causes with \(Y\).
In Figure 1, the path from \(Z\) to \(T\) shows relevance, the absence of a direct path from \(Z\) to \(Y\) encodes the exclusion restriction, and the absence of a path from the unobservable to the instrument shows instrumental unconfoundedness.
PyFixest estimates IV models using the two-stage least squares (2SLS) estimator: it first projects \(T\) onto \(Z\) (and all other exogenous variables) to obtain \(\hat{T}\), then uses \(\hat{T}\) to estimate the causal effect of \(T\) on \(Y\). Because \(\hat{T}\) is not a function of \(U\), we can think of the dashed path from the unobserved variable as blocked or removed.
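The two stages can be sketched "by hand" with NumPy on simulated data. This is an illustration of the mechanics only, not PyFixest's internal code; all variable names and coefficient values below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
U = rng.normal(size=n)                # unobserved confounder
Z = rng.normal(size=n)                # instrument: shifts T, affects Y only through T
T = 0.8 * Z + U + rng.normal(size=n)  # endogenous treatment
Y = 2.0 * T + U + rng.normal(size=n)  # true causal effect of T on Y is 2.0

# Stage 1: project T onto [1, Z] to obtain T_hat
X1 = np.column_stack([np.ones(n), Z])
pi = np.linalg.lstsq(X1, T, rcond=None)[0]
T_hat = X1 @ pi

# Stage 2: regress Y on [1, T_hat]
X2 = np.column_stack([np.ones(n), T_hat])
beta = np.linalg.lstsq(X2, Y, rcond=None)[0]

# Naive OLS of Y on T is biased upward by U; 2SLS recovers ~2.0
beta_ols = np.linalg.lstsq(np.column_stack([np.ones(n), T]), Y, rcond=None)[0]
print(beta_ols[1], beta[1])
```

The naive OLS slope is pushed away from 2.0 by the confounder \(U\), while the two-stage estimate lands close to the true effect.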
In PyFixest, the IV syntax is:

```
Y ~ exogenous_controls | fixed_effects | endogenous ~ instrument
```
When panel data are available, endogeneity may also stem from time-invariant unobserved heterogeneity — unit-specific characteristics (e.g., ability, culture, geography) that are fixed over time but correlated with both treatment and outcome. PyFixest addresses this simultaneously by applying a within-transformation (demeaning) to absorb unit fixed effects before running 2SLS, following the FE-IV approach described in Wooldridge (2010, Ch. 11). Crucially, after demeaning, the instrument must retain within-unit variation over time — time-invariant instruments are eliminated along with the fixed effects and cannot be used for identification. When both fixed effects and an instrument are specified, PyFixest therefore isolates the clean variation in treatment that is both within-unit and driven by the instrument, blocking confounding from time-invariant unobservables and time-varying endogenous confounders simultaneously.
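A tiny sketch makes the "time-invariant instruments are eliminated" point concrete. The data below are made up for illustration; the within-transformation is written out manually rather than calling PyFixest:

```python
import numpy as np

unit = np.array([0, 0, 0, 1, 1, 1])
z_time_invariant = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])  # constant per unit
z_time_varying = np.array([1.0, 2.0, 3.0, 2.0, 4.0, 6.0])    # varies within unit

def within_demean(x, unit):
    """Subtract each unit's mean from that unit's observations."""
    out = x.astype(float).copy()
    for u in np.unique(unit):
        mask = unit == u
        out[mask] -= x[mask].mean()
    return out

print(within_demean(z_time_invariant, unit))  # all zeros: no identifying variation left
print(within_demean(z_time_varying, unit))    # retains within-unit variation
```

After demeaning, the time-invariant instrument is identically zero, so it can no longer move the (demeaned) treatment in the first stage.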
This tutorial walks through three applications, all addressing endogeneity from selection bias — a form of Omitted Variable Bias (OVB) where unobserved confounders drive both selection into treatment and the outcome. Application 1 operates in an observational setting, exploiting quasi-random variation from individual-level selection. Application 2 arises in an experimental setting where encouragement is randomly assigned but treatment take-up remains subject to self-selection. Application 3 returns to an observational setting, exploiting regional-level sorting via a shift-share instrument.
Application 1: The Motherhood Penalty
Does having children reduce women’s earnings?
A naive regression of earnings on fertility is biased: women with stronger career ambitions may both earn more and be less likely to have children. Since career ambition is positively correlated with earnings but negatively correlated with fertility, OLS overstates the motherhood penalty, i.e. the reduction in earnings mothers experience after childbirth. This is a classic case of omitted variable bias (OVB): career ambition is unobserved, yet it drives both fertility decisions and earnings outcomes. To disentangle the true causal effect of having a child on earnings from the confounding influence of career ambition, we need a "quasi-random" source of variation in fertility that is independent of career ambition.
Lundborg, Plug, and Rasmussen (2017) find such an “instrument” in the quasi-random success of in-vitro fertilization (IVF) treatment. IVF success is largely determined by biological factors outside a woman’s control, making it difficult to conceive of an unobserved variable that jointly drives both the success of treatment and labor market outcomes.
Synthetic Data
```python
ivf_df = pf.get_ivf_data()
ivf_df.head()
```

|   | earnings | num_children | ivf_success |
|---|---|---|---|
| 0 | 7.394832 | 1.159067 | 0 |
| 1 | 12.714616 | 1.166081 | 0 |
| 2 | 9.621299 | 2.129572 | 1 |
| 3 | 11.543746 | 1.931092 | 0 |
| 4 | 9.333634 | 1.185489 | 0 |
We create a synthetic dataset with \(N = 2{,}000\) observations. The true causal effect of num_children on earnings is \(\beta = -0.15\) — this is the treatment effect the IV estimator should recover (full DGP in the Appendix).
Naive OLS
Without accounting for endogeneity, OLS overstates the penalty because career ambition is an omitted variable that increases earnings while reducing fertility:
```python
fit_ols = pf.feols("earnings ~ num_children", data=ivf_df)
fit_ols.summary()
```

###
Estimation: OLS
Dep. var.: earnings
sample: None = all
Inference: iid
Observations: 2000
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 10.764 | 0.059 | 181.787 | 0.000 | 10.648 | 10.880 |
| num_children | -0.631 | 0.034 | -18.441 | 0.000 | -0.698 | -0.564 |
---
RMSE: 1.126 R2: 0.145
IV Estimation
In the IV estimation, we use ivf_success as an instrument for num_children:
2SLS = IV = Wald estimator
With a single instrument \(Z\), the 2SLS estimator is numerically identical to the IV estimator:
\[\hat{\beta}_{IV} = \frac{\widehat{\text{Cov}}(Y,\, Z)}{\widehat{\text{Cov}}(T,\, Z)}\]
With a binary instrument, this simplifies to the Wald estimator:
\[\hat{\beta}_{\text{Wald}} = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{T}_{Z=1} - \bar{T}_{Z=0}}\]
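A quick simulation (made-up data, not the IVF example) confirms that the covariance ratio and the Wald ratio of mean differences coincide for a binary instrument:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
Z = rng.integers(0, 2, size=n)           # binary instrument
U = rng.normal(size=n)                   # unobserved confounder
T = 0.7 * Z + U + rng.normal(size=n)     # endogenous treatment
Y = -0.15 * T + U + rng.normal(size=n)   # true effect: -0.15

# IV as a ratio of sample covariances
beta_iv = np.cov(Y, Z)[0, 1] / np.cov(T, Z)[0, 1]

# Wald estimator: ratio of mean differences across instrument groups
beta_wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (T[Z == 1].mean() - T[Z == 0].mean())

print(beta_iv, beta_wald)
```

The two numbers agree up to floating-point error, and both land near the true effect of \(-0.15\).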
In PyFixest, we can fit the IV model using the formula interface:
```python
fit_iv = pf.feols("earnings ~ 1 | num_children ~ ivf_success", data=ivf_df)

pf.etable(
    [fit_ols, fit_iv],
    labels={"earnings": "Earnings", "num_children": "Number of Children"},
    model_heads=["OLS", "IV"],
    caption="Motherhood Penalty: OLS vs IV",
)
```

| Motherhood Penalty: OLS vs IV | | |
|---|---|---|
| | Earnings | Earnings |
| | OLS | IV |
| | (1) | (2) |
| coef | | |
| Number of Children | -0.631 (0.034) | -0.222 (0.066) |
| Intercept | 10.764 (0.059) | 10.123 (0.107) |
| stats | | |
| Observations | 2,000 | 2,000 |
| R2 | 0.145 | - |
| Format of coefficient cell: Coefficient (Std. Error) | | |
The IV estimate is closer to the true effect of -0.15. The “naive” OLS estimate is downward biased: it overstates the penalty due to omitted-variable bias from career ambition.
First-Stage Diagnostics
In the section above, we argued that IVF success is a credible instrument for fertility: the unconfoundedness assumption is plausible given the quasi-random nature of IVF success. But how do we know whether the instrument is strong enough to yield reliable estimates? For this reason, applied econometricians routinely run diagnostic checks for "weak instruments", which aim to validate the relevance assumption.
PyFixest provides two diagnostics for instrument strength: a standard first-stage F-statistic, and the more robust effective F-statistic from Montiel Olea and Pflueger (2013), which remains valid under heteroskedasticity. The fitted first-stage regression itself is stored in the ._model_1st_stage attribute.
```python
# first_stage() must be called before IV_Diag() — it fits the first-stage OLS
# regression and stores the model in fit_iv._model_1st_stage, which IV_Diag() requires.
fit_iv.first_stage()
fit_iv._model_1st_stage.summary()
```

###
Estimation: OLS
Dep. var.: num_children
sample: None = all
Inference: iid
Observations: 2000
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 1.207 | 0.019 | 64.106 | 0.000 | 1.171 | 1.244 |
| ivf_success | 0.791 | 0.028 | 28.286 | 0.000 | 0.736 | 0.846 |
---
RMSE: 0.622 R2: 0.286
```python
fit_iv.IV_Diag()
print(f"First-stage F-statistic : {fit_iv._f_stat_1st_stage:.2f}")
print(f"Effective F-statistic : {fit_iv._eff_F:.2f}")
```

First-stage F-statistic : 800.12
Effective F-statistic : 793.48
Both F-statistics are well above 10, the canonical threshold for a strong instrument. More recent work suggests that for reliable inference, the effective F-statistic should be substantially higher.
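As a sanity check: with a single instrument, the first-stage F-statistic is simply the squared t-statistic on the instrument. Plugging in the t-value from the first-stage summary above (the tiny gap to the printed 800.12 comes from rounding the displayed t-value):

```python
# t-value on ivf_success from the first-stage summary table above
t_ivf = 28.286
print(round(t_ivf**2, 1))  # close to the reported first-stage F of 800.12
```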
Application 2: A/B Encouragement Design
Estimating the effect of feature adoption on revenue when users don’t comply with treatment assignment.
A tech company runs an A/B test in which half of users are encouraged to try a new feature, e.g. by showing them a banner. But not everyone who sees the banner actually tries out the feature. On top of that, some control users might discover the new feature on their own.
This setup is similar to drug trials. Suppose that medical researchers wanted to learn about the effect of taking Vitamin D on health outcomes. They could run a randomized trial in which some patients receive Vitamin D supplements for free, while control patients do not. Again, there will be imperfect compliance: some patients in the treatment group may not take their supplements, while others in the control group may take Vitamin D supplements on their own.
In such setups with imperfect compliance, one estimand of interest is the so-called intent-to-treat (ITT) effect, which compares outcomes between the encouraged and non-encouraged groups. However, the ITT estimates the effect of encouragement, not the effect of actual adoption. If we want to recover the effect of adoption itself, we need to use the random assignment of encouragement as an instrument for actual adoption.
All three IV assumptions are credible for this application. Relevance is satisfied because encouragement has a causal effect on adoption. The exclusion restriction is plausible because encouragement only affects revenue through adoption — there are no other channels through which seeing the banner could affect revenue. Instrumental unconfoundedness holds because encouragement is randomly assigned, so there are no unobserved confounders that jointly affect encouragement and revenue.
The only remaining assumption to discuss is monotonicity — the assumption that there are no “defiers” who would do the opposite of their encouragement assignment. In this context, monotonicity means that there are no users who would adopt the feature if not encouraged but would fail to adopt if encouraged.
If all assumptions hold, the IV estimate recovers the Local Average Treatment Effect (LATE) of adoption on revenue for the "compliers": users whose adoption decision is driven by encouragement. The LATE is larger in magnitude than the ITT because the ITT is diluted by non-compliers; dividing the ITT by the complier share (the first stage) scales it back up.
Synthetic Data
```python
ab_df = pf.get_encouragement_data()
ab_df.head()
```

|   | revenue | assigned_treatment | adopted_feature | user_type |
|---|---|---|---|---|
| 0 | 6.791617 | 1 | 1 | 2 |
| 1 | 7.995041 | 1 | 1 | 2 |
| 2 | 8.204051 | 0 | 1 | 2 |
| 3 | 4.264271 | 0 | 0 | 1 |
| 4 | 7.980713 | 1 | 1 | 0 |
We first create a synthetic data set with \(N = 4{,}000\) users. The true LATE of adopted_feature on revenue is \(2.0\) — this is what the IV estimate should recover.
Three Estimands
We estimate three parameters of interest: the reduced form (ITT), the first stage, and the IV/LATE:
```python
# Intent-to-treat (reduced form)
fit_itt = pf.feols("revenue ~ assigned_treatment | user_type", data=ab_df)

# First stage
fit_fs = pf.feols("adopted_feature ~ assigned_treatment | user_type", data=ab_df)

# IV / LATE
fit_late = pf.feols("revenue ~ 1 | user_type | adopted_feature ~ assigned_treatment", data=ab_df)
```

Compare All Three
The table below places the ITT, first stage, and IV estimates side by side.
```python
pf.etable(
    [fit_itt, fit_fs, fit_late],
    labels={
        "revenue": "Revenue",
        "adopted_feature": "Adopted Feature",
        "assigned_treatment": "Assigned Treatment",
    },
    felabels={"user_type": "User Type FE"},
    model_heads=["ITT", "First Stage", "LATE"],
    caption="A/B Encouragement Design: ITT, First Stage, and LATE",
)
```

| A/B Encouragement Design: ITT, First Stage, and LATE | | | |
|---|---|---|---|
| | Revenue | Adopted Feature | Revenue |
| | ITT | First Stage | LATE |
| | (1) | (2) | (3) |
| coef | | | |
| Assigned Treatment | 1.110 (0.041) | 0.554 (0.013) | |
| Adopted Feature | | | 2.004 (0.057) |
| fe | | | |
| User Type FE | x | x | x |
| stats | | | |
| Observations | 4,000 | 4,000 | 4,000 |
| R2 | 0.297 | 0.313 | - |
| Format of coefficient cell: Coefficient (Std. Error) | | | |
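With a single instrument and identical controls across equations, the 2SLS estimate equals the reduced form divided by the first stage. A back-of-the-envelope check with the point estimates from the table:

```python
itt = 1.110          # effect of encouragement on revenue (reduced form)
first_stage = 0.554  # effect of encouragement on adoption (complier share)
print(round(itt / first_stage, 3))  # matches the LATE estimate of 2.004
```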
The coefficient plot compares the ITT effect of encouragement on revenue with the IV (LATE) estimate of actual feature adoption. Because not all encouraged users adopt, the LATE is larger than the ITT — scaled up by the complier share.
```python
pf.coefplot([fit_itt, fit_late], keep="assigned_treatment|adopted_feature")
```

IV Diagnostics
Since the instrument (assigned_treatment) is fully randomized, the first stage is expected to be very strong and both statistics should comfortably exceed 10.
```python
# first_stage() must be called before IV_Diag()
fit_late.first_stage()
fit_late.IV_Diag()
print(f"First-stage F-statistic : {fit_late._f_stat_1st_stage:.2f}")
print(f"Effective F-statistic : {fit_late._eff_F:.2f}")
```

First-stage F-statistic : 1820.73
Effective F-statistic : 1832.97
IV Diagnostics in PyFixest
Weak instruments - instruments that are only loosely correlated with the endogenous variable - lead to biased and unreliable IV estimates. PyFixest provides two key diagnostic tools to detect this problem.
The First-Stage F-Statistic
The .first_stage() method re-estimates the first-stage regression and computes the first-stage F-statistic, which tests \(H_0\colon \pi = 0\) (all instrument coefficients are jointly zero). The classic rule of thumb is \(F > 10\) for iid errors (Stock and Yogo, 2005).
```python
# Re-use the motherhood penalty IV model.
# Note: IV_Diag() switches vcov to hetero internally for the effective F computation.
# Reset to iid here to get the iid-based first-stage F-statistic.
fit_iv.vcov("iid")
fit_iv.first_stage()

# The F-stat is stored as an attribute after calling first_stage()
print(f"First-stage F-statistic: {fit_iv._f_stat_1st_stage:.1f}")
print(f"First-stage p-value: {fit_iv._p_value_1st_stage:.4f}")
```

First-stage F-statistic: 800.1
First-stage p-value: 0.0000
The Effective F-Statistic
The standard F-statistic can be misleading when there are multiple endogenous regressors or when errors are non-homoskedastic. The effective F-statistic (Montiel Olea and Pflueger (2013)) is a more robust measure of instrument strength that remains valid under heteroskedasticity:
\[ F_{\text{eff}} = \frac{\hat{\pi}' Q_{ZZ} \hat{\pi}}{\text{tr}(\hat{\Sigma} \, Q_{ZZ})} \]
where \(\hat{\pi}\) are the first-stage coefficients on the excluded instruments, \(Q_{ZZ} = Z'Z\), and \(\hat{\Sigma}\) is the robust variance-covariance matrix of \(\hat{\pi}\).
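For a single excluded instrument, \(Q_{ZZ}\) and \(\hat{\Sigma}\) are scalars, so the formula collapses to the robust Wald statistic \(\hat{\pi}^2 / \widehat{\text{Var}}_{\text{robust}}(\hat{\pi})\). The sketch below illustrates this special case with simulated heteroskedastic data and an HC0 variance; it is an illustration of the formula, not PyFixest's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
Z = rng.normal(size=n)
# Heteroskedastic first-stage errors: noise scale grows with |Z|
T = 0.5 * Z + rng.normal(size=n) * (1 + 0.5 * np.abs(Z))

# First stage: regress T on [1, Z]; pi_hat is the coefficient on Z
X = np.column_stack([np.ones(n), Z])
coefs, *_ = np.linalg.lstsq(X, T, rcond=None)
resid = T - X @ coefs

# HC0 robust variance of the first-stage coefficients (sandwich estimator)
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * resid[:, None] ** 2)
V = XtX_inv @ meat @ XtX_inv

# Single instrument: Q_ZZ cancels and F_eff reduces to pi_hat^2 / Var_robust(pi_hat)
eff_F = coefs[1] ** 2 / V[1, 1]
print(eff_F)
```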
The .IV_Diag() method computes both the standard F-statistic and the effective F-statistic in one call:
```python
fit_iv.IV_Diag()
print(f"Standard F-statistic: {fit_iv._f_stat_1st_stage:.1f}")
print(f"Effective F-statistic: {fit_iv._eff_F:.1f}")
```

Standard F-statistic: 800.1
Effective F-statistic: 793.5
- Standard Errors & Inference — learn about robust, cluster-robust, and bootstrap inference.
- Regression Tables — customize publication-ready output tables.
- Feiv API Reference — full documentation of the IV estimator class.