
In the age of “big data,” it is increasingly common for analysts to have access to two types of data: observational data (large datasets where treatments are not randomly assigned, but many outcomes are observed) and experimental data (smaller datasets where treatments are randomly assigned, but only a subset of outcomes is observed). Both types of data have strengths and limitations: observational data allow us to study a wide range of outcomes, but the inferences we draw from them can be biased by selection in treatment assignment, while experimental data make it straightforward to identify causal effects but often lack information on outcomes of interest.
These problems are particularly important in the study of economic mobility, which typically focuses on analyzing the impacts of interventions on outcomes observed years later. For example, there is much interest in identifying the causal effects of class sizes and teacher quality in elementary school on high school graduation rates. Observational data with information on class sizes, teachers, and graduation rates are now widely available from school districts’ administrative records. But causal inference using these data is challenging because of selection biases arising from non-random assignment to classrooms. Causal inference is more straightforward in experimental data – such as the widely studied Project STAR class size experiment – but experimental datasets often do not contain information on outcomes such as graduation rates because they are observed with long delays.
In this study, we develop a new solution to this problem that we term the Experimental Selection Correction (ESC) estimator. The estimator uses the difference between observed outcomes and predicted outcomes (based on experimental data) to correct for biases in observational data. The method relies on a new assumption called latent unconfoundedness, which requires that the same unobserved factors affect both primary and secondary outcomes. Importantly, this assumption is strictly weaker than the assumptions underlying commonly used surrogate estimators that we have applied in our prior work (Athey, Chetty, Imbens, and Kang 2025).
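To make the idea of correcting with "observed minus predicted" outcomes concrete, the following is a minimal simulated sketch in Python. It is not the authors' implementation. It assumes a toy linear model with a single latent factor that drives both the secondary outcome (for example, a test score) and the primary outcome (for example, graduation), constant treatment effects, and an experimental sample that records only the secondary outcome. The function name `simulate_esc` and all parameter values (`tau_s`, `tau_y`, `beta`, the sample sizes) are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_esc(seed=0, n_obs=50_000, n_exp=20_000):
    """Toy linear illustration: naive OLS vs. a selection-corrected regression.

    All parameter values are made up for illustration.
    Returns (naive_estimate, corrected_estimate) of the effect on Y.
    """
    rng = np.random.default_rng(seed)
    tau_s, tau_y, beta = 0.2, 0.1, 0.5  # hypothetical true effects and loading

    # Experimental sample: treatment is randomized, and only the secondary
    # outcome S is recorded.
    a_exp = rng.normal(size=n_exp)          # latent factor (never observed)
    w_exp = rng.integers(0, 2, size=n_exp)  # random assignment, 0 or 1
    s_exp = a_exp + tau_s * w_exp
    # Difference in means gives the experimental effect of W on S.
    tau_s_hat = s_exp[w_exp == 1].mean() - s_exp[w_exp == 0].mean()

    # Observational sample: units with a low latent factor are more likely
    # to be treated, so treatment is confounded. Both S and Y are recorded.
    a_obs = rng.normal(size=n_obs)
    w_obs = (rng.normal(size=n_obs) - a_obs > 0).astype(float)
    s_obs = a_obs + tau_s * w_obs
    y_obs = tau_y * w_obs + beta * a_obs + rng.normal(size=n_obs)

    ones = np.ones(n_obs)

    # Naive OLS of Y on W: contaminated by selection on the latent factor.
    x_naive = np.column_stack([ones, w_obs])
    naive = np.linalg.lstsq(x_naive, y_obs, rcond=None)[0][1]

    # Correction term: observed S minus the treatment effect on S predicted
    # from the experiment. In this toy model it proxies the latent factor.
    resid = s_obs - tau_s_hat * w_obs
    x_corr = np.column_stack([ones, w_obs, resid])
    corrected = np.linalg.lstsq(x_corr, y_obs, rcond=None)[0][1]

    return naive, corrected


if __name__ == "__main__":
    naive, corrected = simulate_esc()
    print(f"naive OLS estimate:  {naive:.3f}")
    print(f"corrected estimate:  {corrected:.3f} (true effect in the simulation: 0.1)")
```

In this simulation the naive regression picks up the negative selection into treatment, while adding the experimentally derived correction term as a regressor brings the estimate close to the simulated true effect. The exact recovery relies on the secondary outcome being a noiseless function of the latent factor, a simplification made here for clarity.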
We apply this ESC estimator to identify the effect of third-grade class size on students’ outcomes. Estimated impacts on test scores using OLS regressions in observational school district data have the opposite sign from estimates based on the Tennessee STAR experiment. In contrast, selection-corrected estimates in the observational data replicate the experimental estimates. Our estimator reveals that reducing class sizes by 25% increases high school graduation rates by 0.7 percentage points. Controlling for observables does not change conventional regression estimates, demonstrating that experimental selection correction can remove biases that cannot be addressed with standard controls.