Correct for nonrandom sampling and self-selection when the dependent variable is unobserved for part of the population.
Quantify how strong an omitted confounder must be to invalidate causal inference — before resorting to IV estimation.
Address omitted variable bias via two-stage residual inclusion — suited to nonlinear models and heteroskedastic settings.
Not sure which type of endogeneity affects your research model? Select the scenario that best describes your concern to receive a tailored diagnosis and find the right statistical technique.
James J. Heckman's (1979) two-stage estimation identifies and mitigates selection bias, which arises when nonrandom sampling or self-selection causes the dependent variable to be unobserved for part of the population.
Selection bias arises when a rule other than simple random sampling is used to select observations into a study, causing the dependent variable to remain unobserved for a portion of the population. Two mechanisms drive this: sample selection, where researchers selectively gather data resulting in truncated samples, and self-selection, where subjects choose to participate based on unobservable characteristics.
In entrepreneurship and innovation research, for example, only certain firms choose to go public or report R&D expenditures — decisions influenced by unobserved strategic factors that also affect outcomes.
In the first stage (selection equation), a probit regression models the selection process using all available observations, a binary selection variable, and at least one instrument satisfying the exclusion restriction. The instrument must affect the selection variable but not directly impact the dependent variable.
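To make the first stage concrete, here is a minimal sketch in R; the data frame d and all variable names (observed as the selection indicator, wage, educ, married, and the instrument kids) are hypothetical:

```r
## Stage 1 (selection equation): probit fitted on ALL observations.
## 'kids' is the instrument: it shifts selection into the sample but
## is excluded from the wage equation in stage 2.
probit <- glm(observed ~ educ + married + kids,
              family = binomial(link = "probit"), data = d)

## Inverse Mills Ratio from the first-stage linear predictor,
## computed for every observation and carried into stage 2.
xb    <- predict(probit, type = "link")
d$imr <- dnorm(xb) / pnorm(xb)
```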
In the second stage (outcome equation), the Inverse Mills Ratio — derived from the first-stage predictions — is included as an additional regressor to capture unobservable characteristics that cause endogeneity. Standard errors must be bootstrapped to account for the variability introduced by this two-stage procedure.
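Continuing the sketch, the second stage adds the IMR to the outcome equation and bootstraps both stages jointly:

```r
## Stage 2 (outcome equation): estimated on the selected observations
## only, with the IMR as an additional regressor.
outcome <- lm(wage ~ educ + married + imr, data = d,
              subset = observed == 1)
summary(outcome)   # a significant IMR coefficient signals selection bias

## Bootstrap both stages together so the standard errors reflect
## the sampling variability of the estimated IMR.
library(boot)
two_step <- function(data, idx) {
  b     <- data[idx, ]
  p     <- glm(observed ~ educ + married + kids,
               family = binomial(link = "probit"), data = b)
  xb    <- predict(p, type = "link")
  b$imr <- dnorm(xb) / pnorm(xb)
  coef(lm(wage ~ educ + married + imr, data = b, subset = observed == 1))
}
set.seed(42)
boot(d, two_step, R = 999)
```

The sampleSelection package in R (heckit()) offers a packaged implementation of the same two-step procedure.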
A key requirement is that the error terms of the selection and outcome equations follow a bivariate normal distribution, which can be checked with a RESET test. Heckman estimation is therefore best suited to linear outcome models and is generally incompatible with nonlinear specifications such as count data or censored regressions.
Our discussion paper, available on SSRN, provides a comprehensive guide including a literature review of 193 studies, a step-by-step flowchart, a hypothetical research model with Stata and R commands, and practical guidance on common econometric pitfalls such as weak instruments and multicollinearity.
The Impact Threshold of a Confounding Variable (ITCV), pioneered by Kenneth A. Frank (2000), quantifies how strong an omitted confounder would need to be to invalidate a study's causal inference, making it a practical first step before resorting to instrumental variable estimation.
An omitted variable is one that belongs in an estimation model but is left out; because it influences the dependent variable and correlates with the included explanatory variables, its effect is absorbed into the error term, rendering those explanatory variables endogenous. The resulting endogeneity distorts measured effects and undermines causal inference.
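A short simulation makes the mechanism concrete (a sketch with made-up coefficients; the true effect of x on y is 1):

```r
## z confounds x and y; leaving it out biases the estimate for x.
set.seed(1)
n <- 1e4
z <- rnorm(n)                     # unobserved confounder
x <- 0.8 * z + rnorm(n)           # x is partly driven by z
y <- 1.0 * x + 1.5 * z + rnorm(n)

coef(lm(y ~ x))       # omitting z: estimate near 1.7, not 1.0
coef(lm(y ~ x + z))   # including z: recovers the true effect, ~1.0
```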
While instrumental variable (IV) techniques are commonly used to address this, their correct application is complex — requiring valid instruments that are both relevant and exogenous — and errors can introduce more bias than they correct.
Rather than correcting for bias directly, the ITCV assesses how strong a confounding variable would need to be to invalidate a study's causal inference. It does this by calculating partial correlations between potential confounders and both the dependent and primary explanatory variables, then determining the threshold at which a significant coefficient would become insignificant.
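For intuition, the threshold can be computed by hand in the simplest bivariate case (a sketch following Frank 2000, assuming a positive estimate and no further covariates; d, y, and x are hypothetical):

```r
r_xy   <- cor(d$y, d$x)                 # observed correlation
n      <- nrow(d)
df     <- n - 2                         # bivariate case
t_crit <- qt(0.975, df)                 # two-sided 5% critical value
r_crit <- t_crit / sqrt(t_crit^2 + df)  # correlation just significant

## Minimum impact (the product r_x.cv * r_y.cv) a confounder needs
## in order to push the estimate below significance:
itcv <- (r_xy - r_crit) / (1 - abs(r_crit))
```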
The ITCV is then compared against the impact values of the control variables already in the model. If the ITCV exceeds all control variable impacts, any confounder would need to have a stronger influence than every established control to bias the results — suggesting considerable robustness.
For nonlinear models, the Robustness of Inference to Replacement (RIR) extends this framework by quantifying the percentage of observations that would need to be affected by a confounder to overturn the inference.
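In practice, the konfound package in R (by Frank and colleagues) automates these calculations; a sketch with a hypothetical fitted model:

```r
# install.packages("konfound")
library(konfound)

m <- lm(y ~ x + control1 + control2, data = d)  # hypothetical model
konfound(m, x)   # sensitivity analysis for the focal predictor x

## pkonfound() runs the same analysis from published numbers alone:
pkonfound(est_eff = 2, std_err = 0.4, n_obs = 100, n_covariates = 3)
```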
Published in Industrial Marketing Management (2024), our study provides the first comprehensive guide to the ITCV for empirical marketing research, including a direct comparison with IV estimation techniques, step-by-step Stata and R implementation, a hypothetical research model demonstration, and reporting guidelines with graphical illustrations.
The Control Function Approach addresses omitted variable bias through two-stage residual inclusion (2SRI) — especially suited for nonlinear models and settings with heteroskedasticity where 2SLS falls short.
When an ITCV analysis indicates that omitted variable bias is a genuine concern, researchers typically turn to instrumental variable methods. The most common is two-stage least squares (2SLS), but it yields consistent estimates only in linear models.
In nonlinear settings — such as probit, logit, Tobit, or Poisson regressions — 2SLS yields inconsistent estimates because predictor substitution distorts the model's functional form. Yet a cross-disciplinary review of 328 studies shows that 32% of IS studies inappropriately applied 2SLS to nonlinear models, compared to only 2% in marketing.
Rooted in the work of Heckman and Robb (1985) and formalized by Petrin and Train (2010) and Wooldridge (2015), the approach works through two-stage residual inclusion. In the first stage, the endogenous variable is regressed on all exogenous variables and a valid instrument. The residuals capture the endogenous variation.
In the second stage, these residuals are included as an additional regressor alongside the original variables, with bootstrapped standard errors to account for estimation uncertainty.
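A minimal 2SRI sketch in R for a probit outcome; all names are hypothetical (y a binary outcome, x the endogenous regressor, w an exogenous control, z the instrument):

```r
## Stage 1: endogenous regressor on all exogenous variables plus the
## instrument; the residuals capture the endogenous variation.
stage1 <- lm(x ~ w + z, data = d)
d$res  <- residuals(stage1)

## Stage 2: the original nonlinear model, augmented with the residuals.
stage2 <- glm(y ~ x + w + res,
              family = binomial(link = "probit"), data = d)
summary(stage2)   # significance of 'res' doubles as an endogeneity test

## Bootstrap both stages so the standard errors reflect the procedure.
library(boot)
two_stage <- function(data, idx) {
  b     <- data[idx, ]
  b$res <- residuals(lm(x ~ w + z, data = b))
  coef(glm(y ~ x + w + res,
           family = binomial(link = "probit"), data = b))
}
set.seed(42)
boot(d, two_stage, R = 999)
```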
The approach offers three key advantages over 2SLS: it remains consistent in nonlinear models, it provides higher estimation efficiency under heteroskedasticity, and the significance of the residual term serves as a direct test for endogeneity.
Our working paper (forthcoming) provides a comprehensive guide to the control function approach for empirical information systems research, including a comparative analysis of 328 studies across IS, marketing, and management journals, a simulated panel dataset, step-by-step implementation in Stata and R, and guidance on extensions for interaction terms, polynomial specifications, and multiple endogenous variables.
Professor
School of Business and Economics
University of Münster
Germany
Assistant Professor
School of Business and Economics
University of Münster
Germany