Correct for nonrandom sampling and self-selection when the dependent variable is unobserved for part of the population.
Quantify how strong an omitted confounder must be to invalidate causal inference — before resorting to IV estimation.
Address omitted variable bias via two-stage residual inclusion — suited to nonlinear models and heteroskedastic settings.
Not sure which type of endogeneity affects your research model? Select the scenario that best describes your concern to receive a tailored diagnosis and find the right statistical technique.
James J. Heckman's (1979) two-stage estimation identifies and mitigates selection bias, which arises when nonrandom sampling or self-selection causes the dependent variable to be unobserved for part of the population.
Selection bias arises when a rule other than simple random sampling is used to select observations into a study, causing the dependent variable to remain unobserved for a portion of the population. Two mechanisms drive this: sample selection, where researchers selectively gather data resulting in truncated samples, and self-selection, where subjects choose to participate based on unobservable characteristics.
In entrepreneurship and innovation research, for example, only certain firms choose to go public or report R&D expenditures — decisions influenced by unobserved strategic factors that also affect outcomes.
In the first stage (selection equation), a probit regression models the selection process using all available observations, a binary selection variable, and at least one instrument satisfying the exclusion restriction. The instrument must affect the selection variable but not directly impact the dependent variable.
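To make the first stage concrete, here is a minimal sketch in R; the data frame d and all variable names (observed as the selection indicator, wage, educ, married, and the instrument kids) are hypothetical:

```r
## Stage 1 (selection equation): probit fitted on ALL observations.
## 'kids' is the instrument: it shifts selection into the sample but
## is excluded from the wage equation in stage 2.
probit <- glm(observed ~ educ + married + kids,
              family = binomial(link = "probit"), data = d)

## Inverse Mills Ratio from the first-stage linear predictor,
## computed for every observation and carried into stage 2.
xb    <- predict(probit, type = "link")
d$imr <- dnorm(xb) / pnorm(xb)
```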
In the second stage (outcome equation), the Inverse Mills Ratio — derived from the first-stage predictions — is included as an additional regressor to capture unobservable characteristics that cause endogeneity. Standard errors must be bootstrapped to account for the variability introduced by this two-stage procedure.
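Continuing the sketch, the second stage adds the IMR to the outcome equation and bootstraps both stages jointly:

```r
## Stage 2 (outcome equation): estimated on the selected observations
## only, with the IMR as an additional regressor.
outcome <- lm(wage ~ educ + married + imr, data = d,
              subset = observed == 1)
summary(outcome)   # a significant IMR coefficient signals selection bias

## Bootstrap both stages together so the standard errors reflect
## the sampling variability of the estimated IMR.
library(boot)
two_step <- function(data, idx) {
  b     <- data[idx, ]
  p     <- glm(observed ~ educ + married + kids,
               family = binomial(link = "probit"), data = b)
  xb    <- predict(p, type = "link")
  b$imr <- dnorm(xb) / pnorm(xb)
  coef(lm(wage ~ educ + married + imr, data = b, subset = observed == 1))
}
set.seed(42)
boot(d, two_step, R = 999)
```

The sampleSelection package in R (heckit()) offers a packaged implementation of the same two-step procedure.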
A key requirement is that the error terms of the selection and outcome equations follow a bivariate normal distribution, which can be checked with a RESET test. Heckman estimation is therefore best suited to linear outcome models and is generally incompatible with nonlinear specifications such as count data or censored regressions.
Our discussion paper, available on SSRN, provides a comprehensive guide including a literature review of 193 studies, a step-by-step flowchart, a hypothetical research model with Stata and R commands, and practical guidance on common econometric pitfalls such as weak instruments and multicollinearity.
The Impact Threshold of a Confounding Variable (ITCV), pioneered by Kenneth A. Frank (2000), quantifies how strong an omitted confounder would need to be to invalidate a study's causal inference, making it a practical first step before resorting to instrumental variable estimation.
An omitted variable is one that belongs in an estimation model but is left out; because it influences the dependent variable and correlates with the included explanatory variables, its effect is absorbed into the error term, rendering those explanatory variables endogenous. The resulting endogeneity distorts measured effects and undermines causal inference.
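A short simulation makes the mechanism concrete (a sketch with made-up coefficients; the true effect of x on y is 1):

```r
## z confounds x and y; leaving it out biases the estimate for x.
set.seed(1)
n <- 1e4
z <- rnorm(n)                     # unobserved confounder
x <- 0.8 * z + rnorm(n)           # x is partly driven by z
y <- 1.0 * x + 1.5 * z + rnorm(n)

coef(lm(y ~ x))       # omitting z: estimate near 1.7, not 1.0
coef(lm(y ~ x + z))   # including z: recovers the true effect, ~1.0
```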
While instrumental variable (IV) techniques are commonly used to address this, their correct application is complex — requiring valid instruments that are both relevant and exogenous — and errors can introduce more bias than they correct.
Rather than correcting for bias directly, the ITCV assesses how strong a confounding variable would need to be to invalidate a study's causal inference. It does this by calculating partial correlations between potential confounders and both the dependent and primary explanatory variables, then determining the threshold at which a significant coefficient would become insignificant.
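For intuition, the threshold can be computed by hand in the simplest bivariate case (a sketch following Frank 2000, assuming a positive estimate and no further covariates; d, y, and x are hypothetical):

```r
r_xy   <- cor(d$y, d$x)                 # observed correlation
n      <- nrow(d)
df     <- n - 2                         # bivariate case
t_crit <- qt(0.975, df)                 # two-sided 5% critical value
r_crit <- t_crit / sqrt(t_crit^2 + df)  # correlation just significant

## Minimum impact (the product r_x.cv * r_y.cv) a confounder needs
## in order to push the estimate below significance:
itcv <- (r_xy - r_crit) / (1 - abs(r_crit))
```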
The ITCV is then compared against the impact values of the control variables already in the model. If the ITCV exceeds all control variable impacts, any confounder would need to have a stronger influence than every established control to bias the results — suggesting considerable robustness.
For nonlinear models, the Robustness of Inference to Replacement (RIR) extends this framework by quantifying the percentage of observations that would need to be affected by a confounder to overturn the inference.
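In practice, the konfound package in R (by Frank and colleagues) automates these calculations; a sketch with a hypothetical fitted model:

```r
# install.packages("konfound")
library(konfound)

m <- lm(y ~ x + control1 + control2, data = d)  # hypothetical model
konfound(m, x)   # sensitivity analysis for the focal predictor x

## pkonfound() runs the same analysis from published numbers alone:
pkonfound(est_eff = 2, std_err = 0.4, n_obs = 100, n_covariates = 3)
```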
Published in Industrial Marketing Management (2024), our study provides the first comprehensive guide to the ITCV for empirical marketing research, including a direct comparison with IV estimation techniques, step-by-step Stata and R implementation, a hypothetical research model demonstration, and reporting guidelines with graphical illustrations.
The Control Function Approach addresses omitted variable bias through two-stage residual inclusion (2SRI) — especially suited for nonlinear models and settings with heteroskedasticity where 2SLS falls short.
When an ITCV analysis indicates that omitted variable bias is a genuine concern, researchers typically turn to instrumental variable methods. The most common is two-stage least squares (2SLS), but it yields consistent estimates only in linear models.
In nonlinear settings — such as probit, logit, Tobit, or Poisson regressions — 2SLS yields inconsistent estimates because predictor substitution distorts the model's functional form. Yet a cross-disciplinary review of 328 studies shows that 32% of IS studies inappropriately applied 2SLS to nonlinear models, compared to only 2% in marketing.
Rooted in the work of Heckman and Robb (1985) and formalized by Petrin and Train (2010) and Wooldridge (2015), the approach works through two-stage residual inclusion. In the first stage, the endogenous variable is regressed on all exogenous variables and a valid instrument. The residuals capture the endogenous variation.
In the second stage, these residuals are included as an additional regressor alongside the original variables, with bootstrapped standard errors to account for estimation uncertainty.
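A minimal 2SRI sketch in R for a probit outcome; all names are hypothetical (y a binary outcome, x the endogenous regressor, w an exogenous control, z the instrument):

```r
## Stage 1: endogenous regressor on all exogenous variables plus the
## instrument; the residuals capture the endogenous variation.
stage1 <- lm(x ~ w + z, data = d)
d$res  <- residuals(stage1)

## Stage 2: the original nonlinear model, augmented with the residuals.
stage2 <- glm(y ~ x + w + res,
              family = binomial(link = "probit"), data = d)
summary(stage2)   # significance of 'res' doubles as an endogeneity test

## Bootstrap both stages so the standard errors reflect the procedure.
library(boot)
two_stage <- function(data, idx) {
  b     <- data[idx, ]
  b$res <- residuals(lm(x ~ w + z, data = b))
  coef(glm(y ~ x + w + res,
           family = binomial(link = "probit"), data = b))
}
set.seed(42)
boot(d, two_stage, R = 999)
```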
The approach offers three key advantages over 2SLS: it remains consistent in nonlinear models, it provides higher estimation efficiency under heteroskedasticity, and the significance of the residual term serves as a direct test for endogeneity.
Our working paper (forthcoming) provides a comprehensive guide to the control function approach for empirical information systems research, including a comparative analysis of 328 studies across IS, marketing, and management journals, a simulated panel dataset, step-by-step implementation in Stata and R, and guidance on extensions for interaction terms, polynomial specifications, and multiple endogenous variables.
Professor
School of Business and Economics
University of Münster
Germany
Assistant Professor
School of Business and Economics
University of Münster
Germany