# A deep dive into the endogeneity problem

We discussed the endogeneity problem in earlier blog posts (see What is instrumental variable? and Endogeneity and use of instrumental variable). In those two posts, I simply listed the causes of endogeneity in the regression model and introduced instrumental variables (IV) as a remedy. In this post, I will demonstrate why endogeneity is actually a problem.
Let's recall the causes of endogeneity in our model: first, simultaneous causality bias; second, omitted variable bias; and third, the existence of covariates.

Let's consider a simple regression model.

$$Y = X\beta + e \quad \text{(i)}$$

## Existence of simultaneity

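If our model suffers from simultaneity, then a variable on the right-hand side (RHS) of one equation must also appear on the left-hand side (LHS) of another equation, that is,

$$Y = X\beta + e \quad \text{(i)}$$

$$X = Y\alpha + Z\Gamma \quad \text{(ii)}$$

Substituting equation (i) into equation (ii) yields

$$X = (X\beta + e)\alpha + Z\Gamma$$

$$X = X\beta\alpha + Z\Gamma + e\alpha$$

$$X - X\beta\alpha = Z\Gamma + e\alpha$$

$$X(I - \beta\alpha) = Z\Gamma + e\alpha$$

Provided that $I - \beta\alpha$ is invertible, we can post-multiply both sides by its inverse:

$$X = (Z\Gamma + e\alpha)(I - \beta\alpha)^{-1} \quad \text{(iii)}$$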

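Equation (iii) shows that $X$ is itself a function of $e$, so $X$ is correlated with the error term whenever $\alpha \ne 0$. This means that $E[X^Te] \ne 0$. Hence, the exogeneity assumption that OLS relies on, $E[X^Te] = 0$, is violated.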
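To make the correlation in equation (iii) concrete, here is a minimal simulation sketch for the scalar case. The parameter values (`beta = 0.5`, `alpha = 0.4`, `gamma = 1.0`), the standard normal draws for Z and e, the sample size, and the variable names are all assumptions chosen for illustration; they do not come from the earlier posts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical structural parameters (scalar case), chosen for illustration only.
beta, alpha, gamma = 0.5, 0.4, 1.0

z = rng.normal(size=n)  # exogenous variable Z
e = rng.normal(size=n)  # structural error of equation (i), variance 1

# Reduced form for X, the scalar version of equation (iii).
x = (gamma * z + alpha * e) / (1.0 - beta * alpha)
# Equation (i).
y = beta * x + e

# Sample analogue of E[X e]; x and e are both mean zero by construction.
cov_xe = np.mean(x * e)
# OLS slope without an intercept (all variables are mean zero).
beta_ols = (x @ y) / (x @ x)

# Population counterparts implied by the assumed parameters.
cov_theory = alpha / (1.0 - beta * alpha)
var_theory = (gamma**2 + alpha**2) / (1.0 - beta * alpha) ** 2
beta_plim = beta + cov_theory / var_theory

print(f"sample E[x e]: {cov_xe:.3f}  (theory: {cov_theory:.3f})")
print(f"OLS slope:     {beta_ols:.3f}  (true beta: {beta}, implied limit: {beta_plim:.3f})")
```

Under these assumptions the OLS slope settles well above the true `beta`, and the gap does not shrink as `n` grows, because it is driven by the nonzero covariance between X and e rather than by sampling noise.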

## Omitted variable bias

Omitted variable bias occurs when we fail to include a relevant variable that is correlated with one or more of the independent variables in our model. We specified the regression model as $Y = X\beta + e$, equation (i). However, suppose the true model is $Y = X\beta + Z\Gamma + e$, equation (iv), and we nevertheless estimate equation (i).
We know that

$$\hat\beta_\text{OLS} = (X^TX)^{-1}X^TY$$

Substituting the true model (iv) for $Y$ and taking the expectation of the OLS estimator, with $X$ and $Z$ treated as fixed (equivalently, conditioning on them), gives:

$$E[\hat\beta_\text{OLS}] = E\left[(X^TX)^{-1}X^TY\right]$$

$$E[\hat\beta_\text{OLS}] = E\left[(X^TX)^{-1}X^T(X\beta + Z\Gamma + e)\right]$$

$$E[\hat\beta_\text{OLS}] = E\left[(X^TX)^{-1}(X^TX)\beta + (X^TX)^{-1}X^T(Z\Gamma + e)\right]$$

$$E[\hat\beta_\text{OLS}] = E\left[\beta + (X^TX)^{-1}X^T(Z\Gamma + e)\right]$$

$$E[\hat\beta_\text{OLS}] = E\left[\beta + (X^TX)^{-1}X^TZ\Gamma + (X^TX)^{-1}X^Te\right]$$

$$E[\hat\beta_\text{OLS}] = \beta + (X^TX)^{-1}X^TZ\Gamma + (X^TX)^{-1}E(X^Te)$$

Since $E(X^Te) = 0$,

$$E[\hat\beta_\text{OLS}] = \beta + (X^TX)^{-1}X^TZ\Gamma \quad \text{(v)}$$
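Equation (v) provides an interesting conclusion. Even though the error term $e$ is exogenous to $X$, $\hat\beta_\text{OLS}$ is biased whenever $X^TZ\Gamma \ne 0$, that is, whenever the omitted $Z$ is correlated with $X$ and $\Gamma \ne 0$. If $Z$ is uncorrelated with $X$ (and the variables have mean zero), then $E[X^TZ] = 0$ and the bias term vanishes. Thus, only correlated omitted variables are a problem; we need not worry about omitted variables that are uncorrelated with $X$. If $X$ and $Z$ are correlated, the expectation of the OLS estimator is the sum of two terms: (1) the true coefficient on $X$ and (2) $(X^TX)^{-1}X^TZ\Gamma$, the coefficient from regressing $Z$ on $X$ multiplied by the effect $\Gamma$ of $Z$ on $Y$.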
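The conclusion from equation (v) can be checked numerically. The sketch below is a minimal illustration under assumed values: `beta = 2.0`, `gamma = 1.5`, a correlated omitted variable built as `0.8 * x + noise`, and an independent one drawn separately. These numbers and the helper name `short_regression` are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical true coefficients of model (iv), for illustration only.
beta, gamma = 2.0, 1.5

x = rng.normal(size=n)
e = rng.normal(size=n)                 # error term, independent of x
z_corr = 0.8 * x + rng.normal(size=n)  # omitted variable correlated with x
z_indep = rng.normal(size=n)           # omitted variable independent of x


def short_regression(x, z):
    """Regress y on x alone (no intercept; all variables are mean zero).

    Returns the short-regression slope and the sample bias term
    (X'X)^{-1} X'Z Gamma from equation (v).
    """
    y = beta * x + gamma * z + e       # true model (iv)
    b_short = (x @ y) / (x @ x)        # OLS that omits z
    bias_term = (x @ z) / (x @ x) * gamma
    return b_short, bias_term


b_corr, bias_corr = short_regression(x, z_corr)
b_indep, bias_indep = short_regression(x, z_indep)

print(f"correlated Z:  slope = {b_corr:.3f}, beta + bias term = {beta + bias_corr:.3f}")
print(f"independent Z: slope = {b_indep:.3f}, beta + bias term = {beta + bias_indep:.3f}")
```

With the correlated omitted variable, the slope should land near `beta + 0.8 * gamma` rather than `beta`; with the independent one it should stay close to `beta`, matching the claim that only correlated omitted variables bias the estimate.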

To be continued...