# Endogeneity and Use of Instrumental Variable

What is an endogeneity issue?

The endogeneity issue arises when one of the explanatory variables is correlated with the error term. This typically happens when some unobserved effect, which is difficult to measure, affects both an explanatory variable and the outcome (see "A deep dive into endogeneity problem" and "What is instrumental variable?").

For example

$Marks = a + b \times \text{Class_attendance}+\epsilon_1$

Here, the estimated coefficient b is biased. Class attendance is influenced by observed factors (such as distance to school) and by unobserved factors (such as motivation to read) that also affect marks. Because these omitted variables are captured in the error term, Class_attendance is correlated with the error term. This is an endogeneity problem caused by omitted variable bias, so OLS gives biased estimates in this context.

Hence, we need other techniques, such as two-stage least squares (2SLS) or the generalized method of moments (GMM).
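To make the omitted variable bias in the attendance example concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data are synthetic, the true effect of attendance on marks is set to 2, and the strength of the `motivation` and `distance` effects is arbitrary. It uses only the Python standard library.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    return cov(x, y) / cov(x, x)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000
true_b = 2.0            # assumed true effect of attendance on marks

motivation = [rng.gauss(0, 1) for _ in range(n)]  # unobserved by the analyst
distance = [rng.gauss(0, 1) for _ in range(n)]    # observed, but left out here

# Attendance depends on both motivation and distance.
attendance = [0.8 * m - 0.5 * d + rng.gauss(0, 1)
              for m, d in zip(motivation, distance)]

# Motivation also raises marks directly, so it ends up in the error term.
marks = [50 + true_b * a + 3.0 * m + rng.gauss(0, 1)
         for a, m in zip(attendance, motivation)]

b_ols = ols_slope(attendance, marks)
print(f"true b = {true_b}, OLS estimate = {b_ols:.2f}")
```

Because motivation pushes attendance and marks in the same direction in this setup, the OLS slope comes out well above the assumed true value, and collecting more data does not shrink the gap.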

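The endogeneity issue is more serious than other issues such as multicollinearity, heteroskedasticity, and autocorrelation. Those mainly affect the precision of the estimates or the validity of the standard errors, whereas endogeneity makes the coefficient estimates themselves biased and inconsistent.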

How does the use of an instrumental variable (IV) remove endogeneity?

$Mark=\alpha +\beta \times \text{Class_attendance}+\epsilon_2$

Here, class attendance is endogenous because it is affected by factors such as motivation to read, which is abstract and hard to measure. Hence, the estimated coefficient β is biased. Our aim is to break the link between class attendance and motivation. To do so, we introduce an instrumental variable: a variable that affects class attendance but is as good as random with respect to motivation.

We use two-stage least squares (2SLS).

First, we regress class attendance on our IV. Second, we regress marks on the estimated (fitted) value of class attendance from the first stage. The fitted value varies only with the IV, so it is free of any impact of unobserved variables such as motivation.

Now, let's take rainfall as the IV. Is it a good IV? Plausibly yes. First, students may not go to school when it rains, so rainfall influences class attendance (it satisfies the relevance criterion). Second, and most important, rainfall is effectively random: it is unrelated to motivation and should affect marks only through attendance (it satisfies the exclusion restriction).
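The two criteria can be written down using the symbols of the Mark equation above. The following is a sketch under the assumption of a single instrument and a single endogenous variable, with Cov denoting the population covariance (notation introduced here for illustration):

$\mathrm{Cov}(rainfall, \text{Class_attendance}) \neq 0$ (relevance)

$\mathrm{Cov}(rainfall, \epsilon_2) = 0$ (exclusion restriction)

Taking the covariance of both sides of the Mark equation with rainfall gives

$\mathrm{Cov}(rainfall, Mark) = \beta \times \mathrm{Cov}(rainfall, \text{Class_attendance}) + \mathrm{Cov}(rainfall, \epsilon_2)$

The last term is zero by the exclusion restriction, and the relevance criterion allows us to divide, so

$\beta = \frac{\mathrm{Cov}(rainfall, Mark)}{\mathrm{Cov}(rainfall, \text{Class_attendance})}$

This shows why both criteria are needed: without relevance the denominator is zero, and without the exclusion restriction the leftover covariance with the error term contaminates the estimate.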

That is,

Stage 1

$\text{Class_attendance} = \delta + \gamma \times rainfall +\epsilon_3$

We obtain estimated_class_attendance, the fitted values of class attendance, from Stage 1.

Stage 2

$Mark = a + b \times \text{estimated_class_attendance} + v$

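Now, the estimated coefficient b is no longer contaminated by motivation, because introducing rainfall as the IV breaks the link between class attendance and motivation. Strictly speaking, the IV estimate is consistent rather than unbiased: it approaches the true value as the sample grows.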
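The two stages can be sketched by hand on synthetic data. This is an illustration under stated assumptions, not a recommended estimation routine: the data-generating numbers (a true effect of 2, the rainfall and motivation coefficients) are made up, and `fit_line` is a small helper written for this example.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def fit_line(x, y):
    """Return (intercept, slope) from an OLS regression of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 20_000
true_b = 2.0            # assumed true effect of attendance on marks

rainfall = [rng.gauss(0, 1) for _ in range(n)]    # the instrument
motivation = [rng.gauss(0, 1) for _ in range(n)]  # unobserved

# Rain lowers attendance; motivation raises it.
attendance = [-0.7 * r + 0.8 * m + rng.gauss(0, 1)
              for r, m in zip(rainfall, motivation)]
# Motivation also affects marks directly; rainfall does not.
marks = [50 + true_b * x + 3.0 * m + rng.gauss(0, 1)
         for x, m in zip(attendance, motivation)]

# Stage 1: regress attendance on rainfall and keep the fitted values.
delta, gamma = fit_line(rainfall, attendance)
estimated_class_attendance = [delta + gamma * r for r in rainfall]

# Stage 2: regress marks on the fitted attendance.
a, b_2sls = fit_line(estimated_class_attendance, marks)

# Plain OLS on the endogenous regressor, for comparison.
_, b_ols = fit_line(attendance, marks)

print(f"true b = {true_b}, OLS = {b_ols:.2f}, 2SLS = {b_2sls:.2f}")
```

In this setup the 2SLS slope lands close to the assumed true value while the OLS slope stays well above it, because the fitted attendance moves only with rainfall and not with motivation.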

Remember

In theory, we describe 2SLS as two separate regressions, but in practice it is better not to run the two regressions by hand. Using "estimated_class_attendance" in Stage 2 creates a problem for inference: the standard errors reported for the second-stage regression are wrong. This is because "the standard errors reported by OLS estimation of the second-stage are incorrect because they do not recognize that it is the second stage of a two-stage process" (Stock & Watson, 2002). Hence, statistical software such as Stata provides dedicated commands (for example, `ivregress 2sls`) that deal with this problem.
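The standard error problem can be illustrated with a small sketch on synthetic data. The assumptions are loud ones: the data-generating numbers are invented, `slope_se` is a helper written for this example, and the degrees-of-freedom choice (n minus 2) is one common convention rather than what any particular package does. The naive version computes residuals from the fitted attendance, as a second-stage OLS would; the corrected version computes them from the actual attendance, which is what the 2SLS variance formula calls for.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def fit_line(x, y):
    """Return (intercept, slope) from an OLS regression of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def slope_se(residuals, regressor):
    """Standard error of a slope, given residuals and the regressor used."""
    s2 = sum(e * e for e in residuals) / (len(residuals) - 2)
    mr = mean(regressor)
    return math.sqrt(s2 / sum((r - mr) ** 2 for r in regressor))

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 5_000
rainfall = [rng.gauss(0, 1) for _ in range(n)]
motivation = [rng.gauss(0, 1) for _ in range(n)]
attendance = [-0.7 * r + 0.8 * m + rng.gauss(0, 1)
              for r, m in zip(rainfall, motivation)]
marks = [50 + 2.0 * x + 3.0 * m + rng.gauss(0, 1)
         for x, m in zip(attendance, motivation)]

# Manual two-stage estimate.
delta, gamma = fit_line(rainfall, attendance)
fitted = [delta + gamma * r for r in rainfall]
a, b = fit_line(fitted, marks)

# Naive: residuals use the fitted attendance (what second-stage OLS reports).
naive_resid = [y - a - b * xh for y, xh in zip(marks, fitted)]
# Corrected: residuals use the actual attendance with the 2SLS coefficients.
correct_resid = [y - a - b * x for y, x in zip(marks, attendance)]

naive_se = slope_se(naive_resid, fitted)
correct_se = slope_se(correct_resid, fitted)
print(f"b = {b:.2f}, naive SE = {naive_se:.3f}, corrected SE = {correct_se:.3f}")
```

The coefficient is the same either way; only the reported uncertainty changes. In this particular setup the naive standard error is noticeably larger than the corrected one, but the direction of the error depends on the data, which is why a dedicated IV routine is the safer choice.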