# What is an Instrumental Variable?

An instrumental variable (IV) is a variable (say Z) that is highly correlated with an endogenous independent variable (say X) but is uncorrelated with the error term (e). Researchers use an instrumental variable when the model suffers from an endogeneity problem.

### The endogeneity problem arises from three sources:

i. Omitted variable bias: Omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The omitted variable is problematic only if it correlates with one of the regressors; otherwise it does not cause an endogeneity problem (for a detailed proof, see A deep dive into endogeneity problem). There is no direct statistical test for OVB, but one can suspect an omitted variable on the basis of established theory and earlier empirical research.
ii. Simultaneous causality bias: Simultaneous causality bias (SCB) occurs when there is a bidirectional relationship between the dependent and independent variables. Simultaneity and reverse causality seem similar, but they are different. Simultaneity means that the dependent and independent variables affect each other, while reverse causality refers to a situation in which the dependent variable affects the independent variable (for a detailed proof, see A deep dive into endogeneity problem).
iii. Existence of covariates: Covariates, in this context, are variables that affect both the dependent and the independent variables.
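
To make the omitted-variable case concrete, here is a minimal simulation sketch in Python (standard library only). The data-generating process, the coefficient values, and the names `ability`, `x_related`, and `x_unrelated` are assumptions made purely for illustration. The sketch suggests that leaving out a variable distorts the estimated slope only when that variable is correlated with the regressor.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx

random.seed(0)
n = 20000
true_b = -2.0  # assumed true effect of x on y

# Hypothetical omitted variable that also drives y.
ability = [random.gauss(0, 1) for _ in range(n)]

# Case 1: the regressor is correlated with the omitted variable.
x_related = [-0.8 * a + random.gauss(0, 1) for a in ability]
y_related = [true_b * x + 3.0 * a + random.gauss(0, 1)
             for x, a in zip(x_related, ability)]

# Case 2: the regressor is independent of the omitted variable.
x_unrelated = [random.gauss(0, 1) for _ in range(n)]
y_unrelated = [true_b * x + 3.0 * a + random.gauss(0, 1)
               for x, a in zip(x_unrelated, ability)]

# Both regressions leave `ability` out of the model.
slope_related = ols_slope(x_related, y_related)
slope_unrelated = ols_slope(x_unrelated, y_unrelated)

print(f"true slope:                          {true_b:.2f}")
print(f"omitted variable correlated with x:  {slope_related:.2f}")
print(f"omitted variable unrelated to x:     {slope_unrelated:.2f}")
```

Under these assumed parameters, the first slope should land well below the true value, while the second should stay close to it.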

### Properties of a valid instrument

i. Relevance criterion
The relevance criterion states that the instrumental variable (Z) and the predictor variable (X) have a strong relationship. Symbolically, COV(X, Z) is not equal to zero. The relevance criterion can be tested by regressing X on Z.
ii. Exogeneity
The exogeneity criterion states that the instrumental variable (Z) and the error term (e) have no relationship. Symbolically, COV(Z, e) is zero. Unlike relevance, exogeneity cannot be tested directly, because the error term is unobserved. Instead, we need to argue convincingly that the variable chosen as an instrument has no association with the error term or with other variables left out of the model. Thus, a randomly determined variable makes a strong candidate instrument, as it is unlikely to be related to other variables.
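
As a rough illustration of testing relevance with a regression, the sketch below regresses X on Z and reports the first-stage F-statistic (with a single instrument this is the squared t-statistic of the slope). The simulated data, the coefficients 0.5 and 0.01, and the function name `first_stage_f` are assumptions for illustration only. A widely used rule of thumb treats a first-stage F below about 10 as a sign of a weak instrument.

```python
import random

def first_stage_f(z, x):
    """F-statistic for the slope in a simple regression of x on z."""
    n = len(z)
    mean_z, mean_x = sum(z) / n, sum(x) / n
    s_zz = sum((zi - mean_z) ** 2 for zi in z)
    s_zx = sum((zi - mean_z) * (xi - mean_x) for zi, xi in zip(z, x))
    slope = s_zx / s_zz
    intercept = mean_x - slope * mean_z
    rss = sum((xi - intercept - slope * zi) ** 2 for zi, xi in zip(z, x))
    slope_var = (rss / (n - 2)) / s_zz  # homoskedastic variance of the slope
    return slope ** 2 / slope_var

random.seed(1)
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]

# Assumed strong first stage: X moves clearly with Z.
x_strong = [0.5 * zi + random.gauss(0, 1) for zi in z]
# Assumed weak first stage: X barely moves with Z.
x_weak = [0.01 * zi + random.gauss(0, 1) for zi in z]

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)

print(f"first-stage F, strong instrument: {f_strong:.1f}")
print(f"first-stage F, weak instrument:   {f_weak:.1f}")
```

This only speaks to relevance; no analogous regression can confirm exogeneity.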

In simple words, the IV should affect the dependent variable (Y) only through the independent variable (X), and it should have no correlation with the error term 'e'.
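
The two conditions can be tied together with a short derivation. This is the standard textbook argument for a model with one regressor, $Y = a + bX + e$, sketched here for reference; the symbols a and b are simply the intercept and slope. Taking the covariance of both sides with Z gives:

$COV(Z, Y) = b \times COV(Z, X) + COV(Z, e)$

Exogeneity sets $COV(Z, e) = 0$, and relevance guarantees $COV(Z, X) \neq 0$, so we can divide:

$b = \frac{COV(Z, Y)}{COV(Z, X)}$

By contrast, the OLS slope converges to $\frac{COV(X, Y)}{VAR(X)} = b + \frac{COV(X, e)}{VAR(X)}$, which differs from b whenever X is correlated with e.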

### Identification Strategy

In an identification strategy, a researcher describes how the instrument for the study was chosen. In this section, the researcher defends the validity of the instrument by arguing that it satisfies both the relevance and the exogeneity criteria. Although exogeneity cannot be proved statistically, the researcher gives a theoretical argument for why the IV is uncorrelated with the error term 'e'.
The example below shows how a researcher chooses an IV and defends the claim that it is uncorrelated with the error term.

### For example

We intend to estimate the causal effect of skipped on score.
$Score = a + b \times Skipped + e \quad (i)$
where score denotes the total score that a student achieved in the board examination, skipped denotes the total number of lectures missed in a semester, 'e' denotes the error term, and a and b denote parameters. Running a simple linear regression of score on skipped is risky because skipped might be correlated with other variables in 'e'.
For instance, highly motivated students and students with higher ability rarely skip classes. So, what might be a possible IV for skipped, one that is correlated with skipped but uncorrelated with motivation and ability (which are omitted variables)?
What about taking distance from home to college as a possible IV for skipped? The trouble is that students from low-income families may live in rural areas far from college. If income is in the error term 'e', distance is correlated with the error term, which violates the exogeneity criterion.
What might be a strong IV? Can we take rainfall as the IV? On a rainy day, students might skip the lecture. This establishes a positive relationship between skipping class and rainfall (relevance criterion). At the same time, rainfall is a random event that students cannot control, so it is plausibly uncorrelated with the omitted variables in the error term 'e' (exogeneity criterion). Thus, rainfall is a strong IV for skipped.
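
The contrast between distance and rainfall can be sketched with simulated data. Everything in the block below is an assumption for illustration: the data-generating process, the coefficient values, and the choice to treat all variables as standardized continuous measures. Income and ability sit in the error term; distance is built to depend on income, while rainfall is drawn independently of everything else.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n

def iv_slope(z, x, y):
    """Simple IV estimate of the effect of x on y using instrument z."""
    return cov(z, y) / cov(z, x)

random.seed(7)
n = 20000
true_b = -2.0  # assumed true effect of skipped on score

# Unobserved variables that end up in the error term 'e'.
income = [random.gauss(0, 1) for _ in range(n)]
ability = [random.gauss(0, 1) for _ in range(n)]

# Rainfall is independent of income and ability (assumed exogenous).
rainfall = [random.gauss(0, 1) for _ in range(n)]
# Distance depends on income, so it is correlated with 'e'.
distance = [-0.7 * inc + random.gauss(0, 1) for inc in income]

skipped = [0.8 * dist + 1.0 * rain - 1.0 * abil + random.gauss(0, 1)
           for dist, rain, abil in zip(distance, rainfall, ability)]
score = [true_b * skip + 3.0 * inc + 4.0 * abil + random.gauss(0, 1)
         for skip, inc, abil in zip(skipped, income, ability)]

b_rainfall = iv_slope(rainfall, skipped, score)
b_distance = iv_slope(distance, skipped, score)

print(f"true effect:             {true_b:.2f}")
print(f"IV using rainfall:       {b_rainfall:.2f}")
print(f"IV using distance:       {b_distance:.2f}")
```

In this setup both candidates are relevant, but only rainfall should recover something close to the assumed true effect; distance passes the relevance check and still gives a distorted estimate because it fails exogeneity.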

### Concept of two-stage least squares (2SLS)

Two-stage least squares (2SLS) is akin to simple regression, except that the regression is carried out twice.
The first-stage regression regresses the endogenous variable on the instrumental variable and any other independent variables in the model.
The second-stage regression regresses the dependent variable on the predicted endogenous variable and the other independent variables.

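For equation (i), the first-stage regression is:
$Skipped = c + d \times Rainfall + v$
where rainfall is the instrumental variable, skipped is the endogenous variable, and v is the first-stage error term.

The second-stage regression is:
$Score = f + g \times \widehat{Skipped} + u \quad (ii)$
where $\widehat{Skipped}$ is the fitted value of skipped from the first stage.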
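
The two stages above can be sketched by hand on simulated data. The block below is a minimal illustration under assumed parameters (the data-generating process, the coefficient values, and the helper `simple_ols` are not from any real dataset or library). Ability is left in the error term, and skipped is treated as a stylized continuous measure.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

random.seed(42)
n = 20000
true_b = -2.0  # assumed true effect of skipped on score

ability = [random.gauss(0, 1) for _ in range(n)]   # omitted variable
rainfall = [random.gauss(0, 1) for _ in range(n)]  # instrument

skipped = [5.0 + 1.0 * rain - 1.5 * abil + random.gauss(0, 1)
           for rain, abil in zip(rainfall, ability)]
score = [70.0 + true_b * skip + 4.0 * abil + random.gauss(0, 2)
         for skip, abil in zip(skipped, ability)]

# Naive OLS of score on skipped (ignores endogeneity).
_, b_ols = simple_ols(skipped, score)

# First stage: regress skipped on rainfall and keep the fitted values.
c, d = simple_ols(rainfall, skipped)
skipped_hat = [c + d * rain for rain in rainfall]

# Second stage: regress score on the fitted values.
f, g = simple_ols(skipped_hat, score)

print(f"true effect:      {true_b:.2f}")
print(f"OLS estimate:     {b_ols:.2f}")
print(f"2SLS estimate g:  {g:.2f}")
```

Under these assumptions the OLS slope should overstate the harm of skipping, because it also picks up the effect of ability, while the second-stage coefficient g should sit near the assumed true effect.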

#### Remember

In theory, 2SLS is described as two separate regressions, but in practice it is better not to estimate it by running two OLS regressions by hand. The predicted value of skipped in the second-stage equation (ii) creates a problem for inference rather than for the point estimate: "the standard errors reported by OLS estimation of the second-stage are incorrect because they do not recognize that it is the second stage of a two-stage process" (Stock & Watson, 2002). Hence, statistical software such as Stata provides a dedicated command that handles this problem.
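
The warning above can be illustrated numerically. The sketch below assumes homoskedastic errors, a single instrument, and a made-up data-generating process; the helper `simple_ols` and all coefficient values are hypothetical. The naive standard error uses second-stage residuals computed from the predicted regressor, whereas the corrected one computes residuals from the observed regressor while keeping the 2SLS coefficients. The divisor n - 2 is one convention; software may use n instead.

```python
import math
import random

def simple_ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

random.seed(42)
n = 20000
ability = [random.gauss(0, 1) for _ in range(n)]   # omitted variable
rainfall = [random.gauss(0, 1) for _ in range(n)]  # instrument
skipped = [5.0 + 1.0 * rain - 1.5 * abil + random.gauss(0, 1)
           for rain, abil in zip(rainfall, ability)]
score = [70.0 - 2.0 * skip + 4.0 * abil + random.gauss(0, 2)
         for skip, abil in zip(skipped, ability)]

# Two-stage estimate computed by hand.
c, d = simple_ols(rainfall, skipped)
skipped_hat = [c + d * rain for rain in rainfall]
f, g = simple_ols(skipped_hat, score)

# Variation in the fitted regressor (denominator of both variances).
mean_hat = sum(skipped_hat) / n
s_hat = sum((h - mean_hat) ** 2 for h in skipped_hat)

# Naive: residuals from the second-stage OLS, built on skipped_hat.
rss_naive = sum((y - f - g * h) ** 2 for y, h in zip(score, skipped_hat))
se_naive = math.sqrt(rss_naive / (n - 2) / s_hat)

# Corrected: same coefficients, residuals built on observed skipped.
rss_correct = sum((y - f - g * x) ** 2 for y, x in zip(score, skipped))
se_correct = math.sqrt(rss_correct / (n - 2) / s_hat)

print(f"2SLS coefficient:        {g:.3f}")
print(f"naive second-stage SE:   {se_naive:.4f}")
print(f"corrected 2SLS SE:       {se_correct:.4f}")
```

The point estimate is the same in both cases; only the standard error changes. Whether the naive value is too large or too small depends on the data; under the parameters assumed here it comes out too large.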

#### Stata Code for 2SLS

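    // 2SLS: score_s is the outcome, skip is the endogenous regressor, rainfall is the instrument
    ivregress 2sls score_s (skip = rainfall)
    // test of endogeneity
    estat endogenous
    // strength of the instrument (first-stage statistics)
    estat firststage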