# What Is an Instrumental Variable?

An instrumental variable (IV) is a variable (say Z) that is strongly correlated with one of the independent variables (say X) but uncorrelated with the error term (e). Researchers use an instrumental variable when the model suffers from an endogeneity problem, that is, when X is correlated with e.

### The endogeneity problem arises from three sources:

i. Omitted variable bias: Omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables that are correlated with the included independent variables. There is no direct statistical test for OVB, but one can suspect an omitted variable on the basis of established theory and earlier empirical research.

ii. Simultaneous causality bias: Simultaneous causality bias (SCB) occurs when there is a bidirectional relationship between the dependent variable and an independent variable. Simultaneity and reverse causality seem similar, but they are different. Simultaneity means that the dependent and independent variables affect each other, while reverse causality refers to a situation in which the dependent variable affects the independent variable rather than the other way around.

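iii. Existence of confounding covariates: These are variables that affect both the dependent and the independent variable. When such a variable is not controlled for, its effect ends up in the error term, which makes the independent variable correlated with e.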
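
The size and direction of omitted-variable bias can be written down explicitly. As a sketch, suppose the true model contains one additional variable W with coefficient c (both symbols are hypothetical and introduced only for this illustration), that W is left out of the regression, and that X is uncorrelated with e:

```latex
\begin{aligned}
\text{True model:} \quad & Y = a + bX + cW + e \\
\text{Estimated model:} \quad & Y = a + bX + u, \qquad u = cW + e \\
\text{OLS limit:} \quad & \hat{b}_{OLS} \;\xrightarrow{p}\; b + c\,\frac{\mathrm{Cov}(X, W)}{\mathrm{Var}(X)}
\end{aligned}
```

The bias term vanishes only if W has no effect on Y (c = 0) or W is uncorrelated with X (Cov(X, W) = 0). Otherwise X is correlated with the error term u, which is the endogeneity problem.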

### Properties of a valid instrument

i. Relevance criterion
The relevance criterion states that the instrumental variable (Z) and the endogenous predictor (X) have a strong relationship. Symbolically, Cov(X, Z) is not equal to zero. Relevance can be tested by regressing X on Z and checking that the coefficient on Z is statistically significant.

ii. Exogeneity criterion
The exogeneity criterion states that the instrumental variable (Z) and the error term (e) have no relationship. Symbolically, Cov(Z, e) is zero. Unlike relevance, exogeneity cannot be tested directly, because e is unobserved. Instead, we need to argue convincingly that the instrument has no association with the error term, that is, with the variables left out of the model. For this reason, a variable that is determined randomly makes a credible instrument: it is unlikely to be related to any of the omitted variables.

In simple words, an IV should affect the dependent variable (Y) only through the independent variable (X), and it should have no correlation with the error term e.
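
To see why these two properties are enough to recover the effect of X, consider a simple model Y = a + bX + e with a single instrument Z. The derivation below is a sketch for that one-regressor, one-instrument case. Taking the covariance of both sides with Z gives:

```latex
\begin{aligned}
\mathrm{Cov}(Z, Y) &= b\,\mathrm{Cov}(Z, X) + \mathrm{Cov}(Z, e) \\
                   &= b\,\mathrm{Cov}(Z, X) \qquad \text{(exogeneity: } \mathrm{Cov}(Z, e) = 0\text{)} \\
b &= \frac{\mathrm{Cov}(Z, Y)}{\mathrm{Cov}(Z, X)} \qquad \text{(relevance: } \mathrm{Cov}(Z, X) \neq 0\text{)}
\end{aligned}
```

Exogeneity removes the error term from the first line, and relevance guarantees that the final ratio is well defined. When Cov(Z, X) is close to zero the ratio becomes unstable, which is why a weakly correlated instrument is a problem.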

### Identification Strategy

In the identification strategy, a researcher describes how the instrument for the study was chosen. This section defends the validity of the instrument by showing that it satisfies both the relevance criterion and the exogeneity criterion. Because exogeneity cannot be established statistically, the researcher gives a theoretical argument for why the IV is uncorrelated with the error term e.
The example below shows how a researcher chooses an IV and defends the claim that it is uncorrelated with the error term.

### Example

We want to estimate the causal effect of skipped on score:

Score = a + b*Skipped + e ... (i)

where score denotes the total score a student achieved in the board examination, skipped denotes the total number of lectures missed in a semester, e denotes the error term, and a and b denote parameters. Estimating equation (i) by simple linear regression is risky, because skipped may be correlated with other variables contained in e. For instance, highly motivated students and students with higher ability rarely skip classes, and motivation and ability also raise scores. Both are omitted variables. So what might be a possible IV for skipped, one that is correlated with skipped but uncorrelated with motivation and ability?

What about the distance from home to college? Distance probably affects skipping, but students from low-income families may live in rural areas far from the college. If income is part of the error term e, then distance is correlated with the error term, which violates the exogeneity criterion.

Can we take rainfall as the IV? On a rainy day, students are more likely to skip lectures, which establishes a positive relationship between skipping class and rainfall (relevance criterion). Rainfall is also a random event that students cannot control, so it is plausibly uncorrelated with the omitted variables in the error term e (exogeneity criterion). Thus, rainfall is a plausible IV for skipped.
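
The argument for rainfall can be illustrated on simulated data. The Stata sketch below is not based on any real dataset: the data-generating process, every coefficient value, and the variable `ability` (standing in for the omitted variables) are assumptions made for illustration. The true value of b is set to -2, and rainfall is drawn independently of ability so that it satisfies exogeneity by construction.

```stata
// Simulated data: every number below is an assumption chosen for illustration
clear
set seed 12345
set obs 5000

generate ability  = rnormal()        // omitted variable, ends up in e
generate rainfall = 10*runiform()    // instrument, drawn independently of ability
generate skipped  = 5 + 0.5*rainfall - 1.5*ability + rnormal()
generate score    = 70 - 2*skipped + 5*ability + rnormal()   // true b = -2

// Relevance: the coefficient on rainfall should be clearly different from zero
regress skipped rainfall

// OLS without ability: skipped also picks up part of the ability effect
regress score skipped
scalar b_ols = _b[skipped]

// IV estimate with rainfall as the instrument for skipped
ivregress 2sls score (skipped = rainfall)
scalar b_iv = _b[skipped]

display "OLS estimate of b: " b_ols
display "IV estimate of b:  " b_iv
```

Under these assumptions the OLS estimate should overstate the harm of skipping (it is pulled well below -2, because able students both skip less and score more), while the IV estimate should land close to -2. For simplicity the simulated `skipped` is continuous rather than a count of lectures.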

### Concept of two-stage least squares (2SLS)

Two-stage least squares (2SLS) is akin to simple regression, except that the regression is carried out twice.
The first-stage regression regresses the endogenous variable on the instrumental variable, along with the other independent variables in the model.
The second-stage regression regresses the dependent variable on the predicted values of the endogenous variable from the first stage and the other independent variables.

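For equation (i), the first-stage regression is:

Skipped = c + d*Rainfall + v

where rainfall is the instrumental variable and skipped is the endogenous variable.

The second-stage regression is:

Score = f + g*Skipped_hat + u

where Skipped_hat is the value of skipped predicted by the first stage, and g is the 2SLS estimate of b.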
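
The two stages can also be run by hand to see what 2SLS does internally. The Stata sketch below uses simulated data: the data-generating process, the coefficients, and the variable `ability` are assumptions made for illustration. It runs the first stage with `regress`, stores the fitted values with `predict`, runs the second stage on those fitted values, and compares the result with `ivregress 2sls`.

```stata
// Simulated data: all coefficients are assumptions for illustration
clear
set seed 2024
set obs 5000
generate ability  = rnormal()
generate rainfall = 10*runiform()
generate skipped  = 5 + 0.5*rainfall - 1.5*ability + rnormal()
generate score    = 70 - 2*skipped + 5*ability + rnormal()

// First stage: regress the endogenous variable on the instrument
regress skipped rainfall
predict double skipped_hat, xb

// Second stage: regress the outcome on the first-stage fitted values
regress score skipped_hat
scalar g_manual = _b[skipped_hat]

// Same point estimate in one step
ivregress 2sls score (skipped = rainfall)
scalar g_2sls = _b[skipped]

display "Manual second-stage estimate: " g_manual
display "ivregress 2sls estimate:      " g_2sls
```

The two point estimates should agree up to rounding. The standard errors differ, however: the manual second stage treats `skipped_hat` as if it were observed data, so its standard errors are not valid. This is the practical reason to use `ivregress 2sls` rather than two separate `regress` commands.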

#### Stata Code for 2SLS

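```stata
// 2SLS: score_s is the outcome, skip is the endogenous regressor, rainfall is the instrument
ivregress 2sls score_s (skip = rainfall)

// Test of endogeneity (H0: skip is exogenous)
estat endogenous

// Strength of the instrument: first-stage statistics
estat firststage
```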