/// A walkthrough of instrumental variables (IV) estimation
clear
/// Set up the correlation matrix for the problem. Rows and columns are ordered x_1, x_2, e, x_3. ///
/// x_1 = education (in some appropriate units); x_2 = ability (in some appropriate units); ///
/// e = random error uncorrelated with everything else; x_3 = a suitable instrument for x_1 (living near a college?). ///
/// x_3 needs to be correlated with x_1 but not with x_2 or e. ///
matrix input c = (1, .5, 0, .5 \ .5, 1, 0, 0 \ 0, 0, 1, 0 \ .5, 0, 0, 1)
matrix list c
/// Set the seed for the random number generator so the results are reproducible. ///
set seed 12345
/// Create 1000 observations. ///
set obs 1000
/// Generate an id number for the observations. ///
gen id=_n
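/// Now generate the variables x_1, x_2, e and x_3 with the given correlation structure. ///
/// By default corr2data gives each variable mean 0 and standard deviation 1, and the sample correlations match c exactly. ///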
corr2data x_1 x_2 e x_3, corr(c)
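/// Optional check, added as a sketch: confirm that the generated data reproduce the intended structure. ///
/// The correlation table should match matrix c, and each variable should have mean 0 and standard deviation 1. ///
correlate x_1 x_2 e x_3
summarize x_1 x_2 e x_3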
/// Generate y from x_1, x_2 and e. ///
gen y = 5 + 2*x_1 + 4*x_2 + 5*e
/// Call y income. By construction we now know how income depends on education and ability: ///
/// the true coefficient on x_1 is 2 and the true coefficient on x_2 is 4. ///
/// Check we haven't messed up by estimating the correctly specified regression; the slopes should be 2 and 4. ///
reg y x_1 x_2
/// Now assume we don't observe x_2 (ability) and regress y on x_1 alone. Remember the target is to estimate the causal effect of x_1 on y. ///
/// We have omitted variable bias: ability is correlated with education and also raises income, so the slope estimate for x_1 ///
/// is biased upward and does not estimate the causal effect of x_1 on y. ///
reg y x_1
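/// A sketch of the omitted-variable-bias arithmetic, using x_2 only because this is a simulation where we can see it. ///
/// The slope of y on x_1 alone equals the true effect (2) plus the effect of x_2 (4) times the slope from regressing x_2 on x_1. ///
/// That auxiliary slope should be about .5 here, so the implied value should be close to the slope reported by reg y x_1. ///
quietly regress x_2 x_1
display "Implied slope of y on x_1 alone: " 2 + 4*_b[x_1]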
/// Now use the instrument for x_1. Regress x_1 on x_3 (the first stage) and save the predicted values as x_1_hat. ///
/// x_1_hat consists of the variation in x_1 shared with x_3 (which predicts it); the variation in x_1 ///
/// that was confounded with variation in x_2 has been removed. By construction x_2 and x_3 are uncorrelated, ///
/// so whatever x_1_hat is, it can't be influenced by x_2. ///
reg x_1 x_3
predict x_1_hat
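/// Optional check, added as a sketch: x_1_hat is a linear function of x_3 alone, so it should be ///
/// perfectly correlated with x_3 and uncorrelated with x_2 (again, x_2 is visible only because we simulated it). ///
correlate x_1_hat x_3 x_2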
/// Now regress y on the first-stage fitted values x_1_hat (the second stage). Note that x_3, not x_1_hat, is the instrument. ///
reg y x_1_hat
/// Hallelujah! We now get the right slope estimate for x_1. ///
/// But we don't get the right standard errors, because OLS computes the residuals from x_1_hat rather than from x_1. ///
/// That can be fixed by using the ivregress command. ///
/// NB the correct se for the IV estimate is still larger than for the original OLS estimate. This is because x_1_hat contains ///
/// less information about the variation in x_1 than x_1 itself. ///
ivregress 2sls y (x_1=x_3)
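/// A hedged sketch for inspecting the first stage and comparing the two sets of standard errors side by side. ///
/// The stored-estimate names iv_2sls and manual_2nd are illustrative and not part of the original walkthrough. ///
/// The slopes should agree (listed under x_1 and x_1_hat respectively); only the standard errors should differ. ///
estat firststage
estimates store iv_2sls
quietly regress y x_1_hat
estimates store manual_2nd
estimates table iv_2sls manual_2nd, se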
/// Those who are really interested in all the gory details can consult Greene, Econometric Analysis (3rd ed.), pp. 288-295, ///
/// or Kennedy, A Guide to Econometrics (4th ed.), pp. 151-153. Both explain why estimation by OLS using x_1_hat gives ///
/// the wrong standard errors. ///
/// Finally, a trick for remembering what IV regression is about - hat tip to Andrew Gelman http://andrewgelman.com/2009/07/14/how_to_think_ab_2/ ///
/// Regress y on x_3 ///
reg y x_3
/// The IV estimate for x_1 is the ratio of two OLS slopes: the slope from regressing y on x_3 ///
/// divided by the slope from regressing x_1 on x_3. In this case 1/.5 = 2. ///
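/// A minimal sketch of that ratio computed directly. The scalar names reduced_form and first_stage are illustrative. ///
/// The result should be close to the ivregress slope for x_1. ///
quietly regress y x_3
scalar reduced_form = _b[x_3]
quietly regress x_1 x_3
scalar first_stage = _b[x_3]
display "IV estimate as a ratio of slopes: " reduced_form/first_stage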