* Here is some STATA code to obtain the OLS estimates for a simple regression problem with 3 predictors.
* You can run it as a do file (NB the file is called normalreg.txt) from the command window, or cut and paste it into
* the do-file window.
*
* The data are read from the STATA file simdata.dta, which must be in memory.
*
* In this example we do the same thing (more or less) in three different ways.
*
*
* First we handcrank it in STATA's matrix language MATA, where the objective boils down to
* finding the elements of the 4x1 column vector b_hat:   b_hat = inv(X'X)X'y
* and the associated standard errors - contained in se_hat, where
* se_hat = sqrt(diag(s2*inv(X'X))) and s2 = e'e/(n - k), with k the number of columns of X.
*
* For didactic reasons, if anyone is interested, the MATA programme is broken down into an unnecessarily large
* number of steps. It is possible, and in practice desirable, to programme this more succinctly
* (a more succinct version is sketched at the end of this file).
*
* It is absolutely not necessary to follow this for the purposes of this course. However, if you already know some linear
* algebra, or want to learn some, you might find it interesting/illuminating.
* Two useful references:
* Jacques Tacq (1997) Multivariate Analysis Techniques in Social Science Research, Sage, pp. 388-400.
* Daniel A. Powers and Yu Xie (2008) Statistical Methods for Categorical Data Analysis (2nd ed), Emerald, pp. 269-275.
* Being able to read matrix notation opens up the exposition of the more 'advanced' techniques to you.
* Without it you will find that you can get so far - i.e. to where this course ends - and not much further.
*
*
* Then we use STATA's standard 'black box' regression routine to do the same thing - this is what you would normally use
* and is what you need to know for the primary purposes of this course.
*
* Finally we estimate the normal regression model by maximum likelihood - the coefficients are identical to those
* estimated by OLS, but the root MSE and the estimated standard errors differ (very slightly). This is not so relevant for
* week 1 and may even appear a little mysterious. Hopefully its relevance will become apparent in week 2.
*
*
*
use "I:\simdata.dta", clear    // obviously you should change this line to reflect where you are reading the data from

// enter mata
mata

// define y as a column vector and get the data from STATA
y = st_data(., "score1")

// define X as a 1000 x 4 matrix (not forgetting the constant in the first column)
x = st_data(., ("constant", "ability", "hours", "female"))

// generate tx as x' - the transpose of x
tx = x'

// generate the x'y cross-products matrix
txy = tx*y

// generate the cross-products for x - the predictor variables
txx = tx*x

// generate the inverse of txx
itxx = invsym(txx)
itxx

// generate the OLS estimated constant and slope coefficients
b_hat = itxx*txy

// generate the estimated residuals
e_hat = y - x*b_hat

// calculate the estimated standard errors - first the variances, then take the square root of the diagonal entries
s2 = (1 / (rows(x) - cols(x))) * (e_hat' * e_hat)
V = s2 * itxx
se_hat = sqrt(diagonal(V))

// print out the coefficients and standard errors
b_hat
se_hat

// leave mata
end


* Here is the normal way you would do the same thing in STATA

reg score1 ability hours female


* Now do the estimation by maximum likelihood
*
* NB the standard errors will be (very) slightly different - they converge as n gets bigger -
* hence exp(lnsigma) is not exactly equal to the root MSE from STATA's standard routine

program normalreg
        version 11.1
        args lnf xb lnsigma
        local y "$ML_y1"
        quietly replace `lnf' = ln(normalden(`y', `xb', exp(`lnsigma')))
end

ml model lf normalreg (xb: score1 = ability hours female) (lnsigma:)
ml maximize
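

* Two optional extras - like the MATA walk-through above, neither is needed for the course.
*
* (1) As mentioned at the top of the file, the MATA calculation can be programmed more succinctly.
*     Here is a sketch of one way to do it; it assumes simdata.dta is still in memory and uses the
*     same variable names (constant, ability, hours, female) as the step-by-step version above.

mata
y      = st_data(., "score1")
x      = st_data(., ("constant", "ability", "hours", "female"))
b_hat  = invsym(x'*x) * (x'*y)
e_hat  = y - x*b_hat
se_hat = sqrt(diagonal(((e_hat'*e_hat) / (rows(x) - cols(x))) * invsym(x'*x)))
b_hat, se_hat          // coefficients in the first column, standard errors in the second
end

* (2) A small check on the remark that exp(lnsigma) is not exactly the root MSE. The ML estimate of
*     sigma^2 divides the residual sum of squares by n, whereas the OLS root MSE divides by n - k,
*     so exp(lnsigma) should be very close to sqrt((n - k)/n) * root MSE. This sketch assumes
*     -ml maximize- has just been run, so that _b[lnsigma:_cons] is available.

display "ML sigma-hat:                 " exp(_b[lnsigma:_cons])
quietly regress score1 ability hours female
display "OLS root MSE:                 " e(rmse)
display "root MSE * sqrt((n - k)/n):   " sqrt(e(df_r)/e(N)) * e(rmse)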