/* Intermediate Quantitative Methods - Lab Class 1
Rather than execute this do file all in one go - which would be rather pointless - I recommend that you read all the comments in green,
then execute the code line by line - or in blocks - and then review and think about the results.
students.dta contains information on a random sample of 1926 secondary school students who have been followed from age
17 until, in some cases, six months after they graduate from university. The following variables are available to you:
uniapp - the respondent applied to university;
gcse - GCSE examination points at age 16, truncated at 80;
female - the respondent is female;
class - social class background: 1 salariat, 2 intermediate, 3 blue-collar;
fsm - whether at any time from age 11 the respondent was eligible for free school meals;
mdeg - the respondent’s mother has a university degree;
hours - average number of hours per week spent studying while at university;
attend - whether respondent tries to attend all of their lectures;
degclass - degree class on graduation: 1 third, 2 lower-second, 3 upper-second, 4 first;
post92 - university was a post 1992 university;
subject - major at university: 1 arts & humanities, 2 social sciences, 3 sciences & maths;
ecstat - economic status 6 months after graduating: 1 employed, 2 unemployed, 3 out of labour market.
Set your working directory to something sensible - you need to be able to read & write from it.
It almost certainly won't have the same name as mine. The default will probably do, in which case you can ignore the next command :-)
*/
cd "C:\Users\nn\Dropbox\Actual\Research Assistance\TA IQM"
* Useful standard header for do-files to ensure smooth re-running of the files
set more off
clear
capture log close
* Read in the datafile
use students.dta
* Open a log file. You can use any file name you want. I like to save things as text files.
log using iqm_lab_1, text replace
* It's always a good idea to inspect the data before you do anything else
summ uniapp-ecstat
/* Notice that the first six variables have 1926 cases while the rest have rather fewer. Obviously, if a student
did not attend university there are no observations for the university-related variables.
You might want to inspect some variables more closely: tab1 makes a one-way frequency table for each variable in a list. */
tab1 female class degclass subject ecstat
/* Binary Logit
Let’s now consider predicting whether a pupil applies to university. Relevant predictors will be GCSE examination results
as well as social background predictors and personal characteristics:
class is a categorical variable so let's make dummy variables on the fly and set the base for the comparison to category 3 "blue-collar" */
logit uniapp gcse female ib3.class fsm mdeg
/* What scale are these "effects" on?
Consider carefully whether the coefficients make sense (this requires you to think about what the numbers mean!)
You decide that perhaps having a university educated mother is more important for a female than a male and want to test this idea.
It implies fitting an interaction effect:
We'll now explicitly declare gcse as continuous, and fsm, female and mdeg as categorical.
By default Stata sets the base category to the lowest value, so we should get similar effects to those estimated in the first equation.
Note that when we specify the interaction by using the ## symbols Stata automatically includes the main effects. */
logit uniapp c.gcse ib3.class i.fsm i.female##i.mdeg
/* What does this model suggest (interpret the interaction effect)?
Note that the interaction effect estimates a parameter for observations that are female & have a university educated mother.
Perhaps in addition the difference between the male and female propensity depends on GCSE score:
*/
logit uniapp ib3.class i.fsm i.female##i.mdeg i.female##c.gcse
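* Before moving on, one hedged way to check whether the female-by-gcse interaction just added
* earns its place is a Wald test that its coefficient is zero. This is only a sketch of one
* possible check, not the only sensible one. It leaves the stored estimates untouched,
* so the margins commands further down still refer to the model above.
testparm i.female#c.gcse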
*Let’s say we are satisfied with this model (should we be?). We now explore the implications of the model for predicted probabilities.
*Let’s calculate probability of males making a university application at some interesting covariate values:
margins, atmeans at(female=0 class= 1 fsm=0 mdeg=1)
*Now do the same for the females.
margins, atmeans at(female=1 class= 1 fsm=0 mdeg=1)
*You can do this all in one go by using:
margins female, atmeans at(class= 1 fsm=0 mdeg=1)
*Now consider what happens at the 90th percentile of gcse score.
margins female, at((p90) gcse class= 1 fsm=0 mdeg=1)
*Now test for the difference between men and women at the 90th percentile:
margins female, at((p90) gcse class= 1 fsm=0 mdeg=1) contrast(cieffects)
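* A hedged alternative sketch: because female enters the model as a factor variable,
* dydx(female) reports the discrete change in the predicted probability when female
* goes from 0 to 1. With every other covariate fixed as above, this should reproduce
* the male-female difference and its confidence interval from the contrast.
margins, dydx(female) at((p90) gcse class=1 fsm=0 mdeg=1)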
/* Does the 95% confidence interval include zero?
Now let’s plot predicted probabilities for men and women by gcse score.
First let’s tabulate the predicted probabilities at gcse scores from 0 to 80 in steps of 5 units. */
margins female, at(gcse=(0(5)80) class=1 fsm=0 mdeg=0)
*Now make the plot:
marginsplot, title(Predicted probabilities and 95% CIs of applying for university)
/*
Ordinal logit
Now let’s turn to the outcome of the students’ university studies.
This is contained in degclass – a four category ordinal variable.
Let’s fit the following model:
*/
ologit degclass i.female ib3.class c.hours i.attend
/*What does the model tell you about the relationships between the predictors and the response?
Now let’s examine the predicted probabilities for a female from a blue-collar background
who studies for the mean number of hours and tries to attend all her lectures.
It probably helps to recall that in the Stata parameterization of this model:
  Pr(y_i = j | x_i) = F(A_j - x_i B) - F(A_{j-1} - x_i B)
where F(z) = exp(z)/(1 + exp(z)) and A_1 < A_2 < A_3 are the estimated cutpoints.
So to find the probability of being in category j at particular values of the xs we need to calculate:
  lowest category (j = 1):       F(A_1 - x_i B)
  middle categories (j = 2, 3):  F(A_j - x_i B) - F(A_{j-1} - x_i B)
  highest category (j = 4):      1 - F(A_3 - x_i B)
*/
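* A minimal sketch of the hand calculation described above, for a female (female=1) from a
* blue-collar background (class=3 is the base category, so it contributes 0 to xB) who tries
* to attend all lectures (attend=1) and studies the mean number of hours.
* Assumptions: female and attend are coded 0/1, and the cutpoints can be referenced as
* _b[/cut1] etc. (the exact name may differ across Stata versions; check the ologit output).
* The scalar names mhours, xb and p1-p4 are my own and purely illustrative.
quietly summarize hours if e(sample)
scalar mhours = r(mean)
* Linear predictor x_i B at the chosen covariate values
scalar xb = _b[1.female] + _b[hours]*mhours + _b[1.attend]
* invlogit(z) is exp(z)/(1 + exp(z)), i.e. F(z) in the formula above
scalar p1 = invlogit(_b[/cut1] - xb)
scalar p2 = invlogit(_b[/cut2] - xb) - invlogit(_b[/cut1] - xb)
scalar p3 = invlogit(_b[/cut3] - xb) - invlogit(_b[/cut2] - xb)
scalar p4 = 1 - invlogit(_b[/cut3] - xb)
display "P(third) = " p1 "  P(lower second) = " p2 "  P(upper second) = " p3 "  P(first) = " p4
* These four values should sum to 1 and should agree with the female=1 rows that margins reports below.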
forvalues c=1/4 {
    margins female, predict(outcome(`c')) atmeans at(class=3 attend=1)
}
*Test for the differences between men and women at the same covariate values. What does this tell you?
forvalues c=1/4 {
    margins female, predict(outcome(`c')) atmeans at(class=3 attend=1) contrast(cieffects)
}
* For the sake of argument, let’s assume you are satisfied with all of this.
* Now let’s plot the probabilities. NB: the graphs are saved to and read back from the working directory, so it must be set to a folder you can write to.
ologit degclass i.female ib3.class c.hours i.attend
margins female, predict(outcome(1)) at(hours=(20(5)60) class=3 attend=1)
marginsplot, title(Third) ytitle(P(third)) saving(degclass1, replace)
margins female, predict(outcome(2)) at(hours=(20(5)60) class=3 attend=1)
marginsplot, title(Lower Second) ytitle(P(lower second)) saving(degclass2, replace)
margins female, predict(outcome(3)) at(hours=(20(5)60) class=3 attend=1)
marginsplot, title(Upper Second) ytitle(P(upper second)) saving(degclass3, replace)
margins female, predict(outcome(4)) at(hours=(20(5)60) class=3 attend=1)
marginsplot, title(First) ytitle(P(first)) saving(degclass4, replace)
graph combine degclass1.gph degclass2.gph degclass3.gph degclass4.gph, ///
title(Predicted probabilities and 95% CIs of achieving various degree classes, size(medium))
/* Above we have first created separate graphs for the predicted probabilities
for ending up in each of the four degree classes. Subsequently, these four
graphs are combined into one large graph with four panels. */
/*
Multinomial logit (for you to pursue in your own time)
Finally the students have entered the labour market. Six months after graduation their economic status is recorded.
Explore what predicts their economic status. Calculate appropriate quantities,
for example predicted probabilities and differences in predicted probabilities.
The details of making informative plots will depend on the model you fit.
You will find it helpful to consult Long, J. S. and J. Freese (2006)
Regression Models for Categorical Dependent Variables Using Stata (2nd ed.), pp. 250-254.
*/
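* A hedged starting point for the multinomial logit exercise, not a recommended final model.
* The choice of predictors and of "employed" (1) as the base outcome are my own assumptions,
* made purely for illustration; you should decide on the specification yourself.
mlogit ecstat i.female ib3.class i.degclass i.post92 i.subject, baseoutcome(1)
* Predicted probability of each economic status for men and women, other covariates at their means
forvalues k=1/3 {
    margins female, predict(outcome(`k')) atmeans
}
* Difference between women and men in the predicted probability of being unemployed (outcome 2)
margins r.female, predict(outcome(2)) atmeans
* Close the log file opened at the top of this do-file
log close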