Data Analysis for Neuroscientists V:
Regression


Subject recruitment, 1904 style, at the Galton Lab (notice the subject pays the experimenter!)



Regression: How to predict one variable from another variable

Last week we looked at how to test for differences between sets of data points - e.g. between the heights of alleged Vikings and bona fide Shetlanders.

This week we will consider a complementary question:

How can the value of one variable be predicted from the value of another variable?

For example: can a child's height be predicted from the heights of its parents?

Questions like this can be addressed by fitting regression models.



Galton's measurements: child vs. midparent

In the early 20th century, Galton set up his "Anthropometric Laboratory", in which he measured a whole set of characteristics of individuals (including, notably, collecting some of the first reaction time measurements).

He got a lot of data and people actually paid to take part in his study (see handbill above)!

One of Galton's observations was that children's height is similar to the average of their parents' heights (the 'midparent height').

  • Download Galton's data parentChildHeight.txt

  • Load it into Matlab as a matrix called heights

  • Each row is one child.
    • Column 1 is the midparent's height (mean of father and mother's height)
    • Column 2 is the child's height
    • Heights are in inches and are rounded to the nearest inch

  • Plot the children's heights against the midparent heights
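One way to do this (a sketch; it assumes parentChildHeight.txt has been saved in your current Matlab folder):

    % Load Galton's data: column 1 = midparent height, column 2 = child height
    heights = load('parentChildHeight.txt');

    midparent = heights(:,1);
    child     = heights(:,2);

    figure; hold on
    plot(midparent, child, 'k.')     % one dot per child
    xlabel('Midparent height (inches)')
    ylabel('Child height (inches)')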


Fitting the regression line

The red line on the graph above is a regression line, indicating the best-fitting linear relationship between:

  • Our explanatory variable: midparent height
    • also called the independent variable
    • plotted on the x-axis
    • often manipulated by the researcher (but not in this case)

  • Our response variable: child height
    • also called the dependent variable
    • plotted on the y-axis
    • observed but not manipulated by the researcher

The simplest case is that the predicted value y^ of the dependent variable y is a linear function of the independent variable x

y^ = mx + c

We tend to use slightly different notation:
y^ = β^1 x + β^0

.... where β^0 is the estimated y-intercept and β^1 is the estimated gradient.

We can find the values of β^0 and β^1 analytically (there is a closed-form equation), but we are not going to go into the mechanics of this today. Instead we will fit the regression line using the Matlab function glmfit

  • Use glmfit to find the values of β^0 and β^1

  • What is the fitted slope? What is the fitted y-intercept?

    • You can work out which is which either from the help info, or by common sense

  • Add a regression line to your plot using these values
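A sketch of one way to do this, assuming the variables midparent and child from the previous step:

    % glmfit with a single predictor returns the coefficients as [intercept; slope]
    b = glmfit(midparent, child);    % b(1) = beta0-hat, b(2) = beta1-hat

    % Add the fitted regression line to the scatter plot
    xFit = min(midparent):0.1:max(midparent);
    plot(xFit, b(1) + b(2)*xFit, 'r-', 'LineWidth', 2)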


The Residuals

The differences between the actual data and the fitted data, y - y^, are called the residuals

When fitting the regression line we chose the values of β^0 and β^1 that minimise the sum of squared residuals (this is called "ordinary least squares" regression, or OLS).
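For reference (a standard result, not needed for the exercises), the quantity being minimised, and its closed-form solution for a single predictor, are:

    Σ ( y_i - β^0 - β^1 x_i )²

    β^1 = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)²
    β^0 = ȳ - β^1 x̄

... where x̄ and ȳ are the means of the x and y values.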

OPTIONAL MATHS: WHY minimize squared errors?

We can find the residuals for our regression by working out the predicted value of child height y, that is, y^, for each value of midparent height x in our data set...

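For example, using the coefficient vector b returned by glmfit above:

    % Predicted child height for each midparent height in the data set
    yhat = b(1) + b(2)*midparent;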

... and then computing y - y^

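Continuing the sketch above:

    % Residuals: observed minus fitted
    res = child - yhat;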

Plot a histogram of the residuals. Are they normally distributed?

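One way to plot the histogram:

    figure
    hist(res, 20)        % 20 bins; histogram(res) also works in newer Matlab versions
    xlabel('Residual (inches)')
    ylabel('Number of children')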


Testing, testing

The regression line we just fitted looks pretty convincing. But how can we test whether the relationship is statistically significant?

To test whether there is a significant relationship between x and y, we want to test whether the slope is significantly different from zero. There are two factors that should influence our opinion about this:

  • how large the fitted slope is
  • how much uncertainty there is in our estimate of the slope

If this second point sounds a bit strange, consider the cases below:

  1. Our original data
  2. More spread in y --> more uncertainty about slope
  3. Less spread in x --> more uncertainty about slope
  4. Fewer data points --> more uncertainty about slope

The equation for the standard error (SE) of the regression slope captures all of these factors:

SE = sqrt[ Σ (y_i - y^_i)² / (n-2) ] / sqrt[ Σ (x_i - x̄)² ]

... where y^_i is the predicted value for data point i, x̄ is the mean of the x values, and n is the number of data points.

In other words, the SE of the slope gets bigger when the residuals are more spread out (more spread in y), and smaller when the x values are more spread out or when there are more data points.

The significance of the regression slope can then be calculated simply by defining the t-statistic:
t = β^1 / SE
... where β^1 is the fitted regression slope ...

And comparing it against the t-distribution with (n-2) degrees of freedom.

Exercise: significance of the regression slope

To calculate the t-value for our regression:

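A sketch, using the residuals res and the coefficient vector b computed earlier:

    % Standard error of the slope, from the residuals and the spread of x
    n  = length(child);
    s  = sqrt(sum(res.^2) / (n-2));                       % residual standard deviation
    SE = s / sqrt(sum((midparent - mean(midparent)).^2)); % SE of the slope

    t  = b(2) / SE                                        % t-statistic for the slope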

Then we get the p-value by comparing to the t(n-2) distribution

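For example (tcdf is in the Statistics Toolbox, like glmfit):

    % Two-tailed p-value from the t distribution with n-2 degrees of freedom
    p = 2 * (1 - tcdf(abs(t), n-2))

    % As a check, glmfit can return the same quantities in its stats output:
    [b, dev, stats] = glmfit(midparent, child);
    stats.t(2)    % t-statistic for the slope
    stats.p(2)    % p-value for the slope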

Exercise: Standard error of the regression slope

What does it mean to talk about the Standard Error for a slope, actually?

Last week, we looked at the Standard Error of the Mean (SEM) for samples:

SEM = s / sqrt(n)

... where s is the sample standard deviation and n is the sample size.
In other words, the SEM for the sample tells us about the spread of our estimate of the sample mean - which is important if we want to say whether the difference between the sample mean and some reference value (like a population mean, or zero) could have arisen due to chance.

The SE of the regression slope, similarly, tells us about the distribution of possible slopes that we might have got if we had sampled different data from the same underlying distribution.

To get a feel for this, let's try grabbing smaller samples from our 'population' of 928 children and running the regression on them, then plotting all the resulting regression lines on the same graph to see how the slope varies:
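A sketch of one way to do this (randsample, like glmfit, is in the Statistics Toolbox; the number of subsamples is an arbitrary choice):

    % Draw random subsamples of children, refit the regression, and overlay the lines
    nSamples   = 20;     % how many subsamples to draw
    sampleSize = 10;     % try 10, then 100
    n    = size(heights, 1);
    xFit = min(midparent):0.1:max(midparent);

    figure; hold on
    for i = 1:nSamples
        idx = randsample(n, sampleSize);              % random subset of rows
        bi  = glmfit(midparent(idx), child(idx));
        plot(xFit, bi(1) + bi(2)*xFit, 'b-')
    end
    xlabel('Midparent height (inches)')
    ylabel('Child height (inches)')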



Take home message

Regression slopes based on samples of 10 and 100 subjects from the same subject population


Although the statistic we are looking at is different from last week (a sample regression slope rather than a sample mean), the logic of the significance test is the same: we ask how often we would have obtained a regression slope as far from zero as the one we observed, just by chance, if we repeated the experiment lots of times.



Further Exercises

Look back at the Oxford Weather data from the 2nd class. There is clearly a relationship between temperature and month - but it is not a straight line. How can we fit this with regression?

HINT: it looks sinusoidal to me...
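A sketch of one approach (the variable names month and temperature are assumptions, not from the handout): a sinusoid with unknown amplitude and phase can be written as a weighted sum of a sine and a cosine with a 12-month period, so the model stays linear in its coefficients and glmfit can still fit it.

    % Regress temperature on sine and cosine terms with a 12-month period
    X = [sin(2*pi*month/12), cos(2*pi*month/12)];
    b = glmfit(X, temperature);          % b(1) = intercept, b(2:3) = sine/cosine weights

    % Plot the data and the fitted seasonal curve
    monthFine = linspace(1, 12, 100)';
    tempFit   = b(1) + b(2)*sin(2*pi*monthFine/12) + b(3)*cos(2*pi*monthFine/12);

    figure; hold on
    plot(month, temperature, 'k.')
    plot(monthFine, tempFit, 'r-')
    xlabel('Month'); ylabel('Temperature')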