Data Analysis for Neuroscientists VI:
Advanced Regression




Recap: linear regression model

Last week we looked at how one variable (the response variable, y) can be predicted from another variable (the explanatory variable, x).

We fitted lines of the form:

ŷ = mx + c

But using slightly different notation:
ŷ = β̂1x + β̂0

We also considered the residuals, y − ŷ, which we assume are independent and normally distributed, with mean 0 and standard deviation σ.

Putting this all together, we have a model in which we propose that our data y depend on the value of x, plus some constant, plus Gaussian noise:

y = β0 + β1x + N(0, σ²)

By a model, I mean a description of how our data y are generated.



Model ⇔ Data

Last time we looked at Galton's child vs. midparent data and fitted a linear regression model to them. Let's just do that again quickly; one possible solution is sketched after the list below.

  • Load the file parentChildHeight.txt, which contains:

    • the heights of parents, in inches, in the first column
    • the heights of their children, in inches, in the second column

  • Fit a linear regression line to the data using glmfit
  • ?
  • Plot the data
  • ?
  • Add the regression line to the plot
  • ?
  • Get the residuals by:

    • working out the predicted child's height ŷ for each parent in the sample
    • and subtracting it from the true child's height y

    ?
  • Work out the standard deviation of the residuals
  • ?
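Here is one possible solution, as a sketch (the variable names are just illustrative):

>> data = load('parentChildHeight.txt');            % column 1 = parent, column 2 = child
>> parent = data(:,1); child = data(:,2);
>> b = glmfit(parent, child, 'normal')              % b(1) is beta0, b(2) is beta1
>> plot(parent, child, 'k.'); hold on               % plot the data
>> xFit = linspace(min(parent), max(parent), 100)';
>> plot(xFit, b(1) + b(2)*xFit, 'r-')               % add the regression line
>> yHat = b(1) + b(2)*parent;                       % predicted child height for each parent
>> residuals = child - yHat;                        % true minus predicted
>> std(residuals)                                   % standard deviation of the residuals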

So we have a regression model in which the child's height y is predicted from the parent's height x, plus some noise:

y = 23.9 + 0.64x + N(0, 2.23²)


Different data, different model

Let's consider a different situation in which we again want to model a response variable y, based on some predictor variable x.

Consider an experiment in which participants view faces that are morphs between fearful and angry facial expressions. Their task is to classify each image as fearful or angry.

If we code the response "angry" as 1 and "fearful" as 0, our data are a 1 or a 0 for each trial, at each morph level.

We would like some model that predicts the probability of responding "angry" as a function of the morph level x.

Can we use regression to solve this problem?


Linear regression won't work

Download the simulated data file FearfulAngry.txt, and load it into Matlab as data.

Let's try to fit a linear regression line (a sketch of one possible solution follows the list below).

  • Column 1 of data contains the % angry morph on each trial

    • 100% is pure angry face
    • 0% is pure fear face

  • Column 2 contains the response on each trial

    • 1 = responded "angry"
    • 0 = responded "fear"

  • Use glmfit to fit a regression line
  • ?
  • Plot the data and the regression line
  • ?
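One possible solution, as a sketch (asking glmfit for the 'normal' distribution gives ordinary linear regression):

>> data = load('FearfulAngry.txt');
>> b = glmfit(data(:,1), data(:,2), 'normal')   % ordinary linear regression
>> plot(data(:,1), data(:,2), 'k.'); hold on    % plot the raw responses
>> x = (0:100)';
>> plot(x, b(1) + b(2)*x, 'r-')                 % the straight regression line
>> xlabel('% angry morph'); ylabel('response (1 = angry, 0 = fear)')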

What problems do you notice with the regression line?

I can see a couple! HINT: Are all the data values it predicts possible?

?

Back to the drawing board.



Logistic regression to the rescue

In linear regression, our model looked like this:

y = β0 + β1x + N(0, σ²)

In logistic regression, we use a different model, in which the probability of responding 1 follows a logistic function of x:

p(y = 1) = 1 / (1 + e^-(β0 + β1x))

Although this might look unrelated, note that the exponent contains β0 + β1x, the same linear function of x that appears in linear regression.

The result is a regression line that is not straight but sigmoidal.

Fortunately, we can fit this sigmoidal line using glmfit. Let's try!

?
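One possible solution, as a sketch (assuming data is still loaded as above):

>> b = glmfit(data(:,1), data(:,2), 'binomial')   % the logit link is glmfit's default for 'binomial'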

Now we can plot the data with the regression line:
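For example, as a sketch (glmval evaluates the fitted curve at the x-values we supply):

>> plot(data(:,1), data(:,2), 'k.'); hold on
>> x = (0:100)';
>> pHat = glmval(b, x, 'logit');                  % predicted p(respond "angry")
>> plot(x, pHat, 'r-')                            % the sigmoidal regression line
>> xlabel('% angry morph'); ylabel('p(respond "angry")')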



Thinking point

Why is the sigmoidal logistic regression line more sensible than a straight line (linear regression) for these binary data?

?


Interpreting β0 and β1 in logistic regression

...is not straightforward

Since β0 is a constant and doesn't interact with x, changing β0 moves the logistic curve along the x-axis without changing its shape.

But changing the value of β1 changes both the slope of the curve and its location with respect to the x-axis, as the sketch below shows.
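One way to see this is to plot logistic curves with a few different beta values (a sketch; the values are arbitrary):

>> x = (0:100)';
>> plot(x, glmval([-5; 0.1], x, 'logit'), 'b-'); hold on   % a baseline curve
>> plot(x, glmval([-2; 0.1], x, 'logit'), 'r-')            % larger beta0: same shape, shifted along x
>> plot(x, glmval([-5; 0.3], x, 'logit'), 'g-')            % larger beta1: steeper AND shifted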

Furthermore, we can no longer do a t-test on the regression coefficient. Instead, we need to refer to the stats output of glmfit to find out whether the effect of x on y is significant.

>> [b, d, stats] = glmfit(data(:,1),data(:,2),'binomial')
>> stats.p % tell me the p-value for each beta


Applications of logistic regression

Logistic regression is important for predicting binary responses of all sorts.

It is also used in machine learning. For example, a simple classifier algorithm might take a training data set containing items of two types (e.g., images of cats and dogs) and fit a logistic regression curve to some features of those images (e.g., ear size) to try to predict which images are cats and which are dogs.

This prediction can then be tested on a new dataset, to see if the boundary established using logistic regression is robust.
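As a minimal sketch (the variables trainFeature, trainLabel and testFeature are hypothetical; the labels code cat = 1, dog = 0):

>> b = glmfit(trainFeature, trainLabel, 'binomial');   % fit on the training set
>> pCat = glmval(b, testFeature, 'logit');             % predicted p(cat) for each new item
>> predictedLabel = pCat > 0.5;                        % classify as cat when p(cat) > 0.5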



Generalized linear model

Logistic regression is just one member of a broader family of regression models called the Generalized Linear Model.

The Generalized Linear Model gives a framework for modelling relationships between explanatory variables and response variables, for various different data distributions.

glmfit stands for Generalized Linear Model fit.
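For example, the same function can fit a Poisson regression, which suits count data such as spike counts (a sketch; spikeCounts and stimIntensity are hypothetical variables):

>> b = glmfit(stimIntensity, spikeCounts, 'poisson')   % the log link is the default for 'poisson'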



Further Exercises

More on regression - for the enthusiast!

These are not in any particular order - the fMRI exercise is probably the most fun, but it is fairly long.

Multiple regression

Linear relationship between y and a function of x

What if an explanatory/independent variable is binary?

The revenge of the fMRI data!