Data Analysis for Neuroscientists III:
Standardized Scores

How unusual is my sample?

Let's start by generating two simulated 'datasets' iq and h with 1000 data points in each:

  • iq - drawn from a Normal distribution with mean 100 and standard deviation 15
  • h - drawn from a Normal distribution with mean 179cm and standard deviation 7cm
    • These are the true statistics for English men

You can do this using the function randn, which generates normally distributed random numbers with a mean of 0 and a standard deviation of 1.

HINT
?
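One possible way to do this (a sketch rather than the only answer) is to multiply the output of randn by the desired standard deviation and then add the desired mean. The variable name n is chosen here purely for illustration.

```matlab
% Simulate two datasets by scaling and shifting standard normal draws.
n  = 1000;                     % number of data points in each dataset
iq = 100 + 15 * randn(n, 1);   % IQ: mean 100, standard deviation 15
h  = 179 + 7  * randn(n, 1);   % height in cm: mean 179, standard deviation 7

% The sample statistics should be close to, but not exactly, the values above.
fprintf('IQ:     mean %.1f, SD %.1f\n', mean(iq), std(iq));
fprintf('Height: mean %.1f, SD %.1f\n', mean(h),  std(h));
```

Because the numbers are random, the printed means and standard deviations will differ slightly every time the code is run.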

Now then.
Say we encounter a fine specimen of mankind, with height = 195cm and IQ = 130. Let's call him Eric.
Which is more unusual, Eric's height or Eric's IQ?


Distance from the mean

How much higher is Eric's IQ (130) than the population mean?

?

How much taller is Eric (height 195cm) than the population mean?

?

So we know that Eric is x IQ points higher than the population mean, and y cm taller than the population mean.
How can we compare these two values to figure out which is more unusual, his height or his IQ?

?

Standardized differences

What is the standard deviation of IQ scores in the population?
What is the population standard deviation of men's heights?

?

So how many standard deviations from the mean is Eric's height?

?

...and what about his IQ?

?

So which is more unusual?

?
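The steps in the last two sections can be sketched in a few lines, using the population values given at the start (mean 100 and SD 15 for IQ, mean 179cm and SD 7cm for height). The variable names are chosen here for illustration only.

```matlab
% Raw distances from the population mean.
d_iq = 130 - 100;    % 30 IQ points above the mean
d_h  = 195 - 179;    % 16 cm above the mean

% The same distances expressed in units of standard deviations.
z_iq = d_iq / 15;    % 2 standard deviations
z_h  = d_h  / 7;     % about 2.29 standard deviations

fprintf('IQ: %.2f SDs above the mean, height: %.2f SDs above the mean\n', z_iq, z_h);
```

The raw differences (30 points versus 16cm) cannot be compared directly because they are in different units; dividing by the standard deviation puts both on the same scale.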

Error bars

You will sometimes hear one scientist berating another for showing data without error bars.

In relation to the work you have just done, why do you think this is?

?

The Z score

Measuring the deviation of a sample in terms of how many standard deviations it is from the mean gives us a common currency for comparing how unusual samples are.

The number of (population) standard deviations a sample lies from the (population) mean is such a useful statistic that it has a name - the Z score.

Can you write down a formula for the Z-score?

?

The definition of the Z-score is the distance of a point x from the mean μ of a normal distribution, in units of the standard deviation of that normal distribution, σ.
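Written out as an equation, using the same symbols as the definition above:

```latex
Z = \frac{x - \mu}{\sigma}
```

For Eric's IQ, for example, this gives Z = (130 - 100) / 15 = 2.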

We can also think about the Z-score as 'mapping' a point in our data distribution to another, well-known distribution - the Standard Normal Distribution.



The Standard Normal Distribution

What is the mean Z-score for the 1000 heights in our sample?
What is the standard deviation of the Z-scores?

?

Do the same for the 1000 IQ samples.

What are the mean and standard deviation of the Z-scores in this case?

Hopefully, you should have found that in both cases:

  • The mean Z-score is about zero
  • The standard deviation of Z-scores is about 1.

Can you see why this is, from the equation for the Z-score?

In fact, because our height and IQ data were normally distributed (we made them that way - remember we generated the "data" using randn), the process of converting the height/IQ values to Z-scores transforms their distribution to another normal distribution, with mean = 0 and standard deviation = 1.

This is called the Standard Normal Distribution.

It is the distribution from which the function randn draws numbers.
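As a rough check of the claim about the mean and standard deviation, the sketch below regenerates the simulated heights and IQs and converts them to Z-scores using the population mean and standard deviation. The variable names zh and ziq are assumptions made for this example.

```matlab
% Regenerate the simulated data, as at the start of the session.
h  = 179 + 7  * randn(1000, 1);
iq = 100 + 15 * randn(1000, 1);

% Convert every value to a Z-score: subtract the mean, divide by the SD.
zh  = (h  - 179) / 7;
ziq = (iq - 100) / 15;

% Both sets of Z-scores should have mean near 0 and SD near 1.
fprintf('Height Z-scores: mean %.3f, SD %.3f\n', mean(zh),  std(zh));
fprintf('IQ Z-scores:     mean %.3f, SD %.3f\n', mean(ziq), std(ziq));
```

Subtracting the mean shifts the centre of the distribution to zero, and dividing by the standard deviation rescales its spread to one, which is why both datasets end up on the same scale.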

We can plot a Probability Density Function (PDF) for the standard Normal distribution using normpdf.
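A minimal plotting sketch follows. normpdf belongs to the Statistics and Machine Learning Toolbox, so to keep the example runnable without it, the density is written out explicitly here; if the toolbox is installed, normpdf(z) returns the same values. The names z and dens are chosen for this example.

```matlab
% Grid of Z values covering most of the distribution.
z = linspace(-4, 4, 801);

% Standard normal density (mean 0, SD 1); normpdf(z) gives the same values.
dens = exp(-z.^2 / 2) / sqrt(2 * pi);

plot(z, dens);
xlabel('Z');
ylabel('probability density');
title('Standard Normal distribution');
```

The curve peaks at Z = 0 and the total area under it is 1, which is what allows areas under the curve to be read as probabilities.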

DIVERSION - skip if short of time!

Right. After that diversion into the world of plotting graphs, let's take another look at that plot of the standard Normal distribution.

And let's also compare that to the distribution of heights in our "sample" of 1000 men, and the distribution of Z-scores for heights. You can nicely see the distributions of those data if you plot a histogram of them using hist.

?
  • What do you notice about the shape of the three distributions?
    ?
  • What about the values on the x-axis?
    ?
  • What about the values on the y-axis?
    ?

Probability of observing a given value

Say I want to know how likely it is that if I test somebody's IQ, they will be at least as smart as Eric (who, as you no doubt remember, had an IQ of 130).

Last week, we calculated how many people in our sample had IQs of at least 130, using logical operators. Can you remember how to do this?

  • How many people in the current "sample" have IQs of at least 130?

Have a look at the histogram of raw IQ scores. How could you work out the proportion of subjects with IQs of at least 130 from this histogram?

?
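As a reminder of the logical-operator approach, here is a short sketch. It regenerates the simulated IQ scores so that it runs on its own; the names atLeast130, nHigh and pHigh are assumptions made for this example.

```matlab
% Simulated IQ scores, generated as at the start of the session.
iq = 100 + 15 * randn(1000, 1);

% Logical vector: true (1) where the IQ is at least 130, false (0) elsewhere.
atLeast130 = iq >= 130;

nHigh = sum(atLeast130);    % number of people with IQ of at least 130
pHigh = mean(atLeast130);   % the same thing as a proportion of the sample

fprintf('%d of %d people (proportion %.3f) have IQ >= 130\n', nHigh, numel(iq), pHigh);
```

On the histogram, the same proportion corresponds to adding up the heights of the bars at or above 130 and dividing by the total number of data points.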

Now, by analogy, we can use the Standard Normal distribution to work out the probability that someone has an IQ of at least 130.

First, we work out the Z-score corresponding to an IQ of 130.

Then, we find the area under the curve, to the right of this Z-score.

Because the area under the curve for the whole standard normal distribution adds up to 1, the proportion of it to the right of a given Z-score gives the probability of observing a Z-score of at least this value. This is analogous to the way the sum of the heights of the histogram bars to the right of 130 gave the proportion of the 1000 people in the sample who have IQs of at least 130.

Conveniently, MATLAB has a function that will work out the area under a normal distribution for you. It's called normcdf.

Use normcdf to work out the probability of obtaining a Z-score of:
  • Z > 3.1
  • Z < -1.96
  • Z > 2.3
  • Z < 2.3
?
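The sketch below works through these areas. Since normcdf requires the Statistics and Machine Learning Toolbox, the example defines a stand-in helper called Phi (a name invented here) from the base function erfc; normcdf(z) returns the same values when the toolbox is available. Both give the area to the LEFT of z, so areas to the right are obtained by subtracting from 1.

```matlab
% Area under the standard normal curve to the left of z, i.e. P(Z < z).
% Equivalent to normcdf(z) from the Statistics and Machine Learning Toolbox.
Phi = @(z) 0.5 * erfc(-z / sqrt(2));

% Probability of an IQ of at least 130: Z-score first, then the right-hand tail.
z_eric = (130 - 100) / 15;    % Z = 2
p_eric = 1 - Phi(z_eric);     % about 0.023

% The four cases from the list above.
p1 = 1 - Phi(3.1);     % P(Z > 3.1)
p2 = Phi(-1.96);       % P(Z < -1.96)
p3 = 1 - Phi(2.3);     % P(Z > 2.3)
p4 = Phi(2.3);         % P(Z < 2.3)

fprintf('%.4f  %.4f  %.4f  %.4f\n', p1, p2, p3, p4);
```

Note that p3 and p4 add up to 1, because together they cover the whole area under the curve.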

Because the area under the curve to the left/right of these Z-scores tells us the probability of obtaining a Z-score at least that small/large, these areas define the p-values corresponding to those Z-scores.

Hopefully, the Z-scores you just looked at correspond to some nice round-numbered p-values (you may need to round them to two significant figures).

  • Which commonly used significance thresholds do the Z-values correspond to?
  • Which actual values (in cm/ IQ points) do the Z-scores correspond to?

Interpretation of Z-scores

Here are the heights of some more men:

181 173 175 169 174 182 190 171 181 182

Work out their Z-scores, using only one line of code.

?
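One way to do this in a single line relies on MATLAB applying arithmetic between a vector and a scalar to every element of the vector. The sketch assumes the same population mean (179cm) and standard deviation (7cm) as before; the name z is chosen for this example.

```matlab
% Z-scores for all ten heights at once: subtract the mean, divide by the SD.
z = ([181 173 175 169 174 182 190 171 181 182] - 179) / 7;

disp(z);   % negative entries are men shorter than the population mean
```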

Some of those Z-scores are negative! What does that mean?

?




Non-normal data

As you have seen, the Z-score relies on our ability to map our data onto the standard normal distribution, just by changing the values on the x-axis.

What if the data are not normally distributed in the first place?


Reaction time data

Go to the website humanbenchmark.com and test your reaction time!

Take a look at their statistics page, where they show a histogram of 600,000 RTs that people generated in the last month.



Where would your RT fall on this curve?

I captured the data from the image and made them into a text file, RT.txt.

Getting the mean and standard deviation

What are the mean and standard deviation of reaction times?

Ok, tell me the answer!

?

Z-score your data

Now you know the mean and standard deviation, you can convert your RT into a Z-score.

The problem

RT data are notoriously non-normal.

  • What RT value is 2 standard deviations below the mean?
    ?
  • What RT value is 2 standard deviations above the mean?
    ?
  • How many RTs were recorded more than 2 standard deviations below the mean?
    ?
  • How many RTs were recorded more than 2 standard deviations above the mean?

Clearly, for these non-normal data, Z-scores do not tell us how unlikely a given observation is!


    Log Transform

    In the lognormal distribution, the log of the data values follows a normal distribution.

    So if we take the log of each RT value, the resulting log transformed values log(RT) should be normally distributed.

    Calculate the log-transformed RTs for the given data and plot them.

    ?

    Hopefully you are convinced that the log RTs are closer to being Normally distributed than the raw RTs were.

    Hopefully, the Z-scores are no longer the same size. It is now apparent that the lower RT is more unusual.

    Log transforming data is a common trick to make right-skewed data approximately normally distributed.
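The effect of the log transform can be illustrated with simulated data. The sketch below does NOT use RT.txt: it draws made-up lognormal "reaction times" (the parameters 5.6 and 0.3 are assumptions picked to give values of a few hundred milliseconds) and counts how many fall more than 2 standard deviations from the mean, before and after taking logs.

```matlab
% Simulated stand-in for RT data (not the humanbenchmark data): lognormal samples.
rt = exp(5.6 + 0.3 * randn(10000, 1));   % right-skewed, median near 270 ms

% Raw scale: count values beyond 2 SDs either side of the mean.
nBelowRaw = sum(rt < mean(rt) - 2 * std(rt));
nAboveRaw = sum(rt > mean(rt) + 2 * std(rt));

% Log scale: repeat the same count on the log-transformed values.
logrt = log(rt);
nBelowLog = sum(logrt < mean(logrt) - 2 * std(logrt));
nAboveLog = sum(logrt > mean(logrt) + 2 * std(logrt));

fprintf('Raw: %d below, %d above\n', nBelowRaw, nAboveRaw);
fprintf('Log: %d below, %d above\n', nBelowLog, nAboveLog);
```

On the raw scale, far more values lie beyond +2 SDs than beyond -2 SDs, so equal-sized Z-scores are not equally unusual. After the log transform the two tails contain roughly equal numbers, as expected for a normal distribution.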



    Probability Matching (no exercises in this bit)

    Log transforming data is a helpful trick for making right-skewed, positive data more Normal.

    There is a more general approach for mapping non-Normal data onto a normal distribution though: probability matching.

    This image illustrates the concept of probability matching for two probability density functions (probability distributions adjusted so that the area under the whole curve adds up to 1).

    On the left we have a distribution of Inter-spike Intervals that is heavily right skewed. On the right, we have a Standard Normal Distribution.

    For any probability density function, the area under the curve of the distribution, between two values on the x-axis, gives the probability of observing a data value in that range. In the case that we are looking at the area in the tail of the distribution (area under the curve from the edge of the distribution to some value X) the area under the curve is the same as the p-value for X.

    I have sliced the probability distribution up into slices of equal area.

    As you can see, the spacing of the slices on the x-axis is not even, and is not the same for the two distributions.

    But hopefully you can see intuitively that the boundaries between slices correspond to the same p-values in both cases. So matching the boundaries between slices of equal area gives a natural "mapping" or transformation between the two distributions.

    This is a Probability Matching transformation and is commonly used in data analysis packages - normally to convert non-Normal data to a Normal distribution so that Normal-assumption statistics like Z-scores and t-tests can be used.

    For example, the Oxford fMRI analysis package FEAT makes use of probability transformations so that "activation levels" can be reported as Z-scores.
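To make the idea concrete, here is a minimal rank-based sketch of probability matching on simulated right-skewed data. It is one simple way to implement the idea and is not claimed to be the method used by FEAT or any other package. Each data value is assigned its cumulative probability from its rank, and that probability is then mapped to the Z value with the same cumulative probability. The inverse normal CDF is written with the base function erfcinv; norminv(p) from the Statistics and Machine Learning Toolbox gives the same result. All variable names are chosen for this example.

```matlab
% Simulated right-skewed data (a stand-in for something like inter-spike intervals).
x = exp(randn(1000, 1));
n = numel(x);

% Rank of each value: 1 for the smallest, n for the largest.
[~, order]   = sort(x);
ranks        = zeros(n, 1);
ranks(order) = (1:n)';

% Empirical cumulative probability of each value, kept strictly between 0 and 1.
p = (ranks - 0.5) / n;

% Z value with the same cumulative probability; equivalent to norminv(p).
z = -sqrt(2) * erfcinv(2 * p);

fprintf('Transformed data: mean %.3f, SD %.3f\n', mean(z), std(z));
```

The transformation preserves the ordering of the data (the largest x gets the largest z) but changes the spacing, exactly as the slice boundaries do in the picture described above.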



    Extra Exercises

    Pick and choose - you won't have time for all of them



    Plotting Data

    If you skipped the 'diversion' earlier and now have time, go back and do that exercise

    You can find more material about plotting data in the MATLAB primer section on plotting data (Section 4).

    There is more fun to be had in Stormy Attaway, chapter 11, including animations!



    Writing functions in MATLAB

    Can you write a MATLAB function myzscore that takes in a data value x, the mean μ and the standard deviation σ, and returns the Z-score, Z?

    You can read about MATLAB functions in:

    If you are not sure how to tackle this, please work through that section first rather than just looking at the answer!

    ?

    You can test your function on Eric's data. If you do:

    >> myzscore(195,179,7)
    Do you get the answer you expect?
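For comparison once you have had a go, here is one possible version, to be saved in a file called myzscore.m. The input names x, mu and sigma are assumptions based on the call above (data value, mean, standard deviation).

```matlab
function Z = myzscore(x, mu, sigma)
% MYZSCORE  Z-score of x, given a mean mu and a standard deviation sigma.
%   x may be a single number or a vector; mu and sigma are single numbers.
    Z = (x - mu) ./ sigma;
end
```

With Eric's height, myzscore(195,179,7) returns 16/7, or about 2.29, matching the hand calculation earlier. The function is deliberately not called zscore, because the Statistics and Machine Learning Toolbox already provides a function with that name.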



    Probability Integral Transform (advanced)

    I have thrown in a difficult exercise here on probability matching. This is more one for any computationally-savvy students who have whizzed through the rest already.

    It is quite long so I have hidden it behind a click. Otherwise people reading the main text will get a false impression of how much there is to get through.

    HIT ME!