Data Analysis for Neuroscientists IV:
Sample vs. Population


Island of Shetland - population 22,210



The population

Last week we generated a simulated height 'dataset':

  • h - drawn from a Normal distribution with mean 179 cm and standard deviation 7 cm
    • These are the true statistics for British men

Shetland is an island to the North of Scotland with a population of 22,210 and some impressive weather.

Generate a vector with simulated heights for 10,105 men (let's call this the entire male population of Shetland) with these statistics.
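
One way to do it (a sketch - I call the vector population here, and reuse that name below; randn draws from a standard Normal, which we scale by the SD and shift by the mean):

>> population = 179 + 7*randn(10105,1);   % heights in cm: mean 179, SD 7
>> hist(population, 50)                   % 50-bin histogram to check it looks Normal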


Heights of Shetland Men


The sample

Whilst out walking on Shetland, I encounter our old friend Eric, with a group of 5 men.

From their height, I suspect they are not from around here. They seem to be a little taller than the local men.

Also, I don't like the look of them.

How can I determine whether these are bona fide Shetlanders or possibly an invading army?



Simulating samples from the null population

One way we can work out if Eric and his friends really belong to the local population is to ask how likely it is that a random sample of 5 local men would be as tall as they are.

Let's explore what happens if we draw samples of 5 men from our population.

First, let's draw one random sample of 5 men from our population by:
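
For example (one possible way - randi picks random integer indices, so this samples with replacement):

>> mySample = population(randi(10105, 5, 1));   % 5 men chosen at random
>> mean(mySample)                               % mean height of this one sample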


Now we are going to draw 1000 samples of 5 men each from the population of Shetlanders.
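
A sketch of one way to do this, assuming Eric and co's mean height was 184 cm (consistent with the ~6% figure quoted below):

>> nSamples = 1000;
>> sampleMeans = zeros(nSamples, 1);
>> for i = 1:nSamples
>>     s = population(randi(10105, 5, 1));   % one sample of 5 men
>>     sampleMeans(i) = mean(s);
>> end
>> hist(sampleMeans, 30)                     % the null distribution of the sample mean
>> mean(sampleMeans >= 184)                  % proportion of samples at least as tall as Eric and co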

Notice something really important here: by working out the null distribution of the sample mean we were able to put a number on how unlikely it was that the test sample (Eric and co) were drawn from the null distribution (the population of Shetlanders).



The effect of sample size

The mean height of Eric's group was unusual - if we draw samples of 5 men from the population of Shetlanders, only about 6% of those samples would have a mean height as great as Eric and co's.

However, this would not pass a test of statistical significance - usually a sample would be considered significantly different from the null population only if it would have occurred less than 5% (or 1%, or 0.1%) of the time due to chance.

What if some more of Eric's friends show up, so that we now have a test sample of 20 possible Viking Invaders with mean height 184 cm? How likely is this under the null hypothesis?

What about samples of size 50?
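
The same simulation works for the larger samples - only the sample size changes (again taking 184 cm as the test-sample mean, as given above):

>> sampleMeans = zeros(1000, 1);
>> for n = [20 50]
>>     for i = 1:1000
>>         sampleMeans(i) = mean(population(randi(10105, n, 1)));
>>     end
>>     fprintf('n = %d: proportion >= 184 cm = %.4f\n', n, mean(sampleMeans >= 184))
>> end

With n = 20 the proportion should fall well below 1%, and with n = 50 essentially no simulated sample is that tall - the larger the sample, the more surprising the same 5 cm deviation becomes.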



Standard error of the mean (SEM)

We have seen that the larger the sample, the tighter the distribution of the sample mean.

This makes intuitive sense: if the deviation of the sample mean from the population mean is just due to chance (ie, I happen to have picked unusually tall Shetlanders), then the more people I put into my sample, the less likely it is that they are all tall guys.

The relationship between sample size, and the spread of the sample means we get if we run lots of samples, is captured by the following equation:

SEM = s / √n

... where s is the sample standard deviation, n is the number of data points in the sample, and SEM stands for the Standard Error of the Mean

The Standard Error of the Mean is none other than the standard deviation of the distribution of sample means, for a given sample size n - the spread of the different distributions plotted at the top of this section.

We are not going to prove mathematically that the SEM is inversely proportional to √n, but the result is so important that it is worth checking it by simulation.

Can you adapt the script you made earlier, so that it generates 50 samples each of size 1,2,3,...100 (note 50 and not 1000 samples - those MSc Centre computers may explode if you make 1000 samples each for 100 values of n!), and records their sample means and standard deviations?
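
One possible solution (a sketch, reusing the population vector from above):

>> ns = 1:100;                                % sample sizes to test
>> nReps = 50;                                % 50 samples per size
>> sampleMeans = zeros(nReps, numel(ns));
>> sampleSDs   = zeros(nReps, numel(ns));
>> for j = 1:numel(ns)
>>     for i = 1:nReps
>>         s = population(randi(10105, ns(j), 1));
>>         sampleMeans(i,j) = mean(s);
>>         sampleSDs(i,j)   = std(s);
>>     end
>> end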


Then we can plot the standard deviation of the sample means (that is, the SEM) against n:
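
For example (std acts along the first dimension, giving one value per sample size):

>> plot(ns, std(sampleMeans), 'b.')           % SD across the 50 sample means, for each n
>> xlabel('sample size n')
>> ylabel('SD of sample means (SEM)')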


And add a line showing the theoretical value of SEM from the formula above for each n, for comparison:
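
Since we know the population SD is 7 cm, the theoretical curve is easy to add:

>> hold on
>> plot(ns, 7./sqrt(ns), 'r-')                % theoretical SEM = sigma/sqrt(n), with sigma = 7 cm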


Is it a good match?



Standard error vs. Standard Deviation

To recap: how unusual a sample is, given the null population, depends on how far the sample mean lies from the population mean, and on the sample size n

... and we determine whether a sample of n observations (men, in this case) differs from the population by considering the distribution of sample means we would get if we drew many samples from the null population. We compare the mean of the sample not to the null population itself, but to the distribution of sample means for samples drawn from the null distribution.

An aside...

Does the variability within the sample (the sample standard deviation) also depend on n?

No: the sample standard deviation is an estimate of the population standard deviation, so on average it does not shrink as n grows. It is the SEM, not the sample SD, that tells us how unusual our sample is.
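
You can check this with the sampleSDs matrix recorded above (a one-line sketch; the average sample SD sits near the population value of 7 cm, apart from the smallest n, where the SD is a very noisy estimate):

>> plot(ns, mean(sampleSDs), 'k.')            % average sample SD vs n: roughly flat, unlike the SEM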



The central limit theorem

The distribution of sample means looked approximately Normal for our height data.

Perhaps that was not surprising given that the height data themselves were Normal.

But one of the most powerful results in statistics is that the distribution of sample means becomes approximately Normal as n grows, almost regardless of the shape of the distribution of the underlying data.

This result is called the Central Limit Theorem and can be proved mathematically but we are not going to do that now.

Instead we are going to prove it to ourselves using simulation.


Some very non-Normal data

Download the data file "nonNormalData.txt" which contains 10,000 samples from this distribution, and load it into Matlab as a vector called data
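
A minimal way to do this (assuming the file is saved in your current directory):

>> data = load('nonNormalData.txt');
>> hist(data, 50)                             % histogram of the raw data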

These data are very non-Normal: in fact, they are bimodal (there are two peaks).

However, the Central Limit Theorem tells us that if we draw samples from this distribution, their sample means will be Normally distributed, as long as the samples are large enough.

Let's try it for a very small sample size, n = 2.
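
A sketch, resampling from data with replacement:

>> n = 2;
>> sampleMeans = zeros(1000, 1);
>> for i = 1:1000
>>     sampleMeans(i) = mean(data(randi(length(data), n, 1)));
>> end
>> hist(sampleMeans, 30)                      % clearly non-Normal for n = 2

Re-running this with n = 5 and then n = 10 should show the peaks merging into a single, increasingly Normal-looking bump.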

But what happens if I go for a slightly larger sample, n = 5 or n = 10?

In practice, the approximately Normal distribution of the sample mean emerges from even very non-Normal data distributions with quite small values of n.

As n tends to infinity, the distribution of the sample mean becomes exactly Normal.


SEM for non-Normal data

What about the spread of the distribution of sample means?
Does the relationship to n hold even for very non-Normal data?
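
We can rerun the earlier SEM simulation with data in place of population, estimating sigma with std(data) - a sketch:

>> ns = 1:100; nReps = 50;
>> sems = zeros(size(ns));
>> for j = 1:numel(ns)
>>     m = zeros(nReps, 1);
>>     for i = 1:nReps
>>         m(i) = mean(data(randi(length(data), ns(j), 1)));
>>     end
>>     sems(j) = std(m);                      % empirical SEM for this sample size
>> end
>> plot(ns, sems, 'b.')
>> hold on
>> plot(ns, std(data)./sqrt(ns), 'r-')        % theoretical s/sqrt(n)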

Hopefully you found that the standard deviation of the distribution of sample means, which is exactly the SEM, is inversely proportional to √n even for highly non-Normal data.



The t-distribution (a special case)


The Central Limit Theorem tells us that when n is large, the null distribution of sample means is (very) nearly Normal.

As we have seen, when n is small, the shape of the distribution of sample means depends on the data distribution.

For Normally distributed data, there is a known, precise distribution for the sample mean for any value of n: the deviation of the sample mean from the population mean, measured in units of the estimated standard error (s/√n), follows Student's t-distribution with n − 1 degrees of freedom.

As you can see from the picture above, the t-distribution has heavier tails and a more pointed peak than the Normal distribution. These features together are sometimes called positive kurtosis.

Remember that the probability of an observation happening due to chance is given by the area under the curve to the right of the point on the x-axis corresponding to the observation.

For example, say the sample mean is 3 standard errors from the population mean and n=5.

Although the t-distribution and the Normal distribution look quite similar, they differ in the crucial region for determining statistical significance (ie, the tails)

For a small sample size n=5, an extreme observation 3 SEMs from the population mean is over 10x as likely (based on the t(4) distribution) as we would predict if we assumed the sample mean exactly followed the Normal distribution.
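
You can verify this with the Statistics Toolbox functions normcdf and tcdf (the same toolbox that provides the normpdf used in the exercises below):

>> 1 - normcdf(3)   % tail probability if the sample mean were exactly Normal: about 0.0013
>> 1 - tcdf(3, 4)   % tail probability under t with n-1 = 4 degrees of freedom: about 0.02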

So if the data are Normally distributed, then we can work out exactly how unlikely the sample mean was, even for small n, using the t-distribution.



Extra exercises (advanced)

Generating samples from an arbitrary distribution

How did I generate the non-normal data above?

First, I made an arbitrary probability distribution over the values 1 to 100:

>> x = 1:100;
>> px = normpdf(x,80,5)/2 + gampdf(x,5,7);   % mixture of a Normal and a Gamma - hence two peaks
>> px = px./sum(px);                         % normalise so the probabilities sum to 1

Then, I generated 10,000 uniformly distributed random numbers between 0 and 1:

>> u = rand(10000,1);

Then I used the probability integral transform to convert these into samples from my desired distribution:

>> data = zeros(size(u));                     % preallocate for speed
>> for i=1:length(u)
>>     data(i) = x(find(cumsum(px) > u(i),1)); % first value whose cumulative probability exceeds u(i)
>> end

Can you see how this relates to the last section of last week's work on the probability integral transform?

Make your own probability distribution (try a trimodal one) and generate some data from it.
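
For instance, a trimodal distribution could be built from three Normal bumps (just one possibility, with made-up peak locations):

>> x = 1:100;
>> px = normpdf(x,20,4) + normpdf(x,50,4) + normpdf(x,80,4);  % three well-separated peaks
>> px = px./sum(px);                                          % normalise to sum to 1
>> u = rand(10000,1);
>> data2 = zeros(size(u));
>> for i = 1:length(u)
>>     data2(i) = x(find(cumsum(px) > u(i), 1));
>> end
>> hist(data2, 50)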