Data Analysis for Neuroscientists Session 2: MATLAB Refresher

Welcome to Data Analysis for Neuroscientists II

Today we will be looking in more detail at some common Matlab tasks that will come up repeatedly in the data analysis course

Using this tutorial

The command prompt

>>

indicates stuff you should type in to MATLAB.

Where there is a ? you can click to reveal the answer.

Try to work out a solution and try it out in MATLAB before you reveal the answer.

Writing following a % sign, usually green text, is a comment.

You don't have to type these into Matlab - they are my comments for you to read, to clarify what we are doing

Download the data files

Download the text files:

oxfordWeather.txt

cambridgeWeather.txt

londonWeather.txt

...and put them into your default Matlab directory

Each of these files contains a table with historical weather data.

First, find out which directory you are working in by opening matlab and typing

>> pwd

This command tells you which directory you are currently working in (pwd stands for 'print working directory'). When you first open Matlab, this will be your default Matlab directory.

Look at the data

Let's start with the Oxford data

Load the Oxford data

>> oxweather = load('oxfordWeather.txt');

Have a look at the data

>> oxweather % display the contents of matrix/variable called 'oxweather'

The table contains 7 columns. These are, in order,

year

month

avg max temperature that month (°C)

avg min temperature that month (°C)

number of days with air frost that month

total rainfall that month (mm)

total sunshine hours that month

It will be easier for you to read the table if you tell Matlab *not* to use scientific notation (eg, write 999 instead of 9.99e2)

>> format shortg % switch off scientific notation
>> oxweather % display the table again

What is the earliest year from which we have records?

?

1854.

You could find this out by scrolling up to the top of the table, or by asking Matlab to show just the first line of the table

>> oxweather(1,:)

Get Matlab to display the first 24 lines of the table (the first 2 years of data)

?

>> oxweather(1:24,:)

You may notice that in the 7th column (sunlight hours) there is no number, but a NaN

NaN stands for not a number.

It is a placeholder for cells in a matrix where the data value is missing

There are no sunshine data for the first few years, because the sunshine measuring instrument was not installed yet

Scroll through the data and find out when the instrument was installed

Why don't we just use a zero where the data value is missing?

?
If I did that, how would I know the difference between a month with 0 sunshine hours, and a month with a missing data value?
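
The difference between a missing value and a genuine zero can be checked on a small made-up vector. This is only a sketch: the numbers are invented for illustration (they are not taken from oxfordWeather.txt) and the variable names are my own. It uses the built-in functions isnan and find to count the missing values and to locate the first real measurement.

```matlab
% Toy sunshine column: two missing values, then real measurements
% (invented numbers, not taken from oxfordWeather.txt)
sunshine = [NaN; NaN; 0; 45.2; 60.1];

missing   = isnan(sunshine);      % logical vector, true where the value is NaN
nMissing  = sum(missing);         % how many values are missing
firstReal = find(~missing, 1);    % row of the first non-missing value

% A genuine zero is still a number, so it is not flagged as missing
zeroIsMissing = isnan(sunshine(3));

% NaN is never equal to anything, not even itself, so always test with isnan
nanEqualsNan = (NaN == NaN);
```

The same find(~isnan(...), 1) idea should work on column 7 of oxweather if you would rather not scroll to find when the instrument was installed.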

Indexing data

One of the most basic things I might like to do is to find data points that match some criterion.

For example, I might like to know in which years and months the average maximum daily temperature was below freezing

To do this I can use the find function to generate a set of indices. I will store the indices in a vector I call ix

>> ix = find(oxweather(:,3) < 0)

That gave me the line numbers in the table for the cold months when the average maximum daily temperature was below freezing. How can I use this to find out which years and months this happened in?

HINT
The first and second columns of the table correspond to the year and month

ANSWER
You could just pull out the whole line for each of the months identified in ix
>> oxweather(ix,:)

Note that we are using the vector ix to indicate which rows of oxweather should be displayed.

Can you see the analogy with the way we asked for row 1 only?

>> oxweather(1,:)
Or rows 1 to 10?
>> oxweather(1:10,:)

You could get just the relevant columns

>> oxweather(ix,[1 2 3 4]) % get columns 1,2,3 and 4, ie year, month, max and min temp
In this case there are only seven columns, but you can probably imagine that if your data table had 100 columns, it would be useful to be able to display a selection.
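
The row-and-column selection above can be tried out on a tiny made-up table. This is a sketch with invented values and my own variable names (toy, coldRows, isCold). It also shows a variation that is not used elsewhere in this tutorial: the true/false vector produced by the comparison can be used as an index directly, without calling find.

```matlab
% Toy table: year, month, avg max temperature (invented values)
toy = [2000 1 -0.5;
       2000 2  6.1;
       2000 7 25.3;
       2001 1 -1.2];

ix = find(toy(:,3) < 0);       % row numbers of the cold months, as in the tutorial
coldRows = toy(ix, [1 2]);     % year and month for those rows

% The same selection without find: use the true/false vector as the row index
isCold = toy(:,3) < 0;
coldRows2 = toy(isCold, [1 2]);
```

Both versions pick out the same rows. find is handy when you want to see the row numbers themselves.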

Use the technique above to identify the years/months with an average max temperature over 25°C.

As you can see it is pretty rare that we have sustained periods of really hot or cold weather. Depending where you are from, you may not consider 0°/25°C really cold or hot at all.

We do though.

How many of the years you identified can you spot in this list of official natural disasters?

I particularly like the fact that 'heatwave' summers where the average max temperature was over 25 degrees are considered so extreme that they are displayed in the same table as the Black Death and the Potato Famine...!

More on indexing matrices can be found in the Stormy Attaway book, Chapter 2, especially p33-44.

Scripts

A script is a text file (with a filename ending in .m) containing a list of commands you would like MATLAB to execute.

Running commands from a script is essentially the same as running them directly from the command line.

For example, say you wanted to make separate matrices for rainfall data and sunlight data.
You could type on the command line:

>> rainfall = oxweather(:,[1 2 6])
>> sunlight = oxweather(:,[1 2 7])

Equally, you could have typed the commands above into a script and saved it as getRainSun.m.

>> % getRainSun.m
>> oxweather = load('oxfordWeather.txt')
>> rainfall = oxweather(:,[1 2 6])
>> sunlight = oxweather(:,[1 2 7])

Then you could run the script, either by clicking the "run" button, or typing

>> clear; % clear the workspace first so you know the variables are not the same ones you created before!
>> getRainSun % note you don't type .m when running the script

In your workspace you should have three matrices - oxweather, rainfall and sunlight.

So why use scripts?

Since running commands from a script is the same as typing them on the command line, why do I keep getting you to write scripts? The reasons for using a script are mainly to do with convenience and record-keeping:

Record keeping

You can see exactly what commands you ran, so it is easier to find errors.
You can always check later (even months later) exactly what you did - this is really important when you are working with real data and want to publish some statistical result - you want to be sure how you obtained it, and your colleagues/supervisor may want to see exactly how you obtained it as well!

Accuracy & efficiency

You can execute whole series of commands quickly and repeatedly (so you can save a lot of time by scripting standard data analysis routines)
You won't make a mistake when repeating data analysis on many data sets, because you don't have to retype the commands each time

Loops

Some commands, like for loops, can be too long and complex to type directly on the command line without making a mistake
If there is an error in a script, MATLAB will tell you which line it is on - if you try to run directly from the command line you don't get this feedback.

>>> More on scripts in the Getting Started Guide, section on 'flow control'

>>> More on scripts in the Stormy Attaway book chapter 3

The for loop

Say I want to find out the mean hours of sunlight in each month (January-December). This is probably going to involve doing some calculation 12 times (once for each month). This can be done efficiently using a loop, as we shall now see.

First of all, how would you find out the mean number of sunlight hours in October?

?

First of all you need to identify all the October rows - make an index vector ix

?
>> ix = find(oxweather(:,2)==10); % column 2 is the month, October is the 10th month

Then you need to get the mean of sunlight hours (column 7) for just those rows

?
>> meanSunlight = nanmean(oxweather(ix,7));
>> % note I am using nanmean rather than mean
>> % nanmean gets the mean but just ignores any NaNs in the input (from before the sunlight instrument was installed)

Confused? It might be clearer if you do it in several steps and look at the output after each line:

?
>> octoberWeather = oxweather(ix,:)
>> octoberSunlight = octoberWeather(:,7) % column 7 is sunlight hours
>> meanSunlight = nanmean(octoberSunlight);
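
If nanmean is not available on your installation (as far as I know it ships with the Statistics and Machine Learning Toolbox rather than with base MATLAB), the same result can be obtained by removing the NaNs yourself before calling mean. The sketch below uses an invented toy table with the month in column 1 and sunshine hours in column 2. That layout and the variable names are assumptions for illustration only. Recent MATLAB releases also accept mean(x,'omitnan').

```matlab
% Toy data: month in column 1, sunshine hours in column 2 (invented values)
toy = [10 NaN;
        9 150;
       10 100;
       10 120];

ix = find(toy(:,1) == 10);            % the October rows
octoberSunlight = toy(ix, 2);         % includes one NaN

plainMean = mean(octoberSunlight);    % comes out as NaN, because one input is NaN

% Drop the NaNs by hand, then take the ordinary mean of what is left
valid = octoberSunlight(~isnan(octoberSunlight));
meanSunlight = mean(valid);           % mean of 100 and 120
```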

Now, if I want to repeat the process for each of months 1-12 (ie, January to December) I can do it using a for loop:

?

>> for month=1:12;
>>     disp(['month = ' num2str(month)]) % print the value of 'month' to the screen each time I go round the loop
>>     ix = find(oxweather(:,2)==month);
>>     meanSunlight(1, month) = month; % top row of the output table tells us which month is which
>>     meanSunlight(2, month) = nanmean(oxweather(ix,7));
>> end;

You can have a look at the results by printing the matrix meanSunlight to the command line

?

>> meanSunlight % note no semicolon!

What is the mean number of sunlight hours in June? What about in January?

?

>> meanSunlight(2, 6) % June
>> meanSunlight(2, 1) % January

These figures are the mean (across years) number of sunlight hours in the whole month. So 60 hours of sunlight in January works out at about 2hrs per day (ugh!). In practice, there will be a few sunny days and many days where the cloud is so thick and the sunlight so weak that it never really seems to get light at all...

You can spot patterns in the data more easily if you plot them

?

>> plot(meanSunlight(1,:), meanSunlight(2,:),'bo-');

meanSunlight(1,:) are the x-values of the data (the months 1-12), meanSunlight(2,:) are the y-values of the data (mean sunlight hours in months 1-12), 'bo-' tells it the color, line type and marker type.

Try switching:

'b' (blue) for 'r'

'o' (open circle markers) for '*'

'-' (solid line) for '--'

Did the hours of sunlight vary with month in the way you expected?

More exercises

Load the Cambridge and London weather data into matrices called camweather and uclweather

Use/modify your for loop to get plots for the mean sunlight hours at Oxford, Cambridge and UCL

You can use the hold on command to get these plots on the same graph

Plot them in different colors: Oxford blue ('b'), Cambridge blue ('c') and UCL black ('k')

Use/modify your for loop to get plots for the mean rainfall in each month at Oxford, Cambridge and UCL

Given the above plots, should you really have come to study here?!

The for loop again

Let's look at the for loop in more detail.

You made a for loop that went through the months 1-12 (January-December) and found the mean sunlight hours for each month, and stored them in columns 1-12 of a matrix meanSunlight

This was a special case of a for loop, in which the variable month played three roles:

As a counter variable

we wanted to process 12 months, so we asked the for loop to run for month = 1:12

To tell us which rows to pull out of the input matrix

when month was 1, we searched for rows where oxweather(:,2)==1

when month was 2, we searched for rows where oxweather(:,2)==2

...

when month was 12, we searched for rows where oxweather(:,2)==12

As an index to the output matrix - so

when month was 1, the mean hours of sunlight were stored in meanSunlight(2,1)

when month was 2, the mean hours of sunlight were stored in meanSunlight(2,2)

...

when month was 12, the mean hours of sunlight were stored in meanSunlight(2,12)

To understand the distinction between these roles, first imagine we are only interested in sunlight hours during the punting season (optimistically advertised as March-October by the Cherwell Boathouse)

What would the for loop look like?

?

You may have come up with something like this:

>> for month=3:10;
>>     disp(['month = ' num2str(month)]) % print the value of 'month' to the screen each time I go round the loop
>>     ix = find(oxweather(:,2)==month);
>>     meanSunlight(1, month) = month; % top row of the output table tells us which month is which
>>     meanSunlight(2, month) = nanmean(oxweather(ix,7));
>> end;

But if you take a look at meanSunlight, it should have a bunch of entries that are just zeros, for the non-punting months which your for loop didn't run over. (if it doesn't, you should clear the workspace and then run the for loop again - the data in meanSunlight(:,1:2) must be from a previous exercise).

In general, we probably want our output matrix to contain only the data we are interested in, so we might want to dissociate the counter variable, the input variable, and the output variable.

HELP

How about something like this:

>> clear meanSunlight % start with a fresh output array, so no old values are left in it
>> months=3:10
>> nMonths = length(months)
>>
>> for i=1:nMonths;
>>     disp(['month = ' num2str(months(i)) ', i = ' num2str(i)]) % this is just to help you understand what is going on!
>>     ix = find(oxweather(:,2)==months(i));
>>     meanSunlight(i,1) = months(i); % add a row to the output array, which tells us which month is which
>>     meanSunlight(i,2) = nanmean(oxweather(ix,7));
>> end;

Don't worry if you didn't come up with that first try!

Let's unpack what I did. We have

an input array months which is a list of the punting months we are interested in

an output array meanSunlight with one row per input month and two columns

column 1 is the identity of each input month

column 2 is the mean number of sunlight hours in that month

a counter variable i which tells Matlab how many times to go round the loop, and is used as an index to the input and output matrices

on the first pass through the loop, i==1 and the input month is months(1) ie 3

on the second pass through the loop, i==2 and the input month is months(2) ie 4

...

on the last pass through the loop, i==nMonths ie i==8 and the input month is months(nMonths) ie months(8) ie 10
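
The separation between the input array, the counter and the output array can be seen in miniature on a made-up table. This is a sketch under stated assumptions: the toy values are invented, the table has the month in column 1 and sunshine hours in column 2, and the output is called result. One optional addition that is not in the version above is creating the output array at its final size with zeros before the loop starts, so that no old values can be left over in it.

```matlab
% Toy table: month in column 1, sunshine hours in column 2 (invented values)
toy = [3 100; 4 150; 3 120; 5 200; 4 170];

months  = 3:5;                        % input array: the months we care about
nMonths = length(months);
result  = zeros(nMonths, 2);          % output array, created at its final size

for i = 1:nMonths                     % i is only a counter / index
    ix = find(toy(:,1) == months(i)); % rows belonging to the i-th input month
    result(i,1) = months(i);          % column 1: which month
    result(i,2) = mean(toy(ix,2));    % column 2: mean sunshine hours
end
```

Notice that i runs from 1 to 3 while the months run from 3 to 5. The output array has exactly one row per input month and no empty rows.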

To test whether you understood this, let's try another case. Can you get the total rainfall in each of the years 1981-2014?

?

How about something like this:

>> clear;
>> oxweather = load('oxfordWeather.txt'); % reload the data, because clear has just wiped the workspace
>>
>> years=1981:2014
>> nYears = length(years)
>>
>> for i=1:nYears;
>>     disp(['year = ' num2str(years(i)) ', i = ' num2str(i)]) % this is just to help you understand what is going on!
>>     ix = find(oxweather(:,1)==years(i));
>>     totalRainfall(i,1) = years(i); % add a row to the output array, which tells us which year is which
>>     totalRainfall(i,2) = sum(oxweather(ix,6));
>> end;

Plot totalRainfall and work out which was the rainiest year of my life!
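
As well as reading the rainiest year off the plot, you can ask Matlab for it. The built-in function max can return both the largest value and its position. The sketch below uses a made-up array in the same layout as totalRainfall (year in column 1, total rainfall in column 2). The values and the name toyRainfall are invented for illustration.

```matlab
% Toy output in the same layout as totalRainfall: year, total rainfall in mm
% (invented values)
toyRainfall = [1981 600;
               1982 720;
               1983 580];

[maxRain, row] = max(toyRainfall(:,2));  % largest total and the row it is in
rainiestYear = toyRainfall(row, 1);      % look up the year stored in that row
```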

>>> More for loops in the Getting Started Guide, sections 5-2 to 5-8

>>> More for loops in the Stormy Attaway book chapter 5 (p148-162)

The while loop

In Matlab (and most programming languages) there are two types of loops, for and while.

We will be using mainly the for loop in this course, but for the sake of completeness we will have a quick look at the while loop here.

To illustrate the difference between for and while loops, we will make a script that rolls a 'virtual dice'.

To simulate one dice roll, we can use the random number generator randi to generate a random integer between 1 and 6.

Use help randi to work out how to do this

?

>> clear;
>>
>> randi(6)

Can you make a for loop that rolls the random dice 10,000 times and saves the outcomes into a vector called dicerolls?

?

>> for i=1:10000 % i is the counter variable
>>      dicerolls(i)=randi(6); % the semicolon stops Matlab printing the vector on every pass
>> end

Can you modify this script to roll two dice on each trial instead of one?

?

>> for i=1:10000 % i is the counter variable
>>      dicerolls(i,1)=randi(6); % first dice
>>      dicerolls(i,2)=randi(6); % second dice
>> end

This should result in a matrix called dicerolls with two columns, for the two dice

Plot three histograms for the outcomes for each dice individually, and the sum of the outcomes on the two dice, using the function hist

?

>> figure(1); hist(dicerolls(:,1)); % one dice
>> figure(2); hist(dicerolls(:,2)); % second dice
>> figure(3); hist(dicerolls(:,1)+dicerolls(:,2)); % both dice

Did this look how you expected?
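
If you want something to compare the third histogram against, the theoretical probabilities for the sum of two fair dice can be worked out in a couple of lines. This is a sketch, not part of the original exercise: it uses the built-in function conv, which combines the two probability lists in exactly the way needed for the sum of two independent dice. The names pOne, pSum and expectedCounts are my own.

```matlab
pOne = ones(1,6) / 6;            % fair dice: each face has probability 1/6
pSum = conv(pOne, pOne);         % probabilities of the sums 2, 3, ..., 12
sums = 2:12;                     % the possible totals, in the same order

[pMax, k] = max(pSum);
mostLikelySum = sums(k);         % the total at the peak of the distribution

expectedCounts = 10000 * pSum;   % expected histogram heights for 10,000 trials
```

By default hist uses 10 bins, which do not line up neatly with the 11 possible totals. Passing the bin centres as a second argument, as in hist(dicerolls(:,1)+dicerolls(:,2), 2:12), should give one bar per total.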

Now let's try something different. Say I'd like to simulate rolling the dice until we get a six, and record how many rolls were required. How am I going to do that?

One option is to use a while loop.

?

>> i=1; % i is the counter variable
>> flag=0; % what is this for? look at the loop below
>> while flag==0
>>      thisroll=randi(6)
>>      if thisroll==6
>>          flag=1
>>          nRolls=i
>>      end;
>>      i=i+1; % increment i
>> end

Can you work out what is going on here?

What is flag for?

What is stored in nRolls?

What is different about the counter variable i compared to the for loop?

Let's think in more detail about the differences between the for and while loops.

The difference between for and while loops is how we determine when to stop going round and round the loop.

In the case of the for loop, we have a predefined number of passes through the loop - one for each listed value of the counter variable. So a loop that rolls a virtual "dice" 4 times might look like this:

>> for i=1:4
>>      x(i) = randi(6); % roll a virtual dice by picking a random integer from 1 to 6
>> end

I know that I will have to pass through the for loop 4 times:

On pass 1, i=1
On pass 2, i=2
On pass 3, i=3
On pass 4, i=4
... and then stop.

You can think of the for loop as a kind of list, where everything it needs to do is laid out in advance.

In contrast, a while loop doesn't have a predefined list of iterations. It loops round and round indefinitely until some criterion is met.
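
The stopping criterion does not have to be a separate flag variable. As an alternative sketch (not the version used in the rest of this tutorial), the condition after while can test the latest roll directly. Here thisroll is given a dummy starting value of 0 so that the condition is true on the first pass. That starting value is my own choice for illustration.

```matlab
nRolls   = 0;                % number of rolls so far
thisroll = 0;                % dummy starting value that is not a six

while thisroll ~= 6          % keep going until the latest roll is a six
    thisroll = randi(6);     % roll the virtual dice
    nRolls   = nRolls + 1;   % count this roll
end
```

When the loop stops, nRolls holds the number of rolls that were needed. We cannot know in advance what that number will be, which is exactly why a for loop is awkward here.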

>>> More while loops in the Getting Started Guide, section 5

>>> More while loops in the Stormy Attaway book chapter 5

Exercises

Say I would like to find out the distribution of the number of dicerolls needed to score a 1 for dice with 4,6,8,12 and 20 sides (these are the only dice shapes that are Platonic Solids - see Wikipedia if intrigued!)

Can you modify your while loop to simulate rolling the 4 sided dice until I get a 1?

?

>> i=1; % i is the counter variable
>> flag=0; % what is this for? look at the loop below
>> while flag==0
>>      thisroll=randi(4)
>>      if thisroll==1
>>          flag=1
>>          nRolls=i
>>      end;
>>      i=i+1; % increment i
>> end

Why am I using the number 1 not the number 6 here by the way?

?
Later we are going to look at all the different shaped dice.
All of the different dice have a number 1 on them.
In contrast, it is difficult to roll a 6 with the 4-sided dice!

Now, I would like to find out the distribution of the number of rolls needed to score a 1 on each dice.

To do this I will need to roll each dice until I get a 1 lots of times. Think of this as an experiment with 10000 trials. On each 'trial' I roll the dice until I get a 1, and the outcome of the trial is the number of rolls required.

My while loop runs one trial.

Can you figure out how to make 10000 trials by embedding the while loop in a for loop?

?

>> nSides = 4 % 4-sided dice
>> for t=1:10000 % t is the counter variable for trials
>>     i=1; % i is the counter variable for rolls
>>     flag=0; % what is this for? look at the loop below
>>     while flag==0
>>          thisroll=randi(nSides); % semicolons stop Matlab printing every single roll
>>          if thisroll==1
>>              flag=1;
>>              nRolls(t)=i;
>>          end;
>>          i=i+1; % increment i
>>     end
>> end

Run the whole experiment for each of the Platonic dice and in each case plot a histogram of the number of rolls needed to get a 1.

What is the most common number of rolls needed for each dice?

What do you notice about the distributions for the different dice?
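
Once you have your histograms, you may like to compare them with what probability theory predicts. To need exactly n rolls, you must fail to roll a 1 on the first n-1 rolls and then succeed on roll n, so the probability is (1-p)^(n-1) * p, where p = 1/nSides. The sketch below evaluates this formula. The cut-off at 200 rolls and the variable names are my own assumptions, chosen only to keep the example short.

```matlab
nSides = 4;                       % try 4, 6, 8, 12 and 20
p = 1 / nSides;                   % chance of rolling a 1 on any single roll
n = 1:200;                        % possible numbers of rolls (cut off at 200)

% To need exactly n rolls you must miss n-1 times and then hit once
pN = (1 - p).^(n - 1) * p;

[pMost, mostCommon] = max(pN);    % most probable number of rolls
meanRolls = sum(n .* pN);         % average number of rolls (approximately nSides)
```

Multiplying pN by 10000 gives the expected heights of the histogram bars for an experiment with 10000 trials.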
