16 April 2012
30 April 2012

# Use real data when teaching statistics

In statistical analysis the context of the data is integral, not a story added on afterwards to make it more interesting. It is not like algebra, where “making it real” means you make up a reason for the equation and require the students to give the correct units for the answer. In statistics, the analysis involves understanding what is happening in the data.
For this reason, as much as possible, data must be real.
In a previous incarnation I have been guilty of making up data. I was even quite proud of being able to make sure my fake multivariate data displayed heteroscedasticity and multicollinearity. That was fine for an assessment item, I reasoned at the time, as I wanted to make sure that students could recognise those effects.
I recently reviewed a case which had been submitted for publication. The case story was great, with some interesting soft aspects, based on a real-life scenario. The second part of the case involved analysing data, which was openly fake. I decided to see how I would go, so I downloaded the data and started playing around with it. I found it disturbing that there was an R-squared value of more than 99%. The more I explored, the worse it got, and the more convinced I became that the problem lay in the generation of the data. This would have perplexed students who really wanted to understand what was going on. It is not acceptable to have badly faked data in a case.
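To illustrate how a generation process can produce that kind of implausibly tidy result, here is a minimal sketch. It is not the method used in the case I reviewed, which I do not know. The function names, the slope, the intercept and the noise levels are all assumptions chosen for illustration. The point is only that when the random noise added to a made-up linear relationship is small relative to the spread of the predictor, R-squared ends up very close to 1.

```python
import random


def r_squared(xs, ys):
    """R-squared of a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For simple linear regression, R-squared is the squared correlation.
    return (sxy ** 2) / (sxx * syy)


def fake_linear_data(n, slope, intercept, noise_sd, seed):
    """Hypothetical generator: y = intercept + slope * x + normal noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 100) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


# Same made-up relationship, two assumed noise levels.
too_clean = r_squared(*fake_linear_data(200, 2.0, 10.0, noise_sd=2.0, seed=1))
noisier = r_squared(*fake_linear_data(200, 2.0, 10.0, noise_sd=40.0, seed=1))

print(f"small noise: R-squared = {too_clean:.4f}")
print(f"large noise: R-squared = {noisier:.4f}")
```

With the small noise setting the fitted line explains almost all of the variation. Choosing a noise level that looks believable for a given context takes knowledge of the context, which is part of why faking data well is hard.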

# What is so great about real data?

With appropriate topics, the outcome interests the students. It can cause them to think, and realise that there is a use for statistics. It can be exciting! You can have discussions about why this result might have happened.
An interesting bonus, which you can choose to use or not, is that the data is dirty! (See my post about dirty data.) Students learn that data does not arrive beautifully sanitised like the pristine textbook sets. They meet the problems of real data, so they are better prepared for real data in the real world.

## The failings of fake data

1. Effects may seem really interesting, but they were put there by the instructor (sometimes by mistake), so there is no basis in reality. I see this as rather the equivalent of the movie “The Truman Show”, where a whole world is generated for Truman Burbank with exactly the events needed to make a television series interesting. Sure, you may find a relationship in the data, but only because you put it there in the first place!
2. You can get odd artefacts of the generation process. Some interesting pattern shows up when a student looks at the data a different way from what you expect. This pattern could be just because you didn’t think to get rid of it.
3. Generating good fake data is actually quite tricky to do if you want to get it right.
4. Using fake data reduces the statistical process to the mechanical application of algorithms. Fake data may be better than numeric data with no context, but not by much.

## Sources of real data

The internet abounds with data. We can just about drown in it. This is one source of data, but it is mostly clean, which removes one of the advantages of real data.
However, I prefer to get the data from the students themselves. Each year I have a questionnaire which the students fill out anonymously on-line at the start of the course. I then use this as a source of data for class examples, exercises and testing. Over the years I have found some interesting effects among the data from our students. An important thing to remember is to make sure you have a range of levels of data. It is very easy to collect nominal/categorical data, but it’s not much use for teaching regression. Data suitable for a paired difference of two means can also be difficult to obtain, so you have to think ahead on that one. Here are some example questions for each level of measurement.

## Nominal

• What type of chocolate do you prefer?
• What kind of mobile phone do you own?
• Sex?
• Nationality?
• How did you travel to university today?
• What subject are you majoring in?

## Ordinal

• How useful do you think this course will be in your future career? (Very useful, somewhat useful, not useful)
• How successful have you been in mathematics in the past? (Very successful, somewhat successful, not successful)
• How often do you check Facebook? (More than once a day, about once a day, several times a week, about once a week, less often than once a week.)

## Interval

• How many pairs of trousers do you own?
• What is the most you have ever paid for a pair of trousers?
• What annual income do you expect to be earning in ten years’ time?
• What do you think the average income for the class will be in ten years’ time?
• How many children would you like to have?
• What is the ideal age to get married?
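One way to “think ahead” is to check a draft questionnaire against the topics you plan to teach before you send it out. The sketch below is only an illustration. The variable names, the tagging of questions and the minimum requirements per topic are my assumptions, a rough rule of thumb rather than a definitive list. Nominal and ordinal questions are lumped together as “categorical” for simplicity.

```python
from collections import Counter

# Hypothetical short names for a few of the questions listed above,
# each tagged with its level of measurement.
QUESTIONS = [
    ("chocolate_type", "nominal"),
    ("travel_mode", "nominal"),
    ("course_usefulness", "ordinal"),
    ("trouser_pairs", "interval"),
    ("expected_own_income", "interval"),
    ("expected_class_income", "interval"),
]

# Assumed minimum number of variables of each kind needed per topic.
TOPIC_NEEDS = {
    "chi-squared test of independence": {"categorical": 2},
    "comparison of two means": {"categorical": 1, "interval": 1},
    "regression": {"interval": 2},
}


def topics_supported(questions):
    """Return {topic: True/False} for whether the questionnaire can supply data."""
    levels = Counter(level for _, level in questions)
    available = {
        "categorical": levels["nominal"] + levels["ordinal"],
        "interval": levels["interval"],
    }
    return {
        topic: all(available[kind] >= count for kind, count in needs.items())
        for topic, needs in TOPIC_NEEDS.items()
    }


for topic, ok in topics_supported(QUESTIONS).items():
    print(f"{topic}: {'covered' if ok else 'MISSING DATA'}")
```

A count like this cannot tell you whether two interval questions make a sensible pair. For a paired difference you still need two questions answered by the same student in the same units, such as the two income questions above.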

## Real data in Operations Research

Unfortunately it is more difficult to find real-life problems in OR which can be solved in the classroom. One possible approach is to start with a real-life case, and then provide a cut-down version for the students to work on. When we make up exercises for OR, we search the web to make sure that the figures used are realistic estimations of real costs.
In a lesson on Multi-criteria Decision Making we had the case of locating a landfill. This was especially pertinent as our city had recently gone through the political process to set up a new landfill. A helpful website gave ballpark figures on costs for many of the aspects. With the internet at our fingertips there is no excuse for unrealistic figures.
There is work involved in collecting real data, but if we want students to accept that statistics and operations research are relevant, it must be done.

##### Dr Nic

6. mpledger says:

From my experience, if you gave real data to students then you’d spend 80% of the course cleaning it up. While worthwhile to do once as a learning experience, it’s not something you’d want to do again and again and again – it just takes time from teaching what you really want.

• Dr Nic says:

True, but there is a difference between cleaned data and fake data.
