In statistical analysis the context of the data is integral, not a story added on afterwards to make it more interesting. It is not like algebra, where “making it real” means making up a reason for the equation and requiring the students to give the correct units for the answer. In statistics the analysis involves understanding what is happening in the data.
For this reason, as much as possible, data must be real.
In a previous incarnation I was guilty of making up data. I was even quite proud of being able to make sure my fake multivariate data displayed heteroscedasticity and multicollinearity. That was fine for an assessment item, I reasoned at the time, as I wanted to make sure that students could recognise those effects.
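For concreteness, here is a minimal sketch of how fake data with both of those effects can be manufactured. This is my own illustration, not the method I used back then; the variable names and parameters are all invented.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Multicollinearity: x2 is almost a linear function of x1, plus a little noise,
# so the two predictors are very highly correlated.
x1 = rng.uniform(0, 10, n)
x2 = 2 * x1 + rng.normal(0, 0.5, n)

# Heteroscedasticity: the error standard deviation grows with x1,
# so residuals fan out across the range of the predictor.
errors = rng.normal(0, 1, n) * (0.5 + x1)
y = 3 + 1.5 * x1 - 0.8 * x2 + errors

# The symptom students should spot: x1 and x2 are nearly collinear.
print(np.corrcoef(x1, x2)[0, 1])  # close to 1
```

Students can then diagnose the effects with a correlation matrix and a residual-versus-predictor plot.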
I recently reviewed a case that had been submitted for publication. The case story was great, with some interesting soft aspects, based on a real-life scenario. The second part of the case, however, involved analysing data that was openly fake. I decided to see how I would go, downloaded the data and started playing around with it. It was disturbing to find an R-squared value of more than 99%. The more I explored, the worse it got, and the more convinced I became that the problem lay in how the data had been generated. This would have perplexed students who really wanted to understand what was going on. It is not acceptable to have badly faked data in a case.
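The post does not say which tool I used for that exploration, but the basic check is simple to reproduce. Here is a sketch of computing R-squared for a simple linear regression with NumPy, run on stand-in data (invented for this illustration) that has only a token amount of noise, which is the hallmark of badly faked data.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a simple linear regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Stand-in "faked" data: a formula with a sliver of noise added.
# Real observational data almost never fits a line this well.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 50)
y = 4.2 * x + 7 + rng.normal(0, 0.5, 50)

print(f"R-squared: {r_squared(x, y):.4f}")  # implausibly close to 1
```

An R-squared this high on supposedly observational data is a red flag worth chasing.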
With appropriate topics, the outcome interests the students. It can cause them to think, and realise that there is a use for statistics. It can be exciting! You can have discussions about why this result might have happened.
An interesting bonus, which you can choose to use or not, is that the data is dirty! (See my post about dirty data.) Students learn that data does not arrive beautifully sanitised like the pristine textbook sets. They meet the problems of real data, so they are better prepared for the real world.
The internet abounds with data. We can just about drown in it. This is one source of data, but it is mostly clean, which removes one of the advantages of real data.
However, I prefer to get the data from the students themselves. Each year I have a questionnaire which the students fill out anonymously online at the start of the course. Then I use this as a source of data for class examples, exercises and testing. Over the years I have found some interesting effects among the data from our students. An important thing to remember is to make sure you have a range of levels of data. It is very easy to collect nominal/categorical data, but it’s not much use for teaching regression. Paired difference of two means can also be difficult, so you have to think ahead on that one. Here are some example questions for each level of measurement.
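The original list of example questions is not reproduced here, but as an illustration (these are my own hypothetical questions, not the post's list), the mapping from question to measurement level, and to the analyses each level supports, might look like:

```python
# Hypothetical questionnaire items, grouped by the level of measurement
# each one yields and the analyses that level supports.
question_levels = {
    "nominal": {
        "question": "What colour are your eyes?",
        "analyses": ["frequency tables", "chi-squared tests"],
    },
    "ordinal": {
        "question": "How much do you enjoy statistics? (1-5 scale)",
        "analyses": ["medians", "non-parametric tests"],
    },
    "interval/ratio": {
        "question": "How tall are you, in centimetres?",
        "analyses": ["means", "t-tests", "regression"],
    },
}

for level, info in question_levels.items():
    print(f"{level}: {info['question']} -> {', '.join(info['analyses'])}")
```

The point of the structure is the planning step: if no question yields interval/ratio data, there is nothing to run a regression on later in the course.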
Unfortunately it is more difficult to find real-life problems in OR which can be solved in the classroom. One possible approach is to start with a real-life case, and then provide a cut-down version for the students to work on. When we make up exercises for OR, we search the web to make sure that the figures used are realistic estimates of actual costs.
In a lesson on Multi-criteria Decision Making we had the case of locating a landfill. This was especially pertinent as our city had recently gone through the political process to set up a new landfill. A helpful website gave ballpark figures on costs for many of the aspects. With the internet at our fingertips there is no excuse for unrealistic figures.
There is work involved in collecting real data, but if we want students to accept that statistics and operations research are relevant, it must be done.