Messy girl

Assume that all data collected from people is dirty

Dirty data is real data as it was collected, before someone gets hold of it and takes out the tricky bits. You won’t find dirty data in textbooks. Dirty data is what real researchers have to deal with, and even amateur researchers and students doing real-life projects will have to deal with it. Yet not much is said about dirty data, or about what to do with it.

Elements of dirty data

Mistakes – people put down the current year for their date of birth, give their weight in the wrong unit, or put in an extra decimal point.
Missing data – people leave gaps, possibly by mistake and possibly intentionally, or give up before the end.
Mindless response – people just tick all the middle responses to Likert scales, or answer “no” to everything.
Silly answers – people state that they expect to earn over 1 billion dollars next year, say they want to have 28 children, or claim to be 105 years old and weigh 500 pounds.
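As a rough sketch of how each of these four elements might be flagged in Python: the records, the field names, the survey year of 2012 and the cut-off of 15 children are all invented for illustration. A flag only means "look at this one", not "delete it".

```python
# Hypothetical survey records; every field name and value is made up.
records = [
    {"id": 1, "birth_year": 1991, "weight_kg": 68.0, "children_wanted": 2, "likert": [4, 2, 5, 3, 1]},
    {"id": 2, "birth_year": 2012, "weight_kg": 72.5, "children_wanted": 1, "likert": [2, 4, 4, 1, 5]},
    {"id": 3, "birth_year": 1989, "weight_kg": None, "children_wanted": 3, "likert": [3, 3, 3, 3, 3]},
    {"id": 4, "birth_year": 1993, "weight_kg": 81.0, "children_wanted": 28, "likert": [5, 1, 2, 4, 3]},
]


def flag_record(rec, this_year):
    """Return a list of reasons why a record deserves a second look."""
    flags = []

    # Mistake: the current year entered as the year of birth.
    if rec["birth_year"] is not None and rec["birth_year"] >= this_year:
        flags.append("mistake: birth year is the survey year or later")

    # Missing data: any field left empty.
    missing = [key for key, value in rec.items() if value is None]
    if missing:
        flags.append("missing: " + ", ".join(missing))

    # Mindless response: the same answer to every Likert item.
    likert = rec.get("likert") or []
    if len(likert) > 1 and len(set(likert)) == 1:
        flags.append("mindless: same answer to every Likert item")

    # Silly answer: the cut-off of 15 is an arbitrary assumption.
    if rec["children_wanted"] is not None and rec["children_wanted"] > 15:
        flags.append("silly: implausible number of children")

    return flags


for rec in records:
    print(rec["id"], flag_record(rec, this_year=2012))
```

Record 1 comes back with no flags. Record 2 is flagged as a mistake, record 3 as both missing and mindless, and record 4 as silly.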

Detecting dirty data

First of all, assume your data is dirty, particularly if humans have been involved, and even more so if students have been involved. To find the problem areas, make tables, graphs and summary statistics of all the variables, and look for outliers. Look for patterns in the missing values. Look at the highest and lowest values. Scatter plots are also good for identifying anomalous data.
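A minimal sketch of those checks for a single variable, using only the Python standard library. The heights are invented, and the 1.5 × IQR rule used to flag outliers is one common convention, swapped in here as an assumption rather than the only way to do it.

```python
import statistics

# Invented heights in cm. None marks a gap; 1.75 looks like metres
# entered by mistake, and 1700 looks like millimetres.
heights_cm = [172, 165, None, 180, 1.75, 158, 169, 175, None, 1700, 163, 177]


def summarise(values):
    """Count gaps, show the extremes, and flag values outside the IQR fences."""
    present = sorted(v for v in values if v is not None)
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "lowest": present[:3],
        "highest": present[-3:],
        "median": median,
        "outliers": [v for v in present if v < low_fence or v > high_fence],
    }


summary = summarise(heights_cm)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With this made-up data the summary reports 2 missing values and flags 1.75 and 1700 as outliers. Both also show up straight away in the lowest and highest values, which is why looking at the extremes is worth doing even without a formal rule.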

Dealing with dirty data

Well – this is where mathematics and statistics part company. There is no single right answer. It all depends! (Students hate that phrase.) Sometimes you should take the response out. Sometimes you should make it a missing value. Sometimes you should correct it. Sometimes you should remove a complete record or observation. You should always document and justify your decisions, and be aware of any possible implications. There is a fine line between cleaning data and massaging it into something that will give the results you are seeking. Some actions are indefensible.
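One way to make the documenting part hard to skip is to have the cleaning step write its own log. The sketch below is an illustration only: the records, field names and thresholds are invented, and each rule is a judgement call that would need justifying for a real data set.

```python
# Invented records; the rules below are examples, not recommendations.
raw = [
    {"id": 1, "height_cm": 172.0, "age": 21},
    {"id": 2, "height_cm": 1.75, "age": 19},    # probably metres, not centimetres
    {"id": 3, "height_cm": 168.0, "age": 105},  # implausible for this class
    {"id": 4, "height_cm": None, "age": None},  # respondent gave up
]


def clean(records):
    """Return cleaned copies of the records plus a log of every decision."""
    cleaned, log = [], []
    for original in records:
        rec = dict(original)  # work on a copy so the raw data is never altered

        # Remove a complete record when nothing usable was answered.
        if rec["height_cm"] is None and rec["age"] is None:
            log.append((rec["id"], "removed record", "no usable answers"))
            continue

        # Correct a value when the intended answer is reasonably clear.
        if rec["height_cm"] is not None and 1.0 <= rec["height_cm"] <= 2.5:
            reason = f"{rec['height_cm']} read as metres, multiplied by 100"
            log.append((rec["id"], "corrected height_cm", reason))
            rec["height_cm"] = rec["height_cm"] * 100

        # Make it a missing value when it cannot be trusted or repaired.
        if rec["age"] is not None and rec["age"] > 100:
            log.append((rec["id"], "set age to missing", f"{rec['age']} judged implausible"))
            rec["age"] = None

        cleaned.append(rec)
    return cleaned, log


cleaned, log = clean(raw)
for entry in log:
    print(entry)
```

The raw list is left untouched, so anyone who disagrees with a decision can see from the log exactly what was changed and redo it differently.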

Teaching with dirty data

If students do their own projects they will need to deal with dirty data. It is a wonderful opportunity to make them suffer, er, help them learn. Don’t give them the answers, but get them to make the judgment calls – that’s what real researchers have to do.
However, not all statistics courses include student projects. (Our first-year course doesn’t, for reasons I will cover in a later post.) I do give postgraduate business students a set of data as it was collected, raw from the students. Part of their assignment is to clean it up before they start, and provide a report on what they have done and why.
For the introductory course for undergraduate students I clean up the data, so that the missing and spurious values don’t injure their fragile confidence. The point in this particular instance is to practise multiple examples of different types of testing, in order to generalise the principles of hypothesis testing. Excel, which I use with reservations (another later post), doesn’t cope well with missing values and would put up barriers too early in their learning. Whether the data is given to them clean or dirty depends on the learning objective of the exercise – and the nature of the students.
I would like to get them using the original data, but the course is not quite long enough. I’m still mulling over that one. Having written this post, I am convinced I need to do something about it. I’ll get back to you.


  4. An easy way to get dirty data (I do it every year) is to use a Google Form to collect class data. I often ask for height, weight, sex, number of Facebook friends, monthly expenses on tobacco, alcohol and phone, etc. The number of mistakes and outrageous values is quite funny, and useful for showing how one can detect and deal with problems. Then we proceed to use the data in linear regressions.
