# A dearth of raw data

27 May 2013
10 June 2013

The desired outcome of this post is to be proved wrong.
Here is my assertion: It is really difficult to find appropriate sets of data to use for teaching and assessing statistical analysis.
This is a problem; one of the key factors in teaching statistics effectively is to use real data. I have written about the need for real data (not faked) in my post Stop faking it, data should be real. I’d like to apologise here and now for my arrogant assertion that “The internet abounds with data. We can just about drown in it.” I feel like the ancient mariner staring at the data abounding, with no drop fit to drink, let alone drown in.
Recently a teacher contacted me to help her find a set of data for an assessment task in Year 13 statistics. The data set needs to have the following characteristics:

• It must be real
• A sample (not a population)
• Multivariate so that the students have a choice of variables to model
• Have at least one variable of interval/ratio data
• Have at least one way of dividing the sample into two groups
• It should not be a set that has previously been used for assessment in the public domain in New Zealand.
• It should be of interest to the students
• It should be open to background research
• Ideally it should be randomly sampled
• It should preferably be from New Zealand (Australia is near enough), and not too old.

How hard could that be? (I joke, of course – it is very hard.)
I fancy I am pretty good at ferreting things out on the internet, but though I found wonderful sites with lots of sets of data, I could not find one set to fit the criteria. And the problem is, this will need to happen every year in every school in New Zealand, often more than once.
This is not a unique problem, I suspect. When I taught at university I was challenged to come up with appropriate data sets each year for assessment exercises. Consequently we would sometimes rotate data sets in a three-year cycle, or (oh the shame) make fake data.
All over the world people are collecting data and doing analysis. Why is it so difficult to find raw data?
One issue is that of privacy – in New Zealand we have strict laws with regard to privacy and informed consent, which means that it is easier to keep the data hidden rather than try to anonymise it for general consumption. Surely that is not the case in non-human research, though. It takes a bit of work to make data available, and academics and researchers do not have time to spare. Some data is commercially sensitive, forbidding its release to the public domain. Often what look like promising data sets are not at a unit level, but are summarised into tables for the reader.
I went searching for links to data sets, and found the following. So I guess there is data out there, but it is time-consuming to find appropriate sets. And very little of it relates to NZ, sadly. And baseball, basketball and medical sets abound.
http://www.statsci.org/datasets.html looks promising, and I am grateful for the efforts. However very few of the sets meet the criteria.
http://www.amstat.org/publications/jse/jse_data_archive.htm This one has the most informative layout, in terms of finding out whether a data set is likely to be useful.
So in a way I have proved myself wrong already. There are datasets out there. But it is difficult to find one that is just right! I feel for teachers having to trawl through so many sites to find something, though. I had hoped that there would be sets of data along with PhD dissertations, but even in the area of statistics education, I couldn’t find any.
I don’t have an answer to this problem. As a uni lecturer I solved it for my own class by collecting data from them, pretending that it was a random sample of first year university students, and giving it back to them to play with. Obviously not ideal, but fun!

##### Dr Nic

1. Peter Lane says:

When I worked for the pharma company GSK (up to last year) I was on a team looking at “data transparency”, considering whether and how to release raw data from clinical trials for wider use. It is happening now, but mostly to organizations like the Cochrane Collaboration who can provide a rationale and plan for using the data for serious research. The main problems are patient consent and anonymization: both are time-consuming and hence expensive operations.
I haven’t looked for illustrative data for some years, but two potentially rich types of source occur to me. One is software: most of the stats packages have worked examples using real data, provided either with the software or on a website or both. Of course, the examples are often old, and you see the same old favourites being used again and again. Perhaps a better bet is the websites associated with recently published books. I recently contributed two chapters to a Springer book on graphics, “A Picture is Worth a Thousand Tables: Graphics in Life Sciences”. The data for some of the chapters (I think almost all from real examples) is on the companion website: http://www.elmo.ch/doc/life-science-graphics/, and I hope more will follow. You should find the same system for many recently published books, I think. So I guess a useful resource for teachers would be a portal that listed books and their associated websites.

• Dr Nic says:

Thanks. I feel a bit reluctant to use data from a book I am not using, or haven’t bought. What do you think about that?

2. Geoffrey Brent says:

I sympathise! A while back I was co-writing a training course for new graduates at ABS. We have lots of unit-level data from surveys and Census, but access is managed on a need-to-know basis and “training purposes” wouldn’t have been adequate justification.
I ended up using a complicated model to generate a fake data set with no confidentiality requirements.
Using fake data isn’t all bad. Because you have access to the generating model, you can compare it to the analysis results and get a feel for the limitations of these methods. (On at least one occasion, this helped me realise that I was misinterpreting the program outputs!) But it would be nice to have more real-life data to look at.
Some possible sources:
– Political statistics, matched to demographics of their regions (cf FiveThirtyEight.com)
– Sporting stats
– Country-level social/economic data, e.g. CIA World Factbook
All of these would require the user to do their own random sampling, but it’s a start.

3. Ian Barnes says:

One approach is to simulate data to fit the results of published studies, thereby maintaining practical interest in real-world topics and avoiding privacy issues. This is the approach used by Glantz in Primer of Biostatistics.
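A minimal Python sketch of the general idea, not necessarily the method Glantz uses: draw random values, then shift and rescale them so the sample mean and standard deviation match the figures reported in a study. The function name and the numbers below are hypothetical, chosen only for illustration.

```python
import random
import statistics

def simulate_to_match(n, target_mean, target_sd, seed=None):
    """Return n simulated values whose sample mean and SD equal the targets."""
    rng = random.Random(seed)
    # Draw from a standard normal; the shape is an assumption, not a given.
    raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = statistics.mean(raw)
    s = statistics.stdev(raw)
    # Standardise the draws, then rescale to the published summary values.
    return [target_mean + target_sd * (x - m) / s for x in raw]

# Illustrative (made-up) summary values, not taken from any real study.
heights = simulate_to_match(n=30, target_mean=172.0, target_sd=8.5, seed=1)
print(round(statistics.mean(heights), 2), round(statistics.stdev(heights), 2))
```

Because the values are rescaled after drawing, the summary statistics match the targets up to rounding, while the individual values remain invented.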

4. Don Shearman says:

A couple that you might be interested in:
https://data.qld.gov.au/ Data sets from the Queensland government
http://www.kdnuggets.com/datasets/ A site containing links to a wide range of data sources (but mostly non Australian/NZ)
http://www.guardian.co.uk/technology/page/2009/jun/17/1 Articles from the Guardian (UK) newspaper that also include links to the data sets used for the articles.

5. I sympathise with you as a fellow trainer but I am fortunate in having a consultancy business as well which gives access to a lot of data from clients. Commercial confidentiality is an issue but there are ways to get around that. Obviously, you still need permission from the client but if you show them that it is the nature of the data itself that you are interested in rather than the client, then I think the following options can be offered to them. In some cases, companies will be willing to be identified if there is a benefit to them in doing so. I myself am working with a group in the UK on putting together data sets for 6th form students and my data set is from a dating site client of mine.
The most obvious way is not to give the company name and say something like “a company in the clothing industry”, etc. If that is not sufficient, then transplant the data into another industry or generalise the industry e.g. “a public sector organisation”, “a London based financial company”, etc.
If that is not sufficient, then change the names of the variables and/or the objects. Alternatively rescale the data whilst keeping the same structure. Often, it is the shape of the data that offers the greatest learnings rather than where it came from.
Two other issues with real data are firstly, the data may be messy and need a lot of cleaning. Second, the volume may be too much for the students to deal with. Of course, if these are part of your objectives then that is less of an issue.
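The renaming and rescaling idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the column names, the scale factor and the two rows are made up, and `disguise` is a hypothetical helper rather than part of any library.

```python
def disguise(rows, rename, scale):
    """Rename columns and multiply selected numeric columns by a constant."""
    out = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            factor = scale.get(key)
            # Multiplying by a positive constant keeps the shape of the
            # distribution and the correlations between variables.
            new_value = value * factor if factor is not None else value
            new_row[rename.get(key, key)] = new_value
        out.append(new_row)
    return out

# Made-up rows standing in for client data.
rows = [
    {"customer_spend": 120.0, "visits": 4},
    {"customer_spend": 300.0, "visits": 10},
]
masked = disguise(
    rows,
    rename={"customer_spend": "amount", "visits": "count"},
    scale={"customer_spend": 0.5},
)
print(masked)
```

Whether this is enough to protect a client is a judgment call; a simple rescaling can be reversed by anyone who knows one true value.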

6. At Massey University some of us have been trying to address the lack of publicly available data relating to business activity in New Zealand. We have set up a Data and Story library at http://bizstats.massey.ac.nz based on the CMU and OzDASL sites. It currently contains about 25 datasets, but much more is needed – we welcome submissions of any datasets!
We have encountered some of the problems that you raised in your post, namely: lack of availability due to issues of confidentiality and ownership; data only available in summarised form; and a lack of cross-sectional data (as opposed to time series data which is relatively plentiful). The original driver for the project was the lack of publicly available NZ data for use in teaching first year university Business Statistics courses, but the developments at year 13 NCEA will dramatically increase the demand for this sort of data.
Perhaps this is an issue which the Education section of the NZ Statistical Association might wish to take up?

• Dr Nic says:

Hi Howard
Yes I think this is going to be more of an issue now. It is a bit much to expect teachers to find suitable data sets for assessment on top of teaching and learning the material themselves. Your site has helpful descriptions. There needs to be some serious curation as well for any database, which takes a lot of time and expertise.

7. Carmel Woods says:

Statistics New Zealand has a number of synthetic unit-record files available for use in schools. I haven’t used them myself, so not sure if they fit all your criteria, but no doubt worth a look. http://www.stats.govt.nz/tools_and_services/services/schools_corner/SURF%20for%20schools.aspx

8. Richard says:

you could try looking here: http://researchdata.ands.org.au/
R

9. nshephard says:

Have a look at the Statistics Sub-Reddit, which lists a load of data sources on the right-hand side of the page, see http://www.reddit.com/r/statistics
There is even the Datasets Sub-Reddit linked from there (http://www.reddit.com/r/datasets/).
As mentioned, R comes with a wealth of sample datasets, e.g. http://www.ats.ucla.edu/stat/r/faq/data_sets_avaiable_R.htm, but there is more than just the data that comes with the MASS package.

10. Susan Hurring says:

Can you help…!!

I was thrilled to see the link to electricity usage and am busy seeing if I can use it to create an assessment for my 3.8 assessment. My problem is that whenever I try to import the csv file (for this or any other file I have saved) into iNZight, it tells me there is a hole in my data. Any suggestions?

• Dr Nic says:

iNZight is notoriously picky about its data. Copy all the data including the headings and no extra rows into a clean spreadsheet. Make sure it is in Sheet1. Save as .csv. Then try loading it. It seems that if there has been anything below the data at any time the program perceives it to be gaps in the data.
If that doesn’t work, put a question on the NZ Stats 3 site and I’ll see what I can do – or email me the file and I’ll whip it into submission.
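If the trouble really is stray blank rows, the clean-up can also be done outside the spreadsheet. The Python sketch below drops every row whose cells are all empty; it assumes that is the cause of the import error, which may not be the whole story, and the sample text is made up.

```python
import csv
import io

def strip_empty_rows(csv_text):
    """Return CSV text with every all-blank row removed."""
    reader = csv.reader(io.StringIO(csv_text))
    # Keep a row only if at least one cell has non-whitespace content.
    kept = [row for row in reader if any(cell.strip() for cell in row)]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue()

# Made-up example: blank rows sit between and after the data rows.
messy = "month,kwh\nJan,310\n,\n\nFeb,295\n,\n"
clean = strip_empty_rows(messy)
print(clean)
```

Note that this also removes blank rows in the middle of the data, so check that such rows are not genuine missing records before relying on it.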

11. Mark P says:

“Of interest to students” is a ridiculous restriction when applied to senior students.
And why does it need to be from NZ or Australia? Are we now trying to be insular? I take the opposite line: it’s a big world out there, let’s have a look at it.
The data needs to be comprehensible and in a context they can relate to, sure. But being local and of direct interest limits the field without any pedagogical benefits. In fact they can be much more dispassionate about data of no intrinsic interest, instead of going down the rabbit holes of irrelevant comments based on what they think is interesting, rather than relevant.
I have used the method suggested above of expanding real world data from a study to solve the problems. Recently from an experiment about applying fertiliser and water to tree growth I regenerated data to give a “population” that could be sampled. Of all the problems the students faced, none have ever mentioned that they were concerned that the trees were grown in the US, or that they had no natural interest in trees.

12. Stephen George says:

My primary need of raw data is for student projects. For several years, I kept things simple and provided my students data from a CD-ROM that accompanied the textbook Workshop Statistics (http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP001989.html), which isn’t our primary text but is a great source for “real world” problems and datasets. This past year I elected to have my students go out and find their own data on the internet — assuming, as you did, that they would have no problem finding good sample data. The main problem I encountered is that the bulk of data out there is population data, not sample data.
As Geoffrey Brent implied (I think), a workaround is to have students download population data and then randomly sample from that dataset — I showed my students a way of doing this using VLOOKUP in Excel. Obviously, this isn’t a suitable practice if you want your students to do actual, rigorous research; if the goal is simply to have them apply the statistical procedures they have learned, this approach has the benefit of allowing students to compare and evaluate how well their sampling methods reflect the population reality (it also gets them to dig a bit deeper into Excel than just displaying data and calculating descriptive statistics).
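For anyone not using Excel, the same workaround can be sketched in Python rather than with VLOOKUP. This is a rough illustration under assumptions: the "population" here is a made-up list of records, and `simple_random_sample` is a hypothetical helper wrapping the standard library's `random.sample`.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n records without replacement from a list of records."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, n)

# Made-up population of 1000 records standing in for a downloaded data set.
population = [{"id": i, "value": i % 17} for i in range(1, 1001)]
sample = simple_random_sample(population, 50, seed=42)

# Compare the sample estimate with the known population value.
pop_mean = sum(r["value"] for r in population) / len(population)
sample_mean = sum(r["value"] for r in sample) / len(sample)
print(f"population mean {pop_mean:.2f}, sample mean {sample_mean:.2f}")
```

Changing the seed gives each student a different sample from the same population, which makes the comparison between sample and population more interesting in class.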

13. 1. DIY: get students to collect data in a survey – recommend surveygizmo.com, free to students, e.g. expenditure/month by: employment (1st year employed); kind (actual, desired); category (food, clothes, entertainment, transport, etc)
2. ATHLETICS: marathon by age, sex, event http://tiktok.biz/list/christchurchmarathon/2013/42r/
3. Rosling’s Gapminder data sets: one can search for NZ