I have a video channel with about 40 videos about statistics, and I love watching to see which videos get the most views each day. As the Fall term has recently started in the northern hemisphere, the most popular video over the last month is “Types of Data: Nominal, Ordinal, Interval/Ratio.” Similarly, one of the most consistently viewed posts in this blog is one I wrote over a year ago, entitled “Oh Ordinal Data, what do we do with you?”. Understanding the different levels of data, and what we do with them, is obviously an important introductory topic in many statistics courses. In this post I’m going to look at why this is, as it may prove useful to learner and teacher alike.
And I’m happy to announce the launch of our new Snack-size course: Types of Data. For $2.50US, anyone can sign up and get access to video, notes, quizzes and activities that will help them, in about an hour, gain a thorough understanding of types of data.
Data is essential to statistical analysis. Without data there is no investigative process. Data can be generated through experiments, through observational studies, or dug out from historic sources. I get quite excited at the thought of the wonderful insights that good statistical analysis can produce, and the stories it can tell. A new database to play with is like Christmas morning!
But not all data is the same. We need to categorise the data to decide what to do with it for analysis, and which graphs are most appropriate. There are many good and not-so-good statistical tools available, thanks to the wonders of computer power, but they need to be driven by someone with some idea of what is sensible or meaningful.
A video that becomes popular later in the semester is entitled, “Choosing the test”. This video gives a procedure for deciding which of seven common statistical tests is most appropriate for a given analysis. It lists three things to think about – the level of data, the number of samples, and the purpose of the analysis. We developed this procedure over several years with introductory quantitative methods students. A more sophisticated approach may be necessary at higher levels, but for a terminal course in statistics, this helped students to put their new learning into a structure. Being able to discern what level of data is involved is pivotal to deciding on the appropriate test.
In many textbooks and courses, the types of data are split into two – categorical and measurement. Most state that nominal and ordinal data are categorical. With categorical data we can only count the responses to a category, rather than collect up values that are measurements or counts themselves. Examples of categorical data are colour of car, ethnicity, choice of vegetable, or type of chocolate.
With nominal data, we report frequencies or percentages, and display our data with a bar chart, or occasionally a pie chart. We can’t find a mean of nominal data. However, if the different responses are coded as numbers for ease of use in a database, it is technically possible to calculate the mean and standard deviation of those numbers. A novice analyst may do so and produce nonsense output, because the numbers are only labels and a different coding would give a different "mean".
The very first data most children will deal with is nominal data. They collect counts of objects and draw pictograms or bar charts of them. They ask questions such as “How many children have a cat at home?” or “Do more boys than girls like Lego as their favourite toy?” In each of these cases the data is nominal, probably collected by a survey asking questions like “What pets do you have?” and “What is your favourite toy?”
Another category of data is ordinal, and this is the one that causes the most problems in understanding. My blog post on ordinal data, mentioned above, discusses this. Ordinal data has order, and numbers assigned to responses are meaningful, in that each level is “more” than the previous level. We are frequently exposed to ordinal data in opinion polls, asking whether we strongly disagree, disagree, agree or strongly agree with something. It would be acceptable to put the responses in the opposite order, but it would be confusing to list them in alphabetical order: agree, disagree, strongly agree, strongly disagree. What stops ordinal data from being measurement data is that we can’t be sure how far apart the different levels on the scale are. Sometimes it is obvious that we can’t tell how far apart they are. An example of this might be the scale assigned by a movie reviewer. It is clear that a 4 star movie is better than a 3 star movie, but we can’t say how much better. Other times, when a scale is well defined and the circumstances are right, ordinal data is appropriately, but cautiously, treated as interval data.
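As a rough illustration of using the order without assuming equal spacing, the sketch below summarises some made-up opinion poll responses with a median category. The responses are hypothetical, and the median is only one reasonable choice of summary, not the only one. It also shows what alphabetical ordering does to the scale.

```python
from statistics import median_low

# The scale in its meaningful order, from least to most agreement.
SCALE = ["strongly disagree", "disagree", "agree", "strongly agree"]
RANK = {label: position for position, label in enumerate(SCALE)}

# Hypothetical poll responses.
responses = ["agree", "strongly agree", "disagree", "agree",
             "strongly disagree", "agree", "disagree"]

# The ranks record order only; the gaps between them are not assumed equal.
ranks = [RANK[r] for r in responses]

# median_low always returns a value that occurs in the data,
# so it maps back to a real category.
median_label = SCALE[median_low(ranks)]

# Alphabetical order destroys the meaning of the scale.
alphabetical = sorted(SCALE)

print(median_label)
print(alphabetical)
```

Reversing `SCALE` would give the same median category, which matches the point that the opposite order is acceptable while an alphabetical one is not.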
The most versatile data is measurement data, which can be split into interval or ratio, depending on whether ratios of numbers have meaning. For example, temperature in degrees Celsius or Fahrenheit is interval data, as the zero point is arbitrary and it makes no sense to say that 70 degrees is twice as hot as 35 degrees. Weight, on the other hand, is ratio data, as it is true to say that 70 kg is twice as heavy as 35 kg.
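One way to check whether ratios have meaning is to change the units and see whether the ratio survives. The sketch below assumes the 70 and 35 degrees are in Fahrenheit (the choice of scale is my assumption for the example) and converts them to Celsius, then does the same for weight in kilograms and pounds.

```python
def fahrenheit_to_celsius(degrees_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (degrees_f - 32) * 5 / 9

def kg_to_pounds(kg):
    """Convert a weight from kilograms to pounds (approximate factor)."""
    return kg * 2.20462

# Interval data: the ratio depends on the units, so it has no meaning.
temp_ratio_f = 70 / 35
temp_ratio_c = fahrenheit_to_celsius(70) / fahrenheit_to_celsius(35)

# Ratio data: the ratio is the same whatever the units.
weight_ratio_kg = 70 / 35
weight_ratio_lb = kg_to_pounds(70) / kg_to_pounds(35)

print(temp_ratio_f, round(temp_ratio_c, 2))
print(weight_ratio_kg, round(weight_ratio_lb, 2))
```

The same two temperatures are "twice as hot" in one scale and more than twelve times as hot in the other, while the heavier weight is twice the lighter one in both units.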
A more useful way to split up measurement data, for statistical analysis purposes, is into discrete or continuous data. I had always explained that discrete data was counts, and recorded as whole numbers, and that continuous data was measurements, and could take any values within a range. This definition works to a certain degree, but I recently found a better way of looking at it in the textbook published by Wiley, Chance Encounters, by Wild and Seber.
“In analyzing data, the main criterion for deciding whether to treat a variable as discrete or continuous is whether the data on that variable contains a large number of different values that are seldom repeated or a relatively small number of distinct values that keep reappearing. Variables with few repeated values are treated as continuous. Variables with many repeated values are treated as discrete.”
An example of this is the price of apps in the App store. There are only about twenty prices that can be charged: 0.99, 1.99, 2.99 and so on. These are neither whole numbers nor counts, but as you cannot have a price in between the given numbers, and there is only a small number of possibilities, this is best treated as discrete data. Conversely, the number of people attending a rock concert is a count, and you cannot get fractions of people. However, as there is a wide range of possible values, and it is unlikely that you will get exactly the same number of people at more than one concert, this data is best treated as continuous.
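The criterion in the quotation can be sketched as a simple check on how often values repeat. The function name, the cut-off of 0.5 and both sets of numbers below are my own illustrative assumptions, not something given by Wild and Seber; in practice the decision is a judgement call rather than a fixed rule.

```python
def treat_as(values, max_distinct_share=0.5):
    """Suggest 'discrete' or 'continuous' from how often values repeat.

    max_distinct_share is a hypothetical cut-off: if distinct values make up
    no more than this share of the observations, values keep reappearing,
    so the variable is treated as discrete.
    """
    distinct_share = len(set(values)) / len(values)
    return "discrete" if distinct_share <= max_distinct_share else "continuous"

# Hypothetical app prices: few distinct values, repeated often.
app_prices = [0.99, 1.99, 0.99, 2.99, 0.99, 1.99, 4.99, 0.99, 2.99, 1.99]

# Hypothetical concert attendances: counts, but seldom repeated.
concert_attendance = [18234, 20417, 17950, 22108, 19563, 21077, 18802, 20391]

print(treat_as(app_prices))
print(treat_as(concert_attendance))
```

With a small sample almost every value will be distinct, so a check like this is only a prompt to look at the data, not a substitute for knowing what the variable is.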
Maybe I need to redo my video now, in light of this!
And please take a look at our new course. If you are an instructor, you might like to recommend it for your students.