An important part of statistical analysis is being able to look at graphical representations of data, extract meaning and comment on it, particularly in relation to the context. Graph interpretation is a difficult skill to teach, as there is no clear algorithm of the kind mathematics teachers are used to teaching, and the answers are far from clear-cut.
This post is about the challenges of teaching scatterplot interpretation, with some suggestions.
When undertaking an investigation of bivariate measurement data, a scatterplot is the graph to use. On a scatterplot we can see what shape the data seems to have, what direction the relationship goes in, how close the points are to a fitted line, whether there are clear groups and whether there are unusual observations.
The problem is that when you know what to look for, spurious effects don’t get in the way, but when you don’t know what to look for, you don’t know what is spurious. This can be likened to a master chess player who can look at a game in play and see at a glance what is happening, whereas the novice sees only the individual pieces, and cannot easily tell where the action is taking place. What is needed is pattern recognition.
In addition, there is considerable room for argument in interpreting scatterplots. What one person sees as a non-linear relationship, another person might see as a line with some unusual observations. My experience is that people tend to try for more complicated models than is sensible. A few unusual observations can affect how we see the graph. There is also a contextual element to the discussion. The nature of the individual observations and of the sample can make a big difference to the meaning drawn from the graph. For example, we would not really expect a scatterplot of sodium content vs energy content in food to show a strong relationship. However, if the sample of food taken is predominantly fast food, high sodium content is related to high fat content (salt on fries!) and this can appear to be a relationship between sodium and energy. In the graph below, is there really a linear relationship, or is it just because of the choice of sample?
In a set of data about fast food, there appears to be a relationship between sodium content and energy.
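To see how the choice of sample alone can produce this kind of pattern, here is a minimal simulation sketch. All of the numbers are made up for illustration (they are not the fast food data in the graph), and the two "samples", the `pearson_r` helper and the idea that fat drives both sodium and energy are assumptions of the sketch, not measurements.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical mixed sample of foods: sodium and energy are generated
# independently, so any correlation is just sampling noise.
mixed_sodium = [rng.uniform(0, 1000) for _ in range(500)]
mixed_energy = [rng.uniform(100, 2000) for _ in range(500)]

# Hypothetical fast food sample: fat content drives BOTH sodium and
# energy (assumed, in made-up units), so the two move together.
fat = [rng.uniform(5, 40) for _ in range(500)]
fast_sodium = [20 * f + rng.gauss(0, 100) for f in fat]
fast_energy = [9 * f + rng.gauss(0, 40) for f in fat]

r_mixed = pearson_r(mixed_sodium, mixed_energy)
r_fast = pearson_r(fast_sodium, fast_energy)

print(f"mixed foods: r = {r_mixed:.2f}")
print(f"fast food:   r = {r_fast:.2f}")
```

In this sketch the fast food sample shows a strong positive correlation while the mixed sample shows almost none, even though sodium never directly affects energy in either one. It is the sample, and the lurking fat variable, that creates the apparent relationship.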
An effective way to give students practice, with timely feedback, is through on-line materials. Graphs take up a lot of room on paper, so textbooks cannot easily provide the number of examples that are needed to develop fluency. With our on-line materials we provide many examples of graphs, both standard and not so well-behaved. Students choose from statements about the graphs. Most of the questions provide two graphs, as pattern recognition is easier to develop when looking at comparisons. For example, if you give one graph and ask “How strong is this relationship?”, it can be difficult to quantify. This is made easier when you ask which of two graphs has the stronger relationship.
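For teachers who want to generate their own comparison pairs, the following is a rough sketch of how two graphs with the same underlying line but different amounts of scatter could be simulated, with the correlation coefficient used as an answer key. This is not how our on-line questions are built; the data, the noise levels and the `pearson_r` helper are all assumptions for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the same pair is produced each time

# Same x values and same underlying line (y = 2x) for both graphs.
xs = [rng.uniform(0, 10) for _ in range(100)]

# Graph A: points sit close to the line (small scatter).
graph_a = [2 * x + rng.gauss(0, 1) for x in xs]
# Graph B: points are widely scattered about the line (large scatter).
graph_b = [2 * x + rng.gauss(0, 10) for x in xs]

r_a = pearson_r(xs, graph_a)
r_b = pearson_r(xs, graph_b)

# Answer key for "Which graph shows the stronger relationship?"
stronger = "A" if abs(r_a) > abs(r_b) else "B"
print(f"Graph A: r = {r_a:.2f}, Graph B: r = {r_b:.2f}, stronger: {stronger}")
```

The correlation coefficient is only a sensible answer key when both relationships are roughly linear, as they are here by construction. For curved patterns or graphs with unusual observations, the visual judgement still has to come first.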
Students get immediate feedback in a “low-jeopardy” situation. When a tutor is working one-on-one with a student, it can be worrying to the student if they get wrong answers. The computer is infinitely patient and the student can keep trying over and over until they get their answers correct, thus reinforcing correct understanding.
This system and set of questions are part of our on-line resources for teaching Bivariate investigations, which is taught within the NZ Stats 3 course. You can find out more about our resources at www.statslc.com, and any teachers who wish to explore the materials for free should email me at n.petty(at)statslc.com.
6 Comments
In general I like the framework you summarize here. However, I think your description of “Trend” and “Association” is confusing, maybe misleading, and does not necessarily match the intent of K and P, as far as I can tell from their slides. Or maybe I just don’t understand the distinction that K and P are trying to draw here between “Trend” and “Association”. To me, with 2 continuous-valued variables, “Trend” and “Association” are pretty much synonymous in this context. You can have a positive linear association, a positive non-linear association, no association, a negative linear association, or a negative non-linear association. And in some cases, such as a “U shaped” association, it is difficult to know whether to call it positive or negative; it might be positive over part of the domain and negative (or indeed flat) over a different part.
To say that “association” is about “direction” just doesn’t make sense to me. A direction word, such as positive or negative, can modify “association”, but that doesn’t make “association” and “direction” synonyms. You can also have “no association”. If you want a word for direction, why not use “direction”?
Likewise (as you suggest), if you want to emphasize the distinction between linear and non-linear associations, “shape” makes a lot more sense than “trend”. Yes, you can have a “linear trend” or a “non-linear trend”. But that doesn’t mean you can define “trend” to mean the linearity or non-linearity of something. This is simply not logical.
In my experience, “trend” and “association” part company – that is, cease being interchangeable – when one or both of your variables are categorical. In this case, we can still talk about associations (though “direction” of association may or may not be a meaningful concept), but we never use the word “trend” in such contexts.
Don’t get me wrong. I agree completely with the importance of teaching people how to interpret scatterplots, and appreciate your blog entry. And for that matter, it’s also important to teach people when to use scatterplots in the first place. I work with many professional scientists who muddle through their data looking at bar plots of one variable at a time, never even thinking to create some scatterplots. And the framework is great. But using “association” and “trend” in such confusing ways is not going to be helpful.
My modest proposal for a framework: association, shape, direction, strength, groups, unusual observations.
Thank you for that, Scott. I had found the trend/association distinction difficult to get my head around too, and you have put your finger on the problem. So do we need ‘association’ at all, or would a better framework be shape, direction, strength, groups, unusual observations? What is association talking about that isn’t covered in shape and direction?
I do think that “association” is a key concept worthy of an identifier. It gets at the general question of whether or not 2 variables are related – that is, whether or not the value of x tells you anything about the distribution of y (or vice versa) – regardless of the particular form of the potential relationship. It involves, of course, more precise ideas such as statistical independence, conditional distributions, correlation, maybe even causation. But more simply put, you can indeed think about whether or not one variable “tends to” do anything at all as the other one changes. Only once you have an (at least tentative) answer to that question does it make sense to start thinking about shape, direction, and strength of the association.
Nice. I really like that.
Could you provide a graphical illustration of the forms of relationships in scatterplots? For example, a curvilinear relationship.