I hope you committed to a response in the box before reading this post. This is an important topic. Recently I read an amusing blog post about poor sampling technique. The tweet that led to the link called it “a humorous look at sample error”. I hope the person who tweeted meant bad sampling, because the story was not about sampling error. And that is because sampling error isn’t. Isn’t what? It isn’t error. It doesn’t occur by mistake, and it is not caused by bad procedures. There is nothing practical you can do when sampling to avoid sampling error. Sampling error exists because you are taking a sample. The only way to avoid it is to test the entire population – in which case it isn’t a sample, it’s a census.

This is a vivid example of a word in common use being given a very specific meaning within a discipline, which then confuses the heck out of everyone. It has been found that even students who get A grades in first-year statistics at university often have serious flaws and gaps in their understanding of statistics. I would predict that the idea of sampling error will be a cavernous hole of misunderstanding for most.

The problem is not sampling error, but bias. Take a perfect random sample, where each object in the population has an equal probability of selection. This will reduce, and perhaps even eliminate, bias. But sampling error will remain.
Because of natural variation, it is unlikely that all people send the same number of texts in a day.
So how do you teach this? I use the approach of talking about variation*. Variation is inherent in all natural, human and manufacturing processes. We then classify variation into four categories: natural, explainable, sampling and bias.

The term “natural variation” describes the omnipresence of variation in real life. “Explainable variation” is what we are often looking for in statistical analysis – can we use the age of a car, for instance, to help explain some of the variation in car prices? Sampling variation (also known as sampling error) occurs when we take a sample and use it to draw conclusions about the population; we would not expect two samples from the same population to yield exactly the same results. The fourth category is variation due to biased sampling.

This approach is not comprehensive, and the terminology can be a bit clunky, jumping between variation and error. But it gives students a framework for distinguishing sampling error/variation from error due to biased sampling. We do classroom activities where students take different samples from the same population to illustrate sampling variation/error. This is important. People in general need to understand that samples are not going to represent the population exactly, and that through the use of theoretical probability models statisticians and analysts do allow for that sampling error. Bias, however, is another story for another day. You can see how we explain the different kinds of variation in this YouTube video:
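The classroom activity can also be run as a quick simulation. The sketch below is purely illustrative – the population of Facebook friend counts and its range are invented for the example – but it shows the key point: two perfectly random samples from the same population still give different means, and that difference is sampling variation, not a mistake by either sampler.

```python
import random

# Hypothetical population: Facebook friend counts for 1,000 people.
# (These numbers are invented purely for illustration.)
random.seed(1)
population = [random.randint(50, 500) for _ in range(1000)]

def sample_mean(pop, n):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return sum(random.sample(pop, n)) / n

# Two random samples of ten, taken exactly the same way...
mean_a = sample_mean(population, 10)
mean_b = sample_mean(population, 10)

# ...almost never agree. No procedure was flawed; only a census
# (measuring all 1,000 people) would remove this variation.
print(mean_a, mean_b)
```

Re-running the last few lines many times gives a spread of sample means – the sampling distribution – which is exactly the variation that theoretical probability models allow for.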
By the way – the correct answer to the question at the start of the post is False. No sampling method, no matter how good, will eliminate sampling error. Let’s see if you get it – here are some statements about variation. Classify each of the following as an example of natural variation, explainable variation, sampling variation or variation due to biased sampling. I’ll put the answers in the comments to this blog.
When I bike to work, sometimes it takes me longer than other times.
When I bike to work with a head wind, it generally takes me longer than with a tail wind.
Two students each took random samples of ten students from their class and asked them how many friends they have on Facebook. They got different values for their means.
Two students each asked eight of their friends how many friends they have on Facebook. They got different values for their means.
*Note: This approach is based on the thought-provoking work of Wild and Pfannkuch, reported in “Statistical Thinking in Empirical Enquiry”, International Statistical Review (1999), p. 235.