Among the many contenders for Dr Nic’s confusing terminology award is the term “Sampling distribution.” One problem is that it is introduced around the same time as population, distribution, sample and the normal distribution. A common confusion is between the standard error and the standard deviation.
Here is how I explain it:
Whenever we take a sample it will contain sampling error, which can also be described as sampling variation. No sample is a perfect representation of the population. When you calculate a sample mean, you do not expect it to be exactly the population mean. But statisticians have discovered that the means of samples behave a certain way, and we can use this information to form our confidence intervals and test hypotheses.
The population is all the objects of interest. It exists, but we don’t know everything about it. We cannot know everything about the population. This is possibly because it is too big, or too tricky to measure, or too expensive to measure, or maybe measuring it will destroy it. For whatever reason, we cannot find out exactly what we wish to.
The distribution of the population is consequently unknown. We do not know the mean, the spread or the shape of the distribution of the population. All of these values exist, but we do not know them. We may or may not know the size of the population.
We take a sample from the population. As much as possible it will be a random sample. You can read my thoughts on the myth of random sampling here. We know all about the sample. We know the mean, the spread and the shape of the distribution of the sample. We know how big the sample is. We do not know exactly how well the sample approximates the population, but if it has been drawn well we expect it to be similar to the population. We also believe that our sample is the best information we have about the population.
This is where lots of people come unstuck. The sampling distribution of the mean does not exist in the way the population and the sample do; it is theoretical. It is the distribution of the means we would get if we took an infinite number of samples of the same size as our sample. We do not know the mean or the spread of this distribution, but we can use information from our sample, and from the Central Limit Theorem, to get a fair idea of what the sampling distribution of the mean looks and acts like. The mean of the sampling distribution is equal to the population mean, and our best estimate of it is the sample mean. The spread of the sampling distribution depends on the spread of the population and the size of the sample: it is the standard deviation of the population divided by the square root of the sample size. But because the standard deviation of the population is unknown, we use the standard deviation of the sample instead, and call the result the standard error.
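To make the calculation concrete, here is a minimal sketch in Python of estimating the standard error from a single sample. The function name `standard_error` and the sample values are made up for illustration; only the formula s/√n comes from the explanation above.

```python
import math
import statistics


def standard_error(sample):
    """Estimate the spread of the sampling distribution of the mean as s / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return s / math.sqrt(n)


# Hypothetical sample values, invented purely for illustration.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 10.9, 9.5, 11.7, 12.4]

sample_mean = statistics.mean(sample)  # our best estimate of the population mean
se = standard_error(sample)            # our estimate of how much sample means vary

print(f"sample mean = {sample_mean:.2f}, standard error = {se:.2f}")
```

Notice that the standard error shrinks as the sample gets bigger, even if the sample standard deviation stays about the same.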
So when we create confidence intervals of means, we are using the sampling distribution of the mean to say within which interval we would expect our population mean to lie, with specified levels of confidence.
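As a rough sketch of how this works in practice, the Python below builds a confidence interval for a mean from one sample. The function name `mean_confidence_interval` is hypothetical, and it assumes the sample is large enough for a normal multiplier to be reasonable; for small samples a multiplier from the t-distribution would usually be used instead.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) for the population mean, using a normal approximation.

    Assumption: the sample is large enough that a normal multiplier is adequate.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error, s / sqrt(n)
    # Multiplier that leaves (1 - confidence) / 2 in each tail of the normal curve.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * se, mean + z * se


# Hypothetical sample values, invented purely for illustration.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 10.9, 9.5, 11.7, 12.4]
lower, upper = mean_confidence_interval(sample)
print(f"95% confidence interval for the mean: ({lower:.2f}, {upper:.2f})")
```

The interval is centred on the sample mean, and its width comes entirely from the standard error and the level of confidence we ask for.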
|Population|Sample|Sampling distribution of the mean|
|---|---|---|
|All the objects of interest|The set of objects drawn from the population|The means we might get if we took lots of samples of the same size|
|Exists, but unknown|Exists and known|Theoretical|
|Population distribution – the variation in the values in the population|Sample distribution – the variation in the values in the sample|Sampling distribution of the mean (sometimes shortened to sampling distribution) – the variation in the sample means we might draw from the population|
|Population standard deviation (σ) – a measure of how spread out the population values are|Sample standard deviation (s) – a measure of how spread out the sample values are|Standard error (s/√n) – a measure of how spread out we would expect sample means to be if we had a whole lot of them|
|Population mean – the thing we are interested in, and do not know|Sample mean – the mean value calculated from the sample values|The mean of the sampling distribution is the mean of the sample means, and is theoretically equal to the population mean|
|We do not know the population mean.|We find just one sample mean.|We use the Central Limit Theorem to estimate how spread out a whole lot of sample means might be.|
In this diagram you can see that the population distribution is bimodal, and far from bell-shaped. A sample taken from the population will lead to the sample mean in black. The sampling distribution of the mean is bell-shaped and narrower than the population distribution. This is explained in the following video, Understanding the Central Limit Theorem. This video uses an imaginary data set to illustrate how the Central Limit Theorem, or the Central Limit effect, works. In a real-life analysis we would not have population data, which is why we would take a sample.
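If you would like to see the same effect for yourself, here is a small simulation sketch in Python. Everything in it is imaginary: the bimodal population is made up from two normal clusters, and the names `simulate_sample_means`, `population` and `means` are my own. It takes many samples of the same size and compares the spread of their means with σ/√n.

```python
import random
import statistics


def simulate_sample_means(population, sample_size, num_samples, seed=0):
    """Draw many samples (with replacement) and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.choices(population, k=sample_size))
        for _ in range(num_samples)
    ]


# Imaginary bimodal population: two clusters of values, far from bell-shaped.
rng = random.Random(1)
population = [rng.gauss(20, 3) for _ in range(5000)] + [
    rng.gauss(50, 4) for _ in range(5000)
]

sigma = statistics.pstdev(population)  # population standard deviation
means = simulate_sample_means(population, sample_size=30, num_samples=2000)

print("population mean:        ", round(statistics.mean(population), 2))
print("mean of sample means:   ", round(statistics.mean(means), 2))
print("spread of sample means: ", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):        ", round(sigma / 30 ** 0.5, 2))
```

The last two numbers should come out close to each other, and a histogram of `means` would look roughly bell-shaped even though the population has two humps.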