9 August 2018
8 October 2018

# Population, Sample, Sampling distribution of the mean

Among the many contenders for Dr Nic’s confusing terminology award is the term “Sampling distribution.” One problem is that it is introduced around the same time as population, distribution, sample and the normal distribution. A common confusion is between the standard error and the standard deviation.

Here is how I explain it:

## Basic idea of inferential statistics

Whenever we take a sample it will contain sampling error, which can also be described as sampling variation. No sample is a perfect representation of the population. When you calculate a sample mean, you do not expect it to be exactly the population mean. But statisticians have discovered that the means of samples behave a certain way, and we can use this information to form our confidence intervals and test hypotheses.

## Population

The population is all the objects of interest. It exists, but we cannot know everything about it. This may be because it is too big, too tricky or too expensive to measure, or because measuring it would destroy it. For whatever reason, we cannot find out exactly what we wish to know.

The distribution of the population is consequently unknown. We do not know the mean, the spread or the shape of the distribution of the population. All of these values exist, but we do not know them. We may or may not know the size of the population.

## Sample

We take a sample from the population. As far as possible it will be a random sample. I have written elsewhere about the myth of random sampling. We know all about the sample. We know the mean, the spread and the shape of the distribution of the sample. We know how big the sample is. We do not know exactly how well the sample approximates the population, but we expect it to be similar to the population. We also believe that our sample is the best information we have about the population.

## Sampling distribution of the mean

This is where lots of people come unstuck. The sampling distribution of the mean does not exist as a set of data; it is theoretical. It is the distribution of the means we would get if we took an infinite number of samples of the same size as our sample. We do not know the mean or the spread of this distribution, but we can use information from our sample, together with the Central Limit Theorem, to get a fair idea of what the sampling distribution of the mean looks like and how it behaves. The mean of the sampling distribution is theoretically equal to the population mean, and our best estimate of both is the sample mean. The spread of the sampling distribution depends on the spread of the population and the size of the sample: in theory it is the standard deviation of the population divided by the square root of the sample size (σ/√n). Because the standard deviation of the population is unknown, we use the standard deviation of the sample instead, which gives the standard error, s/√n.
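The relationship between the standard error and the spread of many sample means can be checked with a small simulation. This is only a sketch under stated assumptions: the population is invented (100,000 skewed values), and the sample size of 40 and the 5,000 repeated samples are arbitrary choices. In a real analysis we would have only the one sample.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population (an assumption for illustration): 100,000 skewed values.
population = [random.expovariate(1 / 50) for _ in range(100_000)]
# sigma is known here only because we invented the population.
sigma = statistics.pstdev(population)

n = 40  # sample size

# Real life: one sample, so we estimate the standard error with s / sqrt(n).
sample = random.sample(population, n)
s = statistics.stdev(sample)
estimated_se = s / n ** 0.5

# Thought experiment: many samples of the same size, and the spread of their means.
sample_means = [
    statistics.mean(random.sample(population, n)) for _ in range(5_000)
]
spread_of_means = statistics.stdev(sample_means)
theoretical_se = sigma / n ** 0.5

print(f"sigma / sqrt(n) (needs the population): {theoretical_se:.2f}")
print(f"spread of 5,000 simulated sample means: {spread_of_means:.2f}")
print(f"s / sqrt(n) from the one sample:        {estimated_se:.2f}")
```

The first two numbers should be close to each other. The third is an estimate from a single sample, so it will vary from sample to sample.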

So when we create confidence intervals of means, we are using the sampling distribution of the mean to say within which interval we would expect our population mean to lie, with specified levels of confidence.
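As a rough illustration, a confidence interval for the mean can be built from one sample as the sample mean plus or minus a multiplier times the standard error. The sketch below makes several assumptions: the function name `mean_confidence_interval` is hypothetical, the data are made up, and the multiplier comes from the normal distribution, which is a large-sample approximation. Many courses use a multiplier from the t distribution instead, which gives a slightly wider interval.

```python
import random
import statistics


def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the population mean from one sample."""
    n = len(values)
    sample_mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / n ** 0.5  # s / sqrt(n)
    # Normal multiplier (about 1.96 for 95%); a t multiplier would be slightly larger.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


random.seed(2)
# Made-up sample of 50 measurements, for illustration only.
sample = [random.gauss(170, 10) for _ in range(50)]

low, high = mean_confidence_interval(sample)
print(f"sample mean: {statistics.mean(sample):.1f}")
print(f"approximate 95% confidence interval: {low:.1f} to {high:.1f}")
```

Asking for a higher level of confidence, for example `confidence=0.99`, gives a wider interval from the same sample.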

## Population, Sample, Sampling distribution

| Population | Sample | Sampling distribution of the mean |
| --- | --- | --- |
| All the objects of interest | The set of objects drawn from the population | The means we might get if we took lots of samples of the same size |
| Exists, but unknown | Exists and known | Theoretical |
| Population distribution: the variation in the values in the population | Sample distribution: the variation in the values in the sample | Sampling distribution of the mean (sometimes shortened to sampling distribution): the variation in the sample means we might draw from the population |
| Population standard deviation (σ): a measure of how spread out the population values are | Sample standard deviation (s): a measure of how spread out the sample values are | Standard error (s/√n): a measure of how spread out we would expect sample means to be if we had a whole lot of them |
| Population mean: the thing we are interested in, and do not know | Sample mean: the mean value calculated from the sample values | The mean of the sampling distribution is the mean of the sample means, and is theoretically equal to the population mean |
| We do not know the population mean. | We find just one sample mean. | We use the Central Limit Theorem to estimate how spread out a whole lot of sample means might be. |

In this diagram you can see that the population distribution is bimodal, and far from bell-shaped. A sample taken from the population gives the sample mean shown in black. The sampling distribution of the mean is bell-shaped and narrower than the population distribution. This is explained in the following video, "Understanding the Central Limit Theorem". The video uses an imaginary data set to illustrate how the Central Limit Theorem, or the Central Limit effect, works. In a real-life analysis we would not have population data, which is why we would take a sample.
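The same effect can be reproduced with an imaginary data set of our own. The sketch below assumes a made-up bimodal population (an equal mix of two normal groups) and a sample size of 30. None of these numbers come from the diagram or the video. It checks that the sample means are much less spread out than the population, and that roughly 68% and 95% of them fall within one and two standard deviations of their centre, as a bell-shaped distribution would suggest.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable


def draw_value():
    """One value from a made-up bimodal population (two separate humps)."""
    if random.random() < 0.5:
        return random.gauss(20, 3)
    return random.gauss(60, 5)


population = [draw_value() for _ in range(50_000)]
population_sd = statistics.pstdev(population)

n = 30  # sample size
means = [statistics.mean(random.sample(population, n)) for _ in range(5_000)]

centre = statistics.mean(means)
spread = statistics.stdev(means)

# Share of sample means within one and two standard deviations of their centre.
within_one = sum(abs(m - centre) <= spread for m in means) / len(means)
within_two = sum(abs(m - centre) <= 2 * spread for m in means) / len(means)

print(f"population standard deviation: {population_sd:.1f}")
print(f"spread of the sample means:    {spread:.1f}")
print(f"within 1 SD: {within_one:.0%}, within 2 SD: {within_two:.0%}")
```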

##### Dr Nic

1. Rohan says:

A simple way to explain this issue through an example is given below:

First define the population we are interested in, then tell the audience that we can’t collect all the information from the population for various reasons (expense, time…). To manage this situation, sampling is required. Now person A collects a sample from the population, and person B collects a sample from the same population. The estimates (means) from persons A and B are different because they have different samples, so the estimate varies due to sampling. This variation needs to be measured, and it is defined as sampling error.

Your detailed information is understandable for mathematicians / statisticians, but what about non-statisticians?

• Dr Nic says:

Hi Rohan
Thanks for that. Your explanation is great at the level you describe. Many of my videos are aimed at that level.
However, many courses teach the sampling distribution of the mean and it is very confusing, which is what this post is about. Hopefully it will help teachers to explain it better.

• Rohan says:

Thanks Nic.
Yes, I agree with your comment, especially ‘confusing’ (some people explain simple things in a complicated way).

I explained only two sampling situations. Likewise, if we increase the number of people collecting samples, we will have a number of means, which form a distribution. It is known as the sampling distribution of the ‘mean’. But in practice we often collect only one sample, so what to do?
Fortunately, we have the CLT, which allows us to define the sampling distribution of the mean from one sample. Then explain the CLT…..

I prefer to explain statistical terms in simple language (like a story) rather than in statistical language.