23 July 2014
4 September 2014

# Why do we teach about random variables, and why is it so difficult to understand?

Probability and statistics go together pretty well, and basic probability is included in most introductory statistics courses. Often maths teachers prefer the probability section as it is more mathematical than inference or exploratory data analysis. Both probability and statistics deal with the idea of uncertainty and chance, statistics mostly being about what has happened, and probability about what might happen. Probability can be, and often is, reduced to fun little algebraic puzzles, with little link to reality. But a sound understanding of the concepts of probability and distribution is essential to H.G. Wells’s “efficient citizen”.
When I first started on our series of probability videos, I wrote about the worth of probability. Now we are going a step further into the probability topic abyss, with random variables. For an introductory statistics course, it is an interesting question of whether to include random variables. Is it necessary for the future marketing managers of the world, the medical practitioners, the speech therapists, the primary school teachers, the lawyers to understand what a random variable is? Actually, I think it is. Maybe it is not as important as understanding concepts like risk and sampling error, but random variables are still important.

## Random variables

Like many concepts in our area, once you get what a random variable is, it can be hard to explain. Now that I understand what a random variable is, it is difficult to remember what was difficult to understand about it. But I do remember feeling perplexed, trying to work out what exactly a random variable was. The lecturers use the term freely, but I remember (many decades ago) just not being able to pin down what a random variable is. And why it needed to exist.
To start with, the words “random variable” are difficult on their own. I have dedicated an entire post to the problems with “random”, and in the writing of it, discovered another inconsistency in the way that we use the word. When we are talking about a random sample, random implies equal likelihood. Yet when we talk about things happening randomly, they are not always equally likely. The word “variable” is also a problem. Surely all variables vary? Students may wonder what a non-random variable is – I know I did.
I like to introduce the idea of variables as part of mathematical modelling. We can have a simple model:

Cost of event = hall hire + per capita charge × number of guests.

In this model, the hall hire and per capita charge are both constants, and the number of guests is a variable. The cost of the event is also a variable, and can be expressed as a function of the number of guests. And vice versa! Now if we know the number of guests, we can then calculate the cost of the event. But the number of guests may be uncertain – it could be something between 100 and 120. It is thus a random variable.
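As a minimal sketch of this model in Python: the dollar amounts below are hypothetical, and so is the assumption that every guest count from 100 to 120 is equally likely. It is only meant to show that once the number of guests is uncertain, the cost of the event becomes uncertain too.

```python
import random

HALL_HIRE = 500.0          # hypothetical constant, in dollars
PER_CAPITA_CHARGE = 40.0   # hypothetical constant, in dollars per guest


def event_cost(number_of_guests):
    """Cost of event = hall hire + per capita charge x number of guests."""
    return HALL_HIRE + PER_CAPITA_CHARGE * number_of_guests


def guests_from_cost(cost):
    """The 'vice versa' direction: work back from the cost to the guests."""
    return (cost - HALL_HIRE) / PER_CAPITA_CHARGE


rng = random.Random(1)  # seeded so the simulation is repeatable

# Assumption for illustration only: each whole number of guests
# from 100 to 120 is equally likely.
simulated_guests = [rng.randint(100, 120) for _ in range(5)]
simulated_costs = [event_cost(g) for g in simulated_guests]

for guests, cost in zip(simulated_guests, simulated_costs):
    print(f"{guests} guests -> cost {cost:.2f}")
```

Because the cost is a function of a random variable, it is itself a random variable. With these made-up constants it can land anywhere between 4500 and 5300.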
Another way to look at a random variable is to come from the other direction – start with the random part and add the variable part. When something random happens, sometimes the outcome is discrete and non-numerical, such as the sex of a baby, the colour of a tulip, or the type of fruit in a lunchbox. But when the random outcome is given a value, then it becomes a random variable.
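A small sketch of this second view, using the lunchbox example. The probabilities and the weights in grams are invented for illustration. The point is only that the random outcome is a word, and the random variable is the number we choose to attach to it.

```python
import random

# Hypothetical probabilities for the type of fruit in a lunchbox.
fruit_probabilities = {"apple": 0.5, "banana": 0.3, "orange": 0.2}

# Hypothetical value given to each outcome: weight of the fruit in grams.
fruit_weight_grams = {"apple": 150, "banana": 120, "orange": 130}


def random_fruit(rng):
    """Draw one non-numerical outcome according to the probabilities."""
    fruits = list(fruit_probabilities)
    weights = list(fruit_probabilities.values())
    return rng.choices(fruits, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the draw is repeatable
outcome = random_fruit(rng)          # a random outcome, such as "apple"
value = fruit_weight_grams[outcome]  # the random variable: a number

print(outcome, value)
```

The choice of value is up to the modeller. Weight is one option; price or energy content would give different random variables built on the same random outcome.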

## Distributions

Then we come to distributions. I fear that too often distributions are taught in such a way that students believe that the normal or bell curve is a property guiding the universe, rather than a useful model that works in many different circumstances. (Rather like Adam Smith’s invisible hand that economists worship.) I’m pretty sure that is what I believed for many years, in my fog of disconnected statistical concepts. Somewhat telling is the tendency for examples to begin with the words, “The life expectancy of a particular brand of lightbulb is normally distributed with a mean of …” or similar. Worse still, some don’t even mention the normal distribution, and simply say “The mean income per household in a certain state is \$9500 with a standard deviation of \$1750. The middle 95% of incomes are between what two values?” Students are left to assume that the normal distribution will apply, which in the second case is only a very poor approximation as incomes are likely to be skewed. This sloppy question-writing perpetuates the idea of the normal distribution as the rule that guides the universe.
Take a look at the textbook you use, and see what language it uses when asking questions about the normal distribution. The two examples above are from a popular AP statistics test preparation text.
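To see how much the unstated assumption matters in the income question, here is a rough sketch in Python. The first calculation assumes incomes are normal, which is what the question seems to expect. The second uses a lognormal model with the same mean and standard deviation. The lognormal is purely my assumption as one example of a skewed model, not something the question or the textbook supplies.

```python
import math
from statistics import NormalDist

MEAN = 9500.0  # mean income from the question
SD = 1750.0    # standard deviation from the question

# Middle 95% under the unstated assumption that incomes are normal.
normal_model = NormalDist(mu=MEAN, sigma=SD)
normal_low = normal_model.inv_cdf(0.025)
normal_high = normal_model.inv_cdf(0.975)

# Assumption (not in the question): incomes are lognormal, with the
# parameters chosen so that the mean and SD match the question.
sigma_log = math.sqrt(math.log(1.0 + (SD / MEAN) ** 2))
mu_log = math.log(MEAN) - sigma_log ** 2 / 2.0
log_income_model = NormalDist(mu=mu_log, sigma=sigma_log)
skewed_low = math.exp(log_income_model.inv_cdf(0.025))
skewed_high = math.exp(log_income_model.inv_cdf(0.975))

print(f"normal model:    {normal_low:.0f} to {normal_high:.0f}")
print(f"lognormal model: {skewed_low:.0f} to {skewed_high:.0f}")
```

Under the normal assumption the answer is about \$6070 to \$12930. The skewed model puts both ends of the interval higher, even though it has exactly the same mean and standard deviation. The answer depends on the model, and the question never says which model to use.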
I thought I’d better take a look at what Khan Academy did to random variables. I started watching the first video and immediately got hit with the flipping coin and rolling dice. No, people – this is not the way to introduce random variables! No one cares how many coins are heads. And even worse he starts with a zero/one random variable because we are only flipping one coin. And THEN he says that he could define a head as 100 and tail as 703 and…. Sorry, I can’t take it anymore.

## A good way to introduce random variables

After LOTS of thinking and explaining, and trying stuff out, I have come up with what I think is a revolutionary and fabulous way to introduce random variables and distributions. To begin with we use a discrete empirical distribution to illustrate the idea of a random variable. The random variable models the number of ice creams per customer.

Then we use that discrete distribution to teach about expected value and standard deviation, and combining random variables. The third video introduces the idea of families of distributions, and shows how different distributions can be used to model the same random process.
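The expected value, standard deviation and combining steps can be sketched in a few lines of Python. The probabilities below are invented for illustration and are not the figures used in the videos. The combining step also assumes that the two customers buy independently of each other.

```python
import math
from itertools import product

# Hypothetical empirical distribution: number of ice creams per customer.
ice_creams_pmf = {1: 0.50, 2: 0.30, 3: 0.15, 4: 0.05}


def expected_value(pmf):
    """Sum of each value multiplied by its probability."""
    return sum(x * p for x, p in pmf.items())


def standard_deviation(pmf):
    """Square root of the probability-weighted squared deviations."""
    mean = expected_value(pmf)
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return math.sqrt(variance)


def sum_of_independent(pmf_a, pmf_b):
    """Distribution of A + B, assuming A and B are independent."""
    total = {}
    for (a, pa), (b, pb) in product(pmf_a.items(), pmf_b.items()):
        total[a + b] = total.get(a + b, 0.0) + pa * pb
    return total


two_customers_pmf = sum_of_independent(ice_creams_pmf, ice_creams_pmf)

print("one customer: ", expected_value(ice_creams_pmf),
      standard_deviation(ice_creams_pmf))
print("two customers:", expected_value(two_customers_pmf),
      standard_deviation(two_customers_pmf))
```

With these made-up probabilities the expected value is 1.75 ice creams per customer. For two independent customers the expected value doubles to 3.5 and the variance doubles as well, so the standard deviation grows by a factor of the square root of 2 rather than 2.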
Another unusual feature is the introduction of the triangular distribution, which is part of the New Zealand curriculum. You can read here about the benefits of teaching the triangular distribution.

##### Dr Nic

1. Anna says:

Hi Dr Nic. I like the way your videos combine formulas with explanation and visual cues. Just wondering though – why are categorical random variables ignored (the ethnicity of the next customer, for example)?

• Dr Nic says:

Hi Anna
Thanks. To be honest I’ve always taught that something had to take numeric values to be classed as a “random variable”. At introductory level, it works. You’ve got me wondering now.

• David Butler says:

It is indeed something to wonder about, even just considering terminology. Most stats courses have a section about variables in data, classifying them as quantitative/numerical or qualitative/categorical. When we study probability we have questions that ask about the probabilities of certain events, that often are described in words with no reference to numbers. However, when we come to RANDOM variables, they suddenly are only allowed to be numbers. This is highly confusing!
I think the reason is that when doing statistical modeling you need to have equations and expected values, which don’t work with things that aren’t numbers. If you think carefully about how statistical modelling is done, the non-number variable is always converted into one or several variables that are 0 or 1, so that the mathematical formulas work. Another point worth making is that the only summary statistic that makes sense for a word-variable is the mode.

2. Cheryl says:

Where exactly do I find the videos?! I seem to be overlooking them…

• Dr Nic says:

Hi Cheryl
Thanks for drawing my attention to this. Here is a link to the first one. https://www.youtube.com/watch?v=lHCpYeFvTs0 The other two videos have been made private in order to make enough money to keep our business going. However, if you email me at n.petty@statsLC.com, I’ll happily give you access.

3. Darcey Puckett says:

Great video! Would love the others.

4. Rolf Arnesen says:

Why is the number of ice cream cones the next person buys a random variable? Isn’t that number based on a lot of factors and not simply based on chance? I do not see the randomness of this.

• Dr Nic says:

Hi
Randomness means that it can take a number of different values, and we don’t know ahead of time which one it will take. There is very little that is simply based on chance. Some would say that nothing is based on chance, but rather that we do not know the contributing factors. For example, the air temperature can be modelled as a random variable, even though there is a definite cause for it. Remember that probability is a model to help explain reality. I hope this explanation helps, as it is a very good question, and one that continues to vex philosophers, mathematicians and statisticians.
Nic

5. Alan Mainwaring says:

Not sure I agree that a discrete random variable has to be associated with a uniform distribution to be called a random process. The axiomatic definition given in many advanced textbooks defines a random variable as a REAL-valued function defined on a sample space. It is quite possible that each simple discrete event could have a different probability, and yet the process associated with repetitions of the experiment could still be a random experiment. Note that the assignment of the real numbers to the actual events is completely arbitrary. When we associate money with each outcome we could choose anything we like. No wonder it’s confusing.
I still find the concept of a random variable a very difficult idea mathematically; in equations they can even look like conventional algebraic variables.

• Dr Nic says:

Hi Alan
I don’t think I said that a discrete r.v. has to be associated with a uniform distribution. Far from it – a random variable can be modelled with a binomial or Poisson distribution, or an experimental distribution.

• Alan Mainwaring says:

Yes Dr Nic, I can see what I have done: I pulled something out of context from a previous post. Quote: “When we are talking about a random sample, random implies equal likelihood”. That was right at the start of this discussion about how confusing the term “random variable” is. After re-reading it, I see you were in fact challenging this view. Sorry about that. But I have seen this statement even in undergraduate textbooks and it is not correct. As you say, probability distributions can be any shape you like. Anyway, this is a good discussion and an important one.
I like the story about one of the greatest mathematicians of the 20th century, Paul Erdos, who got the interpretation of the Monty Hall game completely wrong; this is the one about the goat and the car. It’s a classic case showing that anything to do with probability and sample spaces is really difficult. It even confused Paul Erdos.