The rule of 5 | Chewing Data

“We must become more comfortable with probability and uncertainty.” ~ Nate Silver

Suppose you told me that your processing team turns around over a thousand widgets per day, and that you had just asked five people how long their last widget took to process: 28, 36, 24, 30 and 31 minutes. If I said I predicted your median process time was somewhere between 24 and 36 minutes, you would probably think that a reasonable guess based on the few examples. However, if I then said I was over 93% sure of my prediction, you might think I was being rather overconfident. Let’s look at the maths behind why I can be so sure with so few data points.

This is a concept I picked up from Douglas Hubbard’s book, How to Measure Anything. It’s a fine read for anyone for whom measurement is important, especially in a business environment, where people very often think something can’t be measured properly or, conversely, that a measurement needs to be perfectly precise to be valuable.

As explained in the book, the above situation and the precision expressed are actually fairly easy to reach. First off, we need to remember that we’re talking about estimating the median of a population. The median, by its nature, has half the values in the population greater than it and half less.

So what’s the chance of any one randomly chosen data point being above or below the median? Fifty-fifty for each. Just like a coin toss.

Now we’re predicting that the median lies between the smallest and the largest sampled value. That means for us to be wrong, every value would need to be above the true median or every value be below it. We just worked out the chance of that happening once so it’s trivial to work out the chance of it happening five times.

0.5 x 0.5 x 0.5 x 0.5 x 0.5 = 0.03125

So there is a 3.125% chance of getting all five sampled values above the median. There’s another 3.125% chance of them all being lower than the median. These two values combined are the probability that we were wrong in the prediction. So we can invert this figure to establish our confidence.

1 – (0.03125 + 0.03125) = 0.9375
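The same arithmetic works for any sample size, not just five. Here is a minimal sketch in Python; the function name is my own, and it assumes the samples are picked at random and that none of them lands exactly on the median.

```python
def median_in_range_confidence(n: int) -> float:
    """Chance that the population median lies between the smallest and
    largest of n random samples (assumes no sample equals the median)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    all_above = 0.5 ** n  # every sample lands above the median
    all_below = 0.5 ** n  # every sample lands below the median
    return 1 - (all_above + all_below)


print(median_in_range_confidence(5))  # 0.9375
```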

So we have a 93.75% confidence that the sampled values weren’t all greater or all smaller than the median of the whole set of values. Or, put another way, that the median lies between the smallest and largest value of the sample set.
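If the coin-toss argument feels too neat, it can be checked by brute force. The sketch below is only an illustration: the population of 1,001 process times is made up (drawn from a skewed log-normal distribution), and the function name and parameters are my own. It repeatedly draws five values at random and counts how often the true median falls between the smallest and largest.

```python
import random
import statistics


def simulate_rule_of_five(population, sample_size=5, trials=20_000, seed=0):
    """Fraction of random samples whose min-to-max range contains the
    population median."""
    rng = random.Random(seed)
    true_median = statistics.median(population)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # without replacement
        if min(sample) <= true_median <= max(sample):
            hits += 1
    return hits / trials


# Hypothetical population: 1,001 skewed process times in minutes.
rng = random.Random(42)
population = [rng.lognormvariate(3.4, 0.3) for _ in range(1001)]
print(simulate_rule_of_five(population))
```

The printed fraction should come out close to 0.94, and it should stay there if you swap in a differently shaped population, because the argument only relies on half the values sitting on each side of the median.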

Want over 99% confidence? That’s only 8 data points. Lucky enough to have 20? You can push to 99.9998% certainty that the median will lie between the min and max values. Really though, this goes to show that as a guideline we can reach a good level of understanding with very little effort. That’s not to say the effort to get a better understanding is pointless. It may be very valuable. However, if the Rule of 5 gets you close enough you can save a lot of time and effort.
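To find how many data points a given confidence level needs, we can simply step the sample size up until the formula clears the target. This is a rough sketch with a function name of my own choosing, under the same random-sampling assumption as before.

```python
def samples_needed(target: float) -> int:
    """Smallest n for which 1 - 2 * 0.5**n is at least `target`."""
    if not 0 <= target < 1:
        raise ValueError("target must be in [0, 1)")
    n = 1
    while 1 - 2 * 0.5 ** n < target:
        n += 1
    return n


print(samples_needed(0.99))  # 8
print(1 - 2 * 0.5 ** 20)     # confidence with 20 data points
```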

It’s very easy for us as analysts to want more data – this isn’t a bad thing but we also need to appreciate the value of what we have. A small sample may not give us the precise answer we’re looking for but it can give us a range to work with. Even if this range is fairly large (my example was pretty consistent) it may provide us more information than we had before. This gives us a better position from which to make decisions or direct our efforts.