
I am looking for various ways of explaining to my students (in an elementary statistics course) what a two-tailed test is, and how its p-value is calculated.

How do you explain the two- vs. one-tailed test to your students?

whuber
Tal Galili

2 Answers


This is a great question and I'm looking forward to everyone's version of explaining the p-value and the two-tailed vs. one-tailed test. I've been teaching statistics to fellow orthopaedic surgeons, so I try to keep it as basic as possible, since most of them haven't done any advanced math for 10-30 years.

My way of explaining calculating p-values & the tails

I start by explaining that if we believe we have a fair coin, it should land tails in 50 % of flips on average ($=H_0$). If you now wonder what the probability is of getting only 2 tails out of 10 flips with this fair coin, you can calculate that probability as I've done in the bar graph. From the graph you can see that the probability of getting exactly 2 tails out of 10 flips with a fair coin is about $\approx 4.4\%$.
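This single-bar value is easy to verify (a quick check of the 4.4 % figure, not part of the original answer):

```r
# Probability of exactly 2 tails in 10 flips of a fair coin (H0: p = 0.5)
dbinom(2, size = 10, prob = 0.5)  # 45/1024, about 4.4 %
```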

Since we would question the fairness of the coin just as much if we got only 1 or 0 tails, we have to include these possibilities as well: the tail of the test. Adding these values, we get that the probability of getting 2 tails or fewer is about $\approx 5.5\%$.

Now if we got only 2 heads, i.e. 8 tails (the other tail), we would probably be just as willing to question the fairness of the coin. This means that you end up with a probability of $5.4\ldots\% + 5.4\ldots\% \approx 10.9\%$ for a two-tailed test.
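The one- and two-tailed sums above can be reproduced with the cumulative binomial distribution; because the null distribution is symmetric at $p = 0.5$, the built-in exact test gives the same two-tailed value (my own check, not from the original slides):

```r
# One-tailed: 2 tails or fewer (includes the more extreme outcomes 0 and 1)
p_one <- pbinom(2, size = 10, prob = 0.5)                # about 5.5 %
# Two-tailed: add the mirror-image tail, 8 tails or more
p_two <- p_one + (1 - pbinom(7, size = 10, prob = 0.5))  # about 10.9 %
# The exact binomial test agrees for this symmetric null:
binom.test(2, 10, p = 0.5)$p.value                       # 0.109375
```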

Since in medicine we are usually interested in studying failures, we need to include the opposite side of the probability, even if our intent is to do good and to introduce a beneficial treatment.

My flipping coins graph

Reflections slightly out of topic

This simple example also shows how dependent we are on the null hypothesis when calculating the p-value. I also like to point out the resemblance between the binomial distribution and the bell curve. Changing to 200 flips gives a natural way of explaining why the probability of getting exactly 100 tails starts to lose relevance. Defining intervals of interest is then a natural transition to probability density/mass functions and their cumulative counterparts.
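The 200-flip point can be made concrete with a quick calculation (my own illustration): the probability of any single exact outcome shrinks as the number of flips grows, while an interval of outcomes retains meaningful probability mass:

```r
dbinom(5, size = 10, prob = 0.5)     # about 24.6 %: the most likely outcome at n = 10
dbinom(100, size = 200, prob = 0.5)  # about 5.6 %: exact outcomes lose relevance
# An interval, by contrast, still carries substantial probability:
pbinom(110, size = 200, prob = 0.5) - pbinom(89, size = 200, prob = 0.5)
```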

In my class I recommend the Khan Academy statistics videos, and I also use some of his explanations for certain concepts. The students also get to flip coins, where we look into the randomness of coin flipping; the thing I try to show, inspired by this Radiolab episode, is that randomness is more random than we usually believe.

The code

I usually show one graph per slide. This is the R code that I used to create the graphs:

library(graphics)

binom_plot_function <- function(x_max, my_title = FALSE, my_prob = 0.5, edges = 0,
                                col = c("green", "gold", "red")) {
  # Bar plot of binomial probabilities (in %) for 0..x_max tails.
  # The outermost `edges` bars on each side are colored with col[1]/col[3]
  # to highlight the tails; the middle bars use col[2].
  barplot(
    dbinom(0:x_max, x_max, my_prob) * 100,
    col = c(rep(col[1], edges), rep(col[2], x_max - 2 * edges + 1), rep(col[3], edges)),
    ylab = "Probability %",
    xlab = "Number of tails",
    names.arg = 0:x_max)
  if (my_title != FALSE) {
    title(main = my_title)
  }
}

binom_plot_function(10, paste("Flipping coins", 10, "times"), edges=0, col=c("#449944", "gold", "#994444"))
binom_plot_function(10, edges=3, col=c(rgb(200/255, 0, 0), "gold", "gold"))
binom_plot_function(10, edges=3, col=c(rgb(200/255, 0, 0), "gold", rgb(200/255, 100/255, 100/255)))
Max Gordon
  • Great answer Max - and thank you for recognizing the non-triviality of my question :) – Tal Galili Dec 01 '11 at 12:31
  • +1 nice answer, very thorough. Forgive me, but I'm going to nitpick on two things. 1) the p-value is understood as the probability of data being as extreme or more extreme as yours under the null, thus your answer is right. However, when using discrete data like your coin flips, this is inappropriately conservative. It's best to use what's called the "mid p-value", i.e. 1/2 the probability of data as extreme as yours + the probability of data being more extreme. An easy discussion of these issues can be found in Agresti (2007) 2.6.3. (cont.) – gung - Reinstate Monica Dec 02 '11 at 05:48
  • You state that randomness is more random than we believe. I can guess what you might mean by that (I haven't had a chance to listen to the Radiolab episode you link, but I will). Curiously enough, I've always told students that randomness is less random than you believe. I'm referring here to the perception of streaks (e.g., in gambling). People believe that random events should alternate much more than random events actually do, and as a result believe they see streaks. See Falk (1997) Making sense of randomness Psych Rev 104,2. Again, you're not wrong--just food for thought. – gung - Reinstate Monica Dec 02 '11 at 06:00
  • Thank you @gung for your input. I've actually not heard of the mid p-value - it makes sense though. I'm not sure it's something I would mention when teaching basic statistics, since it may come at the cost of losing the hands-on feeling that I try to give. Concerning randomness we mean exactly the same - when seeing a truly random number we are fooled into thinking there's a pattern to it. I think I heard on the Freakonomics podcast folly of prediction that... – Max Gordon Dec 02 '11 at 16:43
  • ... the human mind has over the years learned that failing to detect a predator is costlier than thinking it's probably nothing. I like that analogy and I try to tell my colleagues that one of the primary reasons for using statistics is to help us with this defect that we're all born with. – Max Gordon Dec 02 '11 at 16:47