## 1. Motivating Example

Jegadeesh and Titman (1993) show that, if you rank stocks according to their returns over the previous 6 months, then the past winners will outperform the past losers by roughly 1% per month over the next 6 months. But the authors don't just test this particular strategy. They also test strategies that rank stocks over the previous 3, 9, and 12 months and strategies that hold stocks for the next 3, 9, and 12 months, too. Clearly, if they test enough hypotheses, then some of these tests are going to appear statistically significant by pure chance. To address this concern in the original paper, the authors use the Bonferroni method.

This post shows how to use an alternative method—namely, controlling the false-discovery rate—to identify statistically significant results when testing multiple hypotheses.

## 2. Bonferroni Method

First, here's the logic behind the Bonferroni method. Suppose you want to run $N$ different hypothesis tests, $n = 1, \ldots, N$. Let $h_n = 0$ if the $n$th null hypothesis is true and $h_n = 1$ if the $n$th null hypothesis is false (i.e., should be rejected). Let $p_n$ denote the p-value associated with some test statistic for the $n$th hypothesis. If there were just one test, then we should simply reject the null whenever $p_n \leq \alpha$. But, if there are many tests, then this no longer works. If you look at a lot of hypotheses, then roughly $\alpha \cdot N$ of the p-values should be less than $\alpha$ even if $h_n = 0$ for all of them. The Bonferroni method suggests correcting this problem by lowering the p-value associated with the $\alpha$ significance level and only rejecting the $n$th null hypothesis when

$$
p_n \leq \frac{\alpha}{N},
\tag{1}
$$

i.e., if there are $N$ hypothesis tests, then only reject the null at the $\alpha$ significance level when the p-value is less than $\alpha / N$ rather than $\alpha$.

This is a nice start, but it turns out that the Bonferroni method is way too strict. Imagine drawing $N$ samples of $T$ observations each from $N$ different normal distributions. All samples have the same standard deviation, $\sigma$, but not all of the samples have the same mean: a fraction $\pi$ of them have a mean of $0$, and the rest have a mean of $\mu > 0$. The figure below shows that, if we use the Bonferroni method to identify which of the samples have a non-zero mean, then we're only going to choose a small handful of samples. But, by construction, we know that $(1 - \pi) \cdot N$ samples had a non-zero mean! We should be rejecting the null many times more often!
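A quick simulation makes the point concrete. The sample counts, means, and significance level below are illustrative choices, not the ones behind the figure:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

N, T = 1_000, 100      # number of samples, observations per sample
sigma, mu = 1.0, 0.35  # common standard deviation; mean of the non-null samples
pi = 0.50              # fraction of samples whose true mean is zero
alpha = 0.05

# Draw N samples: the first pi * N have mean 0, the rest have mean mu.
n_null = int(pi * N)
means = np.r_[np.zeros(n_null), np.full(N - n_null, mu)]
data = rng.normal(means[:, None], sigma, size=(N, T))

# Two-sided z-test of H0: mean = 0 for each sample (sigma known).
z = data.mean(axis=1) / (sigma / np.sqrt(T))
pvals = 2 * norm.sf(np.abs(z))

naive = (pvals <= alpha).sum()           # reject at the alpha level
bonferroni = (pvals <= alpha / N).sum()  # reject at the Bonferroni level

print(f"truly non-null: {N - n_null}")
print(f"naive rejections: {naive}")
print(f"Bonferroni rejections: {bonferroni}")
```

With these parameters the Bonferroni cutoff, $\alpha / N = 0.00005$, rejects only a fraction of the truly non-null samples, even though the naive $\alpha = 0.05$ cutoff finds nearly all of them (at the cost of some false positives).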

## 3. False-Discovery Rate

Now, let's talk about false-discovery rates. Define $R$ as the total number of hypotheses that you reject at the $\alpha$ significance level. Similarly, define $F$ as the number of hypotheses that you reject at the $\alpha$ significance level where the null was actually true (i.e., these are false rejections). The false-discovery rate is then

$$
\mathrm{FDR} = F / R.
\tag{2}
$$

Let's return to the numerical example above to clarify this definition. Suppose we had a test that identified all $(1 - \pi) \cdot N$ cases where the sample mean was $\mu$, so that $R = (1 - \pi) \cdot N$. If we wanted a false-discovery rate $\mathrm{FDR} \leq q$, then this test could produce at most $q \cdot (1 - \pi) \cdot N$ false rejections, $F \leq q \cdot (1 - \pi) \cdot N$. If the test identified only half of the cases where the sample mean was $\mu$, $R = (1 - \pi) \cdot N / 2$, then it could produce at most half as many false rejections.
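In code, the realized false-discovery rate is just the fraction of rejections that were mistakes. Here's a toy example with made-up truth labels and rejection decisions:

```python
import numpy as np

# Hypothetical example: h[n] = 0 means the nth null is true, 1 means it is false.
h      = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
reject = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])  # our rejection decisions

R = reject.sum()                      # total rejections
F = ((reject == 1) & (h == 0)).sum()  # false rejections: null true, but rejected
print(f"R = {R}, F = {F}, FDR = {F / R:.3f}")
```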

Benjamini and Hochberg (1995) first introduced the idea that you could use the false-discovery rate to adjust statistical-significance tests when exploring multiple hypotheses. Here’s their recipe. First, run all of your tests and order the resulting p-values,

$$
p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(N)}.
\tag{3}
$$

Then, for a given false-discovery rate, $q$, define $n^\star$ as

$$
n^\star = \max \left\{ \, n : \, p_{(n)} \leq (n/N) \cdot q \, \right\}.
\tag{4}
$$

Benjamini and Hochberg (1995) showed that, if you reject any null hypothesis where

$$
p_n \leq p_{(n^\star)},
\tag{5}
$$

then $\mathrm{FDR} \leq q$, guaranteed. If we apply the false-discovery-rate procedure to the same numerical example from above using this threshold, then we see that, as the number of hypotheses gets large, $N \to \infty$, the fraction of rejected null hypotheses hovers around a constant level. Improvement! It's no longer shrinking to $0$. Notice that neither method allows us to pick out the full $(1 - \pi)$ fraction of null hypotheses that should be rejected in the simulation.
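The recipe in equations (3) through (5) takes only a few lines to implement. Here's a sketch; the p-values at the bottom are made up for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """Return a boolean mask marking which hypotheses to reject at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    N = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest n (1-indexed) with p_(n) <= (n / N) * q.
    below = sorted_p <= np.arange(1, N + 1) / N * q
    reject = np.zeros(N, dtype=bool)
    if below.any():
        n_star = below.nonzero()[0].max() + 1  # n* in the text
        reject[order[:n_star]] = True          # reject the n* smallest p-values
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.818]
print(benjamini_hochberg(pvals, q=0.10).sum())  # 6 rejections
```

Note that $p_{(3)} = 0.039$ and $p_{(4)} = 0.041$ both sit above their own thresholds, yet they are still rejected because a later ordered p-value, $p_{(6)} = 0.060$, clears its threshold of $(6/10) \cdot 0.10$.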

## 4. Why’s This So?

It's pretty clear how the Bonferroni method works. If there are lots of hypotheses and you are worried about rejecting the null by pure chance, then just make it harder to reject the null, i.e., lower the threshold for significant p-values. It's much less clear, though, how the false-discovery-rate screening process works. All you get is a recipe and a guarantee. If you do the list of things prescribed by Benjamini and Hochberg (1995), then no more than a fraction $q$ of the null hypotheses you reject will be false rejections. Let's explore the proof of this result from Storey, Taylor, and Siegmund (2003) to better understand where this guarantee comes from.

What does it mean to have a useful test? Well, if $h_n = 0$ (null is true), then $p_n$ is drawn randomly from a uniform distribution, $p_n \sim \mathrm{U}[0,1]$. However, if you have a useful test statistic and $h_n = 1$ (null is false), then $p_n$ is drawn from some other distribution, $f_1(p)$, that is more concentrated around $p = 0$. The distribution of p-values is then given by

$$
f(p) = \pi \cdot 1 + (1 - \pi) \cdot f_1(p),
\qquad p \in [0, 1],
\tag{6}
$$

where $\pi$ denotes the fraction of true nulls. If we reject all p-values less than some threshold $t$, then with a little bit of algebra we can see that

$$
\begin{aligned}
\mathrm{FDR}(t)
&= \mathrm{Pr}[\, h_n = 0 \mid p_n \leq t \,] + O(1/N)
\\
&= \frac{\pi \cdot \mathrm{Pr}[\, p_n \leq t \mid h_n = 0 \,]}{\mathrm{Pr}[\, p_n \leq t \,]} + O(1/N)
\\
&\leq \frac{\mathrm{Pr}[\, p_n \leq t \mid h_n = 0 \,]}{\mathrm{Pr}[\, p_n \leq t \,]} + O(1/N),
\end{aligned}
\tag{7}
$$

where the last step uses the fact that $\pi \leq 1$, and $O(\cdot)$ denotes "big-O" notation. Since p-values are drawn from a uniform distribution when the null hypothesis is true, we know that $\mathrm{Pr}[\, p_n \leq t \mid h_n = 0 \,] = t$. Thus, we can simplify even further:

$$
\mathrm{FDR}(t) \leq \frac{t}{\mathrm{Pr}[\, p_n \leq t \,]} + O(1/N).
\tag{8}
$$

Now comes the trick. $t$ can be anything we want between $0$ and $1$. So, let's choose $t$ as one of the ordered p-values, $t = p_{(n)}$. If we do this, then the fraction of p-values less than or equal to $p_{(n)}$ is exactly $n/N$, so $\mathrm{Pr}[\, p_n \leq p_{(n)} \,] \approx n/N$ and

$$
\mathrm{FDR}(p_{(n)}) \leq \frac{N \cdot p_{(n)}}{n} + O(1/N).
\tag{9}
$$

If we set the right-hand side equal to the false-discovery-rate tolerance, $q$, ignore the $O(1/N)$ term, and solve for $p_{(n)}$, then we get the threshold value for $p_{(n)}$ in Benjamini and Hochberg (1995),

$$
p_{(n)} \leq (n/N) \cdot q.
\tag{10}
$$

If we only reject hypotheses where $p_n \leq p_{(n^\star)}$, then our false-discovery rate is capped at $q$.
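The guarantee itself can be checked by Monte Carlo. The sketch below (illustrative parameters) repeatedly applies the Benjamini and Hochberg (1995) cutoff to simulated data and confirms that the realized $F/R$ stays below $q$ on average:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, T, mu, q, pi = 500, 100, 0.40, 0.10, 0.50  # pi = fraction of true nulls

realized = []
for _ in range(200):
    alt = rng.random(N) >= pi  # True where the null is false (mean is mu)
    x = rng.normal(np.where(alt, mu, 0.0)[:, None], 1.0, size=(N, T))
    p = 2 * norm.sf(np.abs(x.mean(axis=1) * np.sqrt(T)))
    # Benjamini-Hochberg: reject the n* smallest p-values.
    s = np.sort(p)
    below = s <= np.arange(1, N + 1) / N * q
    n_star = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = p <= s[n_star - 1] if n_star else np.zeros(N, dtype=bool)
    F = (reject & ~alt).sum()  # false rejections
    realized.append(F / max(reject.sum(), 1))

print(f"average realized FDR: {np.mean(realized):.3f} (target q = {q})")
```

The average lands near $\pi \cdot q$ rather than at $q$ itself, which is exactly the slack introduced when the proof drops the factor $\pi \leq 1$.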