Statistical inference is a powerful tool for drawing conclusions and making predictions about populations based on sample data. It allows us to make informed decisions and understand the effectiveness of different options. One popular application of statistical inference is A/B testing, where we compare two versions or treatments to determine the superior performer. However, what happens when we introduce more versions or treatments to the experiment?
It may seem that the introduction of additional versions in an experiment is an opportunity for even better decisions. Unfortunately, if not handled properly, the increased number of testable hypotheses can lead to misleading results and incorrect decisions. This challenge is known as the multiple comparisons problem.
In this article, I explain the concept of multiple hypothesis testing, describe its main pitfall, and present one possible solution, supported by a Python simulation.
To understand multiple hypothesis testing, let's begin by examining the fundamental concepts of a simple A/B test involving two variants.
In an A/B test, we start by formulating two competing hypotheses: the null hypothesis, which represents the absence of a difference between the variants, and the alternative hypothesis, suggesting the presence of a difference.
Then we set a significance level, denoted as alpha. This threshold determines the amount of evidence required to reject the null hypothesis. Commonly used significance levels are 0.05 (5%) and 0.01 (1%); they represent the probability of rejecting the null hypothesis when it is actually true (a Type I error) that we are willing to accept.
After running the experiment and collecting the data, we calculate p-values. The p-value represents the probability of obtaining a result as extreme as, or more extreme than, the observed data if the null hypothesis were true. If the p-value is less than the significance level alpha, we reject the null hypothesis in favor of the alternative hypothesis.
It is important to note that a low p-value suggests strong evidence against the null hypothesis, indicating that the observed data is unlikely to occur by chance alone. However, this does not imply certainty. There remains a non-zero probability of observing a difference between the samples even if the null hypothesis is true.
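To make these steps concrete, here is a minimal sketch of a two-variant test on made-up conversion counts, using a chi-squared test of independence from SciPy (the numbers and variable names are purely illustrative):
import numpy as np
from scipy.stats import chi2_contingency
# Hypothetical data: conversions and total visitors for variants A and B
conversions = np.array([120, 145])
visitors = np.array([2400, 2500])
# 2x2 contingency table: [converted, not converted] per variant
table = np.column_stack([conversions, visitors - conversions])
alpha = 0.05
chi2, p_value, dof, expected = chi2_contingency(table)
# Reject the null hypothesis (no difference between variants) if p-value < alpha
print(f'p-value: {p_value:.4f}, reject null hypothesis: {p_value < alpha}')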
When we encounter a situation with multiple alternative hypotheses, we refer to it as multiple hypothesis testing. In such cases, the complexity increases as we need to carefully consider the potential impact of conducting multiple tests simultaneously.
The pitfall of multiple hypothesis testing arises when we test multiple hypotheses without adjusting the significance level alpha. In such cases, we inadvertently inflate the rate of Type I errors: we reject a null hypothesis (detect a difference) when that null hypothesis is in fact true (there is no difference at all).
The more hypotheses we test simultaneously, the higher the chance of finding a p-value lower than alpha for at least one hypothesis and erroneously concluding a significant difference.
To illustrate this issue, consider a scenario where we want to test N hypotheses to determine which of multiple new webpage designs attracts more customers, with a desired alpha = 0.05. Let's assume we know that none of the new designs is better than the default one, meaning the null hypothesis holds in all N cases.
However, for each case there is a 5% probability (given that the null hypothesis is true) of committing a Type I error, or false positive; equivalently, there is a 95% probability of correctly not producing one. Assuming the tests are independent, the probability of at least one false positive among the N tests is 1 - (1 - alpha)^N = 1 - 0.95^N. For instance, when N = 10, this probability is approximately 40%, far higher than the initial 5%.
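A quick calculation (assuming independent tests) shows how fast this probability grows with the number of hypotheses:
alpha = 0.05
for n in [1, 3, 5, 10, 20]:
    prob_at_least_one_fp = 1 - (1 - alpha) ** n
    print(f'N = {n:2d}: P(at least one false positive) = {prob_at_least_one_fp:.1%}')
# N = 10 gives ~40.1%, and N = 20 already exceeds 64%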
The problem becomes more apparent as we increase the number of hypotheses tested. Even in scenarios where only a few variants and subsamples are involved, the number of comparisons can quickly accumulate. For example, comparing three designs D1, D2, and D3 for all users, then separately for users in country C1, and again for users in countries other than C1, results in a total of nine comparisons. It is easy to unwittingly engage in multiple comparisons without realizing the subsequent inflation of Type I error rates.
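Here is one way to enumerate that tally, assuming the nine comparisons come from the three pairwise design comparisons repeated within each of the three user groups (the names are illustrative):
from itertools import combinations
designs = ['D1', 'D2', 'D3']
segments = ['all users', 'country C1', 'outside C1']
# Every pair of designs, compared within every user segment
comparisons = [(a, b, seg) for seg in segments
               for a, b in combinations(designs, 2)]
for a, b, seg in comparisons:
    print(f'{a} vs {b} for "{seg}"')
print(f'Total comparisons: {len(comparisons)}')  # 3 pairs x 3 segments = 9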
Let's delve into an intriguing example that highlights the consequences of not controlling “Type I” errors when testing multiple hypotheses.
In 2009, a group of researchers conducted an fMRI scan on a dead Atlantic salmon and astonishingly discovered brain activity as if it were alive! To understand how this unexpected result came about, we need to explore the nature of fMRI scans.
fMRI scanners serve as extensive experimental platforms, where numerous tests are conducted on each patient. The scanners monitor changes in blood oxygenation as an indicator of brain activity. Researchers typically focus on specific regions of interest, so they partition the entire scanned volume into small cubes called voxels. Each voxel corresponds to a hypothesis test for the presence of brain activity within that particular cube. Because high-resolution scans are desirable, fMRI scanners end up evaluating thousands of hypotheses during a single procedure.
In this case, the researchers intentionally did not correct the initial significance level alpha = 0.001, and they identified three voxels showing “brain activity” in the deceased salmon. This outcome, of course, contradicts the fact that the salmon is dead. Once the significance level was properly corrected, the resurrection-like phenomenon disappeared, illustrating the importance of addressing Type I errors.
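A rough back-of-the-envelope calculation shows why a few “active” voxels are almost guaranteed to appear; the voxel count below is an assumption for illustration, not the number from the actual study:
alpha = 0.001        # uncorrected per-voxel significance level
n_voxels = 8000      # assumed number of voxels, purely illustrative
# Expected number of false-positive voxels when no true activity exists anywhere
expected_false_positives = alpha * n_voxels
print(f'Expected false-positive voxels: {expected_false_positives:.0f}')  # ~8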
Whenever multiple hypotheses are tested within a single experiment, it becomes crucial to tackle this challenge and control Type I errors. Various statistical techniques, such as the Bonferroni correction, can be employed to mitigate the inflated false positive rate associated with multiple hypothesis testing.
The Bonferroni correction is a statistical procedure specifically designed to address the challenge of multiple comparisons during hypothesis testing.
Suppose you need to test N hypotheses during an experiment while ensuring the probability of a Type I error remains below alpha.
The underlying idea of the procedure is straightforward: reduce the significance level required to reject the null hypothesis for each individual alternative hypothesis. Remember the formula for the probability of at least one false positive? It contains alpha, which can be decreased to diminish the overall probability.
So, to achieve a lower probability of at least one false positive among multiple hypotheses, you can compare each p-value not with alpha but with something smaller. But what exactly is "something smaller"?
It turns out that using bonferroni_alpha = alpha / N as the significance level for each individual hypothesis ensures that the overall probability of a Type I error remains below alpha.
For example, if you are testing 10 hypotheses (N = 10) and the desired significance level is 5% (alpha = 0.05), you should compare each individual p-value with bonferroni_alpha = alpha / N = 0.05 / 10 = 0.005. By doing so, the probability of erroneously rejecting at least one true null hypothesis will not exceed the desired level of 0.05.
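As a small illustration, here is how the corrected threshold could be applied to a set of hypothetical p-values (the values are made up):
import numpy as np
alpha = 0.05
p_values = np.array([0.003, 0.020, 0.047, 0.130, 0.650,
                     0.008, 0.091, 0.500, 0.030, 0.004])  # hypothetical p-values
N = len(p_values)
bonferroni_alpha = alpha / N   # 0.05 / 10 = 0.005
# Decisions without and with the correction
rejected_naive = p_values <= alpha
rejected_bonferroni = p_values <= bonferroni_alpha
print(f'Rejected without correction: {rejected_naive.sum()} of {N}')
print(f'Rejected with Bonferroni:    {rejected_bonferroni.sum()} of {N}')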
This remarkable technique works due to Boole's inequality, which states that the probability of the union of events is less than or equal to the sum of their individual probabilities. While a formal mathematical proof exists, visual explanations provide an intuitive understanding:
So, when each individual hypothesis is tested at the bonferroni_alpha = alpha / N significance level, each test has a bonferroni_alpha probability of a false positive. For N such tests, the probability of the union of the “false positive” events is less than or equal to the sum of the individual probabilities, which in the worst case, when the null hypothesis holds in all N tests, equals N * bonferroni_alpha = N * (alpha / N) = alpha.
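For readers who prefer the algebra to a picture, the same argument can be written in one line, with FP_i denoting a false positive in the i-th test:
\[
P\left(\bigcup_{i=1}^{N} \mathrm{FP}_i\right) \le \sum_{i=1}^{N} P(\mathrm{FP}_i) \le \sum_{i=1}^{N} \frac{\alpha}{N} = N \cdot \frac{\alpha}{N} = \alpha
\]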
To further support the concepts discussed earlier, let's conduct a simulation in Python. We will observe the outcomes and assess the effectiveness of the Bonferroni correction.
Consider a scenario where we have 10 alternative hypotheses within a single test, and assume that in all 10 cases the null hypothesis is true. Suppose a significance level of alpha = 0.05 is appropriate for the analysis. Without any correction for the inflated Type I error rate, we expect a theoretical probability of approximately 40% of seeing at least one false positive result. After applying the Bonferroni correction, we expect this probability not to exceed 5%.
A single experiment either contains at least one false positive or it does not; these probabilities only reveal themselves across many experiments. So let's run the simulation of the experiment 100 times and count how many experiments contain at least one false positive (a p-value below the significance level)!
You can find the .ipynb file with all the code to run this simulation and generate the graphs in my repository on GitHub: IgorKhomyanin/blog/bonferroni-and-salmon.
import numpy as np
import matplotlib.pyplot as plt
# To replicate the results
np.random.seed(20000606)
# Some hyperparameters to play with
N_COMPARISONS = 10
N_EXPERIMENTS = 100
# Sample p-values
# As we assume that the null hypothesis is true,
# the p-values are distributed uniformly on [0, 1]
sample = np.random.uniform(0, 1, size=(N_COMPARISONS, N_EXPERIMENTS))
# Probability of Type I error we are ready to accept
# Probability of rejecting the null hypothesis when it is actually true
alpha = 0.05
# Theoretical False Positive Rate
#
# 1.
# Probability that we conclude a significant difference for a given comparison
# is equal to alpha by definition in our setting of true null hypothesis
# Then [(1 - alpha)] is the probability of not rejecting the null hypothesis
#
# 2.
# As the comparisons are considered independent, the probability of not rejecting
# the null hypothesis in any of them is [(1 - alpha)]^N
#
# 3.
# Probability that at least one is a false positive is equal to
# 1 - (probability from 2.)
prob_at_least_one_false_positive = 1 - ((1 - alpha) ** N_COMPARISONS)
# Observed False Positive Rate
# An experiment contains a false positive when at least one p-value is less than alpha
false_positives_cnt = np.sum(np.sum(sample <= alpha, axis=0) > 0)
false_positives_share = false_positives_cnt / N_EXPERIMENTS
# Bonferroni correction
bonferroni_alpha = alpha / N_COMPARISONS
bonferroni_false_positive_comparisons_cnt = np.sum(np.sum(sample <= bonferroni_alpha, axis=0) > 0)
bonferroni_false_positive_comparisons_share = bonferroni_false_positive_comparisons_cnt / N_EXPERIMENTS
print(f'Theoretical False Positive Rate Without Correction: {prob_at_least_one_false_positive:0.4f}')
print(f'Observed False Positive Rate Without Correction: {false_positives_share:0.4f} ({false_positives_cnt:0.0f} out of {N_EXPERIMENTS})')
print(f'Observed False Positive Rate With Bonferroni Correction: {bonferroni_false_positive_comparisons_share:0.4f} ({bonferroni_false_positive_comparisons_cnt:0.0f} out of {N_EXPERIMENTS})')
# Output:
# Theoretical False Positive Rate Without Correction: 0.4013
# Observed False Positive Rate Without Correction: 0.4200 (42 out of 100)
# Observed False Positive Rate With Bonferroni Correction: 0.0300 (3 out of 100)
Here is a visualization of the results:
In the top picture, each square represents the p-value of an individual comparison (hypothesis test). The darker the square, the higher the p-value. Since we know that the null hypothesis holds in all cases, any significant result is a false positive.
When a p-value is lower than the significance level (the nearly white squares), we reject the null hypothesis and obtain a false positive result. The middle graph shows the experiments without correction, while the bottom graph shows the experiments with the Bonferroni correction; experiments with at least one false positive are colored in red.
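The exact plotting code is in the notebook linked above; a minimal sketch of how a figure like this could be drawn from the simulation (reusing sample, alpha, bonferroni_alpha, and N_EXPERIMENTS defined there) might look as follows:
# Three panels: p-value heatmap, then false-positive flags without and with correction
fig, axes = plt.subplots(3, 1, figsize=(12, 6),
                         gridspec_kw={'height_ratios': [4, 1, 1]})
# Top: p-values of each comparison (rows) across experiments (columns); darker = higher
axes[0].imshow(sample, aspect='auto', cmap='Greys')
axes[0].set_title('p-values per comparison')
for ax, level, title in [(axes[1], alpha, 'Without correction'),
                         (axes[2], bonferroni_alpha, 'With Bonferroni correction')]:
    # An experiment is flagged (red) if at least one of its p-values falls below the level
    has_false_positive = (sample <= level).any(axis=0)
    ax.imshow(has_false_positive[np.newaxis, :], aspect='auto',
              cmap='Reds', vmin=0, vmax=1)
    ax.set_title(f'{title}: {has_false_positive.sum()} of {N_EXPERIMENTS} experiments flagged')
    ax.set_yticks([])
plt.tight_layout()
plt.show()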
Clearly, the correction worked effectively. Without correction, we observe 42 experiments out of 100 with at least one false positive, which closely aligns with the theoretical ~40% probability. However, with the Bonferroni correction, we only have 3 experiments out of 100 with at least one false positive, staying well below the desired 5% threshold.
Through this simulation, we can visually observe the impact of the Bonferroni correction in mitigating the occurrence of false positives, further validating its usefulness in multiple-hypothesis testing.
In this article, I explained the concept of multiple hypothesis testing and highlighted its potential danger. When testing multiple hypotheses, such as conducting multiple comparisons during an A/B test, the probability of observing at least one false positive increases. If not appropriately addressed, this heightened probability can lead to erroneous conclusions about effects that in fact occurred by chance.
One possible solution to this issue is the Bonferroni correction, which adjusts the significance levels for each individual hypothesis. By leveraging Boole's inequality, this correction helps control the overall significance level at the desired threshold, reducing the risk of false positives.
However, it's important to recognize that every solution comes at a cost. With the Bonferroni correction, the per-hypothesis significance level decreases, which reduces the statistical power of the experiment: a larger sample size or a stronger effect may be necessary to detect the same differences. Researchers must carefully weigh this trade-off when deciding whether to apply the Bonferroni correction or another correction method.
If you have any questions or comments regarding the content discussed in this article, please do not hesitate to share them. Engaging in further discussion and exploration of statistical techniques is essential for better understanding and application in practice.