Spotify’s Approach to Multi-Metric A/B Testing Decisions

by AB Test, March 31st, 2025

Too Long; Didn't Read

Spotify’s decision rule framework for A/B testing combines superiority, non-inferiority, deterioration, and quality tests into a single ship/no-ship decision while keeping experiment-level type I and type II error rates under control.


Abstract and 1 Introduction

  1.1 Related literature

2. Types of Metrics and Their Hypotheses

  2.1 Types of metrics

  2.2 Hypotheses for different types of metrics

3. Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests

  3.1 The composite hypotheses of superiority and non-inferiority tests

  3.2 Bounding the type I and type II error rates for UI and IU testing

  3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

  3.4 Power corrections for non-inferiority testing

4. Extending the Decision Rule with Deterioration and Quality Metrics

5. Monte Carlo Simulation Study

  5.1 Results

6. Discussion and Conclusions


APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFECTIVE NUMBER OF INDEPENDENT TESTS


Acknowledgments and References

6. DISCUSSION AND CONCLUSIONS

In this paper, we introduce a decision rule framework that uses the results of multiple frequentist hypothesis tests in an experiment to ultimately make one single decision. The primary area of application of our theory is online controlled experiments, known as A/B tests, that technology companies use to run randomized controlled trials at scale to inform their product decision making. Decision rules help move the focus from the results of individual hypothesis tests to what really matters in the end: the overall conclusion of whether the tested change is good enough to release more widely. By carefully tailoring decision rules, modern-day experimentation platforms can use them to standardize decision making. For example, experimentation platforms can enforce that a new change should only be released widely if there is evidence that the change does not negatively impact important business metrics. Our framework clearly lays out the appropriate adjustments to the experiment design to ensure that the tolerated type I and type II error rates are not exceeded at the experiment level. Experimentation is primarily about understanding and managing risks, and with decision rules we put the risks that are relevant to the business front and center.


TABLE 3. Simulation results of the type I error rates under the null hypotheses of the non-inferiority and superiority tests, respectively. The rejection rates of the deterioration, non-inferiority, and superiority tests are presented along with the rejection rate of Decision Rule 2.


TABLE 4. Simulation results of the rejection rates under the alternative hypothesis of the non-inferiority tests and the null hypothesis of the superiority tests. The rejection rates of the deterioration, non-inferiority, and superiority tests are presented along with the rejection rate of Decision Rule 2.


TABLE 5. Simulation results of the rejection rates under the alternative hypotheses of the non-inferiority and superiority tests, respectively. The rejection rates of the deterioration, non-inferiority, and superiority tests are presented along with the rejection rate of Decision Rule 2.


Our decision rules use measured outcomes, known as metrics, that we classify into success metrics, guardrail metrics, deterioration metrics, and quality metrics. Success metrics are metrics where we want to see an improvement, and guardrail metrics are metrics where we want to make sure no regression happens. We show that for a decision rule that includes both success metrics and guardrail metrics, we only need to adjust the type I error α used in the tests for the number of success metrics. This result stands in contrast to the wide use of multiple-correction methods, which generally do not take into account the role of the metrics they adjust for. For guardrail metrics, we introduce the concept of β, or type II error, corrections. We show that decision rules may be grossly under-powered unless the type II error that the experiment is designed to achieve is corrected for the number of guardrail metrics. Finally, we add deterioration tests and quality tests, such as sample ratio mismatch and tests for pre-exposure bias, to the decision rule, and show how these tests affect the type I and type II errors. We combine all of our findings into the decision rule that is used by Spotify’s experimentation platform, which includes tests for superiority, non-inferiority, inferiority, and quality. In a Monte Carlo simulation study, we illustrate how the design that our decision rule implies bounds the risks appropriately under various data-generating processes.
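As a concrete illustration of how these pieces might fit together, the sketch below combines the four groups of tests into a single ship/no-ship decision, with a Bonferroni-style α correction applied only to the success metrics and a β correction reserved for sample-size planning of the guardrail metrics. This is a minimal sketch under our own assumptions, not the exact rule specified in the paper: the function names, the default α, the use of the unadjusted α for the guardrail, deterioration, and quality tests, and the "at least one success metric superior" aggregation are all illustrative choices.

```python
def per_test_alpha(alpha_overall, n_success_metrics):
    # Bonferroni-style type I error correction: only the success metrics
    # (tested for superiority) need an adjusted alpha in the decision rule.
    return alpha_overall / max(n_success_metrics, 1)


def per_test_beta(beta_overall, n_guardrail_metrics):
    # Bonferroni-style type II error correction used at design time: size each
    # guardrail's non-inferiority test for beta_overall / n_guardrail_metrics
    # so the decision rule as a whole is not grossly under-powered.
    return beta_overall / max(n_guardrail_metrics, 1)


def ship_decision(p_superiority, p_non_inferiority, p_deterioration, p_quality,
                  alpha=0.05):
    """Illustrative multi-metric ship/no-ship rule.

    p_superiority      -- p-values of superiority tests for success metrics
    p_non_inferiority  -- p-values of non-inferiority tests for guardrail metrics
    p_deterioration    -- p-values of inferiority tests for deterioration metrics
    p_quality          -- p-values of quality tests (e.g. sample ratio mismatch)
    """
    alpha_success = per_test_alpha(alpha, len(p_superiority))

    at_least_one_superior = any(p < alpha_success for p in p_superiority)
    all_guardrails_non_inferior = all(p < alpha for p in p_non_inferiority)
    no_deterioration_detected = not any(p < alpha for p in p_deterioration)
    quality_checks_pass = not any(p < alpha for p in p_quality)

    return (at_least_one_superior and all_guardrails_non_inferior
            and no_deterioration_detected and quality_checks_pass)
```

Under this reading, the superiority tests form a union ("at least one success metric improves"), so their α must be shared, whereas the guardrail non-inferiority tests form an intersection ("every guardrail passes"), which leaves α untouched but erodes power unless β is split across them. That asymmetry is what motivates adjusting α only for the success metrics and introducing β corrections for the guardrails.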


There are many ways to construct decision rules incorporating many metrics, and many ways to control the type I and II errors for them. The decision rules and their corresponding α and β corrections that we present in this paper are motivated mainly by simplicity. The straightforward Bonferroni-like corrections are easy to implement, and importantly, easy to explain and understand for most experimenters. In our experience, simplicity is often more important than finding the statistically optimal solution when it comes to experimentation methods that are to be used by many experimenters in large companies.


Two important drawbacks of more sophisticated multiple-correction methods, e.g., Holm and Hommel, are that they tend to complicate sample size calculations and that they do not support straightforward calculation of confidence intervals, if they support it at all. The net effect of using these more sophisticated correction methods is, in our experience, negative, as the increased complexity leads to poorer planning of experiments and results that are harder to interpret. An appealing compromise for A/B tests is Nyholt’s method, which adjusts for the effective number of independent tests [14]. In Section D in the Appendix, we discuss how Nyholt’s method can be used and, similarly to [9], we find that for a small number of metrics (5 success and 5 guardrail) the improvement in efficiency is marginal unless the correlation is strong. However, the improvement becomes substantial when the number of correlated metrics is large. In addition, methods other than sophisticated multiple-testing corrections can help increase efficiency, for example multivariate variance reduction via regression adjustment.
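As a rough illustration of the kind of adjustment Nyholt’s method yields, the snippet below computes the effective number of independent tests, M_eff, from the eigenvalues of the metrics’ correlation matrix and uses it in place of the raw metric count in a Bonferroni-style correction. The equicorrelated example matrix and the choice of a plain α / M_eff adjustment are assumptions made purely for illustration.

```python
import numpy as np

def effective_number_of_tests(corr):
    """Nyholt's (2004) effective number of independent tests, M_eff.

    corr : (M, M) correlation matrix of the metrics.
    """
    m = corr.shape[0]
    eigenvalues = np.linalg.eigvalsh(corr)      # eigenvalues of the correlation matrix
    var_lambda = np.var(eigenvalues, ddof=1)    # observed variance of the eigenvalues
    return 1 + (m - 1) * (1 - var_lambda / m)

# Hypothetical example: 5 success metrics with pairwise correlation 0.8.
m, rho = 5, 0.8
corr = np.full((m, m), rho)
np.fill_diagonal(corr, 1.0)

m_eff = effective_number_of_tests(corr)
alpha = 0.05
print(f"M_eff = {m_eff:.2f}")                   # ~2.44 instead of 5
print(f"alpha per test = {alpha / m_eff:.4f}")  # ~0.0205 instead of 0.01
```

With weak correlation M_eff stays close to the number of metrics and the correction is essentially Bonferroni; only when the metrics are strongly correlated does the per-test α become noticeably less conservative, which is consistent with the observation above.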


For success metrics, it is possible to replace the inferiority tests, which are used in addition to the superiority tests, with non-inferiority tests. Strictly speaking, failing to reject the null of the inferiority test provides no evidence for the null hypothesis, and thus a more formal approach would be to require non-inferiority of all success metrics too. The reason why inferiority tests are proposed here instead of non-inferiority tests is pragmatic. Using non-inferiority tests would require specifying a non-inferiority margin in addition to the hypothetical effect, which would complicate the experiment setup for non-technical experimenters. Any deterioration metric could be replaced by a guardrail metric if the higher standard of non-inferiority tests is desired. Again, we find that having a few important business metrics as automatic deterioration metrics is a reasonable compromise: the experimenter does not need to configure anything for these metrics (no non-inferiority margin), but the company will still detect major regressions in them. This is a great example of the trade-off between rigour and not adding too much friction to product development.
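To make the distinction concrete, here is a minimal sketch, with hypothetical numbers, of the two one-sided tests discussed above: an inferiority test, which only flags a metric if there is evidence it got worse, and a non-inferiority test, which requires a pre-specified margin and actively demonstrates that any regression is smaller than that margin. The effect estimate, standard error, and margin are made up for illustration, and the metric is assumed to be "larger is better".

```python
from scipy import stats

def one_sided_z_pvalue(effect_estimate, std_error, null_shift=0.0):
    """p-value for H0: delta <= null_shift versus H1: delta > null_shift."""
    z = (effect_estimate - null_shift) / std_error
    return 1 - stats.norm.cdf(z)

# Hypothetical estimate: +0.1% change with standard error 0.3%.
est, se = 0.001, 0.003

# Inferiority test (deterioration check): H0: delta >= 0 vs H1: delta < 0.
# The metric is flagged only if there is evidence it got worse.
p_inferiority = stats.norm.cdf(est / se)

# Non-inferiority test (guardrail check) with a 0.5% margin:
# H0: delta <= -0.005 vs H1: delta > -0.005, i.e. we must show that any
# regression is smaller than the pre-specified margin.
p_non_inferiority = one_sided_z_pvalue(est, se, null_shift=-0.005)

print(f"inferiority p-value:     {p_inferiority:.3f}")      # ~0.63, not rejected
print(f"non-inferiority p-value: {p_non_inferiority:.3f}")  # ~0.02, rejected
```

With these numbers the inferiority test simply fails to reject, which by itself is weak evidence, while the non-inferiority test rejects its null and therefore provides positive evidence that the metric has not regressed by more than the margin, at the cost of having to specify that margin up front.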


In our experience, it is always the case that there is information outside the test that decision makers include in their product decision. However, to benefit from A/B testing as a risk management apparatus, we think it is important to move as much as possible into the experiment and explicitly define the decision rule with all the possible aspects in mind. As this paper shows, including caveats for shipping in terms of quality and deterioration tests affects the error rates of the decision. In practice, experimenters hesitate to include certain metrics as guardrail metrics because the sample size needed to power the experiment is too high. If these excluded metrics are still used in the decision making, the only effect of the exclusion is poorer management of risk. Including as much as possible in the experiment and its decision rule means a better understanding of the risks you are facing for the decision you ultimately need to make.


Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.


This paper is available on arXiv under a CC BY 4.0 DEED license.

