How Companies Decide Which Product Changes to Keep or Scrap

by AB Test, March 30th, 2025

Too Long; Didn't Read

Extending A/B test decision rules with deterioration and quality metrics helps catch regressions and ensures experiment integrity. Spotify’s Decision Rule 2 formalizes this approach.


Abstract and 1 Introduction

1.1 Related literature

  2. Types of Metrics and Their Hypotheses and 2.1 Types of metrics

    2.2 Hypotheses for different types of metrics

  3. Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests

    3.1 The composite hypotheses of superiority and non-inferiority tests

    3.2 Bounding the type I and type II error rates for UI and IU testing

    3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

    3.4 Power corrections for non-inferiority testing

  4. Extending the Decision Rule with Deterioration and Quality Metrics

  5. Monte Carlo Simulation Study

    5.1 Results

  6. Discussion and Conclusions


APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS


Acknowledgments and References

4. EXTENDING THE DECISION RULE WITH DETERIORATION AND QUALITY METRICS

Deterioration tests are inferiority tests that aim to capture regressions in metrics. They can be applied to metrics already present as success or guardrail metrics, or to metrics that are included only as deterioration metrics. The test checks whether the metric is significantly deteriorating, that is, whether the treatment group is significantly inferior to the control group with respect to the metric of interest. Deterioration tests for success and guardrail metrics attempt to identify significant regressions, which, if they exist, would speak against the success of the experiment. Neither the superiority test used for success metrics nor the non-inferiority test used for guardrail metrics would on its own indicate a regression. Knowing when regressions occur is essential when running experiments.
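As a concrete illustration, a deterioration test can be set up as a one-sided two-sample test of whether the treatment mean is significantly worse than the control mean. The sketch below is a minimal example of that idea; the function name, the z-test formulation, and the default alpha are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a deterioration (inferiority) test as a one-sided
# two-sample z-test; names and defaults are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def deterioration_test(control, treatment, alpha=0.01, higher_is_better=True):
    """Return True if the treatment group is significantly worse than control."""
    control, treatment = np.asarray(control), np.asarray(treatment)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / len(control)
                 + treatment.var(ddof=1) / len(treatment))
    z = diff / se
    if higher_is_better:
        # Deterioration means the metric dropped: reject if z is far below zero.
        return norm.cdf(z) < alpha
    # For "lower is better" metrics, deterioration means a significant increase.
    return norm.sf(z) < alpha
```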


In practice, in addition to success, guardrail, and deterioration metrics, experiment-quality metrics are also used. For example, at Spotify, we include a set of tests and metrics that evaluate the quality of the experiment. These include a test for balanced traffic through a sample ratio mismatch test, and a test for pre-exposure bias. Including deterioration and quality tests in the decision rule increases the complexity of managing the risks of an incorrect decision. Decision Rule 2 formalizes the complete decision rule used at Spotify.
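For instance, a sample ratio mismatch check is commonly implemented as a chi-square goodness-of-fit test of the observed group sizes against the intended traffic split. A minimal sketch follows; the p-value threshold and function signature are assumptions for illustration and not necessarily what is used at Spotify.

```python
# Illustrative sample ratio mismatch (SRM) check via a chi-square
# goodness-of-fit test; the threshold is an assumed convention.
from scipy.stats import chisquare

def srm_test(n_control, n_treatment, expected_ratio=0.5, threshold=1e-3):
    total = n_control + n_treatment
    expected = [total * (1 - expected_ratio), total * expected_ratio]
    _, p_value = chisquare([n_control, n_treatment], f_exp=expected)
    return p_value < threshold  # True indicates a quality problem
```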


DECISION RULE 2. Ship the change if and only if:


  1. at least one success metric is significantly superior


  2. all guardrail metrics are significantly non-inferior


  3. none of the success, guardrail or deterioration metrics are significantly inferior


  4. none of the quality tests are significantly rejecting quality


That is, ship if and only if the global superiority/non-inferiority null hypothesis is rejected in favor of the alternative hypothesis, the inferiority null hypothesis is not rejected for any metric, and no quality test is significant.
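Operationally, Decision Rule 2 is a boolean combination of the individual test outcomes. The sketch below illustrates that combination once each test has already been evaluated; the data structure and field names are hypothetical.

```python
# Minimal sketch of Decision Rule 2 as a boolean combination of
# per-metric test outcomes; field names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class TestOutcomes:
    success_superior: List[bool]          # superiority tests, one per success metric
    guardrail_non_inferior: List[bool]    # non-inferiority tests, one per guardrail metric
    deterioration_rejected: List[bool]    # inferiority tests across monitored metrics
    quality_rejected: List[bool]          # e.g. SRM or pre-exposure-bias tests

def ship(outcomes: TestOutcomes) -> bool:
    return (
        any(outcomes.success_superior)                # 1. at least one success metric superior
        and all(outcomes.guardrail_non_inferior)      # 2. all guardrails non-inferior
        and not any(outcomes.deterioration_rejected)  # 3. no significant deterioration
        and not any(outcomes.quality_rejected)        # 4. no quality test rejects
    )
```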


Proposition 4.1 gives the corresponding correction that bounds the error rates of Decision Rule 2.




This directly implies that



In other words, the effect of the deterioration tests can only make the false positive rate lower than α. This concludes the first part of the proof.




Plugging this into the above yields



To find the corrected β for achieving the intended type II error rate, we solve




The condition that α_ < β should be interpreted as follows: it is only possible to correct the false negative rate (and thus ensure the intended power) for decision rules that include deterioration and/or quality tests as long as the intended false negative rate for the decision is larger than the intended false positive rate for those tests. This is quite natural: if we add a large enough chance of rejecting no deterioration or quality, even when there is no deterioration or problem with the quality, this will at some point (for some α_) limit our ability to reach a positive decision, regardless of the sample size.
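To see why, consider a hedged numeric illustration. Assuming the deterioration and quality tests are independent and each is run at a false positive rate alpha_d, then even when nothing has deteriorated and the quality is fine, the probability that none of them falsely rejects is (1 - alpha_d)^D, which caps the positive-decision rate no matter how large the sample is.

```python
# Hedged numeric illustration (assumes D independent deterioration/quality
# tests, each with false positive rate alpha_d): the decision can pass all
# of them with probability at most (1 - alpha_d)**D, regardless of sample size.
D, alpha_d = 10, 0.01
ceiling = (1 - alpha_d) ** D
print(f"Max achievable positive-decision rate: {ceiling:.3f}")  # ~0.904
```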


For success and guardrail metrics, there is a dependency between the rejection of the deterioration test and the superiority or non-inferiority test for any given metric, respectively. That is, if the superiority or non-inferiority test is rejected, this affects the probability that the deterioration test is also rejected. It is possible to utilize this dependency to improve the efficiency of Proposition 4.1 slightly by making additional assumptions about the relation between α, α_, and β. See Appendix A for details.


Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.


This paper is available on arxiv under CC BY 4.0 DEED license.

