Ensuring Reliable A/B Test Decisions with Guardrail Metrics

by AB Test, March 30th, 2025

Too Long; Didn't Read

A/B testing decisions must balance type I and type II errors. This section explains the union-intersection (UI) and intersection-union (IU) testing principles, Bonferroni corrections, and power adjustments for non-inferiority testing.


Abstract and 1 Introduction

1.1 Related literature

  2. Types of Metrics and Their Hypothesis and 2.1 Types of metrics

    2.2 Hypotheses for different types of metrics

  3. Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests

    3.1 The composite hypotheses of superiority and non-inferiority tests

    3.2 Bounding the type I and type II error rates for UI and IU testing

    3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

    3.4 Power corrections for non-inferiority testing

  4. Extending the Decision Rule with Deterioration and Quality Metrics

  5. Monte Carlo Simulation Study

    5.1 Results

  6. Discussion and Conclusions


APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS


Acknowledgments and References

3.2 Bounding the type I and type II error rates for UI and IU testing

Because the at-least-one testing for superiority is based on the union-intersection (UI) testing principle, individual tests must be carried out at an adjusted significance level to ensure that the family-wise error rate is bounded by the nominal significance level. On the other hand, no adjustment is necessary to the nominal power level used when designing the experiment: the at-least-one testing principle implies that the composite test has a family-wise type II error rate that is bounded from above by the nominal level. For the all-or-none testing used to establish non-inferiority, which follows the intersection-union (IU) principle, it is not necessary to apply a multiplicity adjustment to the individual significance level used in the tests.
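
The UI bounds described above can be sketched with a standard union-bound argument. This is a hedged reconstruction, not the authors' stated proof: the symbols M (number of individual tests), H_{0,i} and H_{1,i} (individual null and alternative hypotheses), and the Bonferroni level α/M are assumptions made for illustration.

```latex
% Sketch only. Assumes M individual tests, each run at level \alpha / M
% and powered to 1 - \beta. Requires amsmath for \text.

% Type I error: all individual nulls H_{0,i} are true.
\[
\Pr(\text{reject at least one } H_{0,i})
  \le \sum_{i=1}^{M} \Pr(\text{reject } H_{0,i})
  \le \sum_{i=1}^{M} \frac{\alpha}{M}
  = \alpha .
\]

% Type II error: some alternative H_{1,j} is true. Rejecting nothing
% requires, in particular, failing to reject H_{0,j}.
\[
\Pr(\text{reject no } H_{0,i})
  \le \Pr(\text{do not reject } H_{0,j})
  \le \beta .
\]
```

The first bound is why the significance level needs an adjustment in the UI case, and the second is why the nominal power level can be used unchanged.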





Contrary to the UI case, only the type II error rate must be adjusted for tests that rely on the IU principle. We formalize this in Lemma 3.2.


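LEMMA 3.2 (Significance and power level adjustments for intersection-union testing). Consider an intersection-union testing problem of M hypotheses of the form

$$H_0 : \bigcup_{i=1}^{M} H_{0,i} \quad \text{versus} \quad H_1 : \bigcap_{i=1}^{M} H_{1,i}.$$

If each individual test is carried out at significance level α and power level 1 − β/M, then the type I error rate of the overall test is bounded from above by α and its type II error rate is bounded from above by β.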

The results of the lemmas can be understood intuitively by considering where multiple chances arise. In the UI testing setting of Lemma 3.1, it suffices that at least one individual null hypothesis is rejected for the overall hypothesis to be rejected. Since each individual hypothesis H_i constitutes one chance of rejecting the overall hypothesis, we must correct for these multiple possibilities. On the other hand, in the IU testing case of Lemma 3.2, all individual null hypotheses must be rejected for the overall hypothesis to be rejected. Equivalently stated, there are multiple chances of not rejecting, and this is what we must correct for. Therefore, to ensure that the power of the IU test is bounded from below by the desired nominal level, the type II error rate β used in each individual test is multiplicity adjusted using a standard Bonferroni correction, so that each individual test is powered to 1 − β/M.
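
The contrast between the two lemmas can be summarized in a few lines of Python. This is a minimal sketch: the function names `ui_levels` and `iu_levels` are hypothetical, and the Bonferroni form α/M for the UI case is assumed to be the adjustment intended by Lemma 3.1.

```python
def ui_levels(alpha: float, beta: float, m: int) -> tuple[float, float]:
    """Per-test (significance level, power) for an at-least-one (UI) test.

    Multiple chances of rejecting: adjust alpha, leave power unchanged.
    Assumes a Bonferroni adjustment alpha / m.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return alpha / m, 1.0 - beta


def iu_levels(alpha: float, beta: float, m: int) -> tuple[float, float]:
    """Per-test (significance level, power) for an all-or-none (IU) test.

    Multiple chances of not rejecting: adjust beta, leave alpha unchanged.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return alpha, 1.0 - beta / m
```

For example, with α = 0.05, β = 0.2 and M = 4, the UI case tests each hypothesis at level 0.0125 with 80 percent power, while the IU case tests each at level 0.05 with 95 percent power.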

3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

The decision rule we use combines superiority and non-inferiority testing, and appropriate adjustments to individual-level tests follow immediately from the results of the previous section. Before we discuss these adjustments, we first describe the decision rule.


DECISION RULE 1. Ship the change if and only if:


  1. at least one success metric is significantly superior


  2. all guardrail metrics are significantly non-inferior.


That is, ship if and only if the global superiority/non-inferiority null hypothesis is rejected in favor of the alternative hypothesis.


We can now establish how to adjust Decision Rule 1 to ensure that family-wise type I or type II error rates are not inflated. The decision rule consists of two levels of testing and can be mathematically stated as:





The testing problem is thus an intersection-union test consisting of G + 1 hypotheses: G hypotheses for the guardrail metrics, and one additional composite hypothesis concerning the at-least-one test for superiority. By applying Lemmas 3.1 and 3.2, we can establish the necessary adjustments for the decision rule, formalized in Proposition 3.1.
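
The two-level structure described here can be written out as follows. This is a hedged sketch in assumed notation, not the authors' displayed formula: S (number of success metrics) and the superscripts "sup" and "ni" are introduced for illustration only, while G is the number of guardrail metrics as in the text.

```latex
% Sketch only. Assumed notation: S success metrics, G guardrail metrics.
% Requires amsmath for \text.
\[
H_0 :
  \Big( \bigcap_{s=1}^{S} H^{\text{sup}}_{0,s} \Big)
  \cup
  \Big( \bigcup_{g=1}^{G} H^{\text{ni}}_{0,g} \Big)
\quad \text{versus} \quad
H_1 :
  \Big( \bigcup_{s=1}^{S} H^{\text{sup}}_{1,s} \Big)
  \cap
  \Big( \bigcap_{g=1}^{G} H^{\text{ni}}_{1,g} \Big) .
\]
```

The outer level is an intersection-union test over G + 1 components, and the first component is itself a union-intersection test over the S success metrics.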











By Lemmas 3.2 and 3.1, and the significance level used by each test given the conditions of the proposition, we know that the upper bound for both these probabilities is α, which means that





Second, we prove that





Proposition 3.1 assumes the existence of at least one success metric. If there is none, Lemma 3.2 can be used directly to see that the same result holds, but with individual power levels 1 − β/G.
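
A minimal sketch of how these adjustments might be computed for Decision Rule 1 is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function name `decision_rule_levels` and its parameter names are hypothetical, the success metrics are assumed to use a Bonferroni level α/S, and the power level 1 − β/(G + 1) is inferred from applying Lemma 3.2 to the G + 1 hypotheses described above.

```python
def decision_rule_levels(alpha: float, beta: float,
                         n_success: int, n_guardrail: int) -> dict:
    """Per-test levels for a ship decision with success and guardrail metrics.

    Assumptions (not confirmed by the source text):
      - success metrics share alpha via Bonferroni (alpha / S),
      - guardrail non-inferiority tests use the unadjusted alpha,
      - every test is powered to 1 - beta / K, where K is the number of
        components in the outer IU test (G + 1 with success metrics, else G).
    """
    if n_success < 0 or n_guardrail < 0:
        raise ValueError("metric counts must be non-negative")
    n_components = n_guardrail + (1 if n_success > 0 else 0)
    if n_components == 0:
        raise ValueError("need at least one metric")
    return {
        "alpha_success": alpha / n_success if n_success > 0 else None,
        "alpha_guardrail": alpha if n_guardrail > 0 else None,
        "power": 1.0 - beta / n_components,
    }
```

With α = 0.05, β = 0.2, two success metrics and three guardrail metrics, this gives a level of 0.025 per success metric, 0.05 per guardrail metric, and an individual power target of 95 percent. With no success metrics and four guardrails, the power target is 1 − β/G, matching the remark above.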

3.4 Power corrections for non-inferiority testing

Power corrections (or, equivalently, "beta corrections") have, to the best of our knowledge, not been discussed in the online experimentation literature. However, the importance of adjusting power in the presence of IU testing is known in the statistical literature (9; 16; 3, ch. 4).



FIG 1. Power for simultaneous rejection of all non-inferiority null hypotheses when the number of guardrail metrics increases and the power level is not adjusted appropriately. All metrics are independent, and individually powered to 80 percent.
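
The effect summarized in the caption can be reproduced with a short calculation. Under the caption's independence assumption, the probability that all G tests reject is the product of the individual powers. The sketch below uses hypothetical function names and compares unadjusted 80 percent power with the corrected level 1 − β/G.

```python
def joint_power(individual_power: float, n_guardrails: int) -> float:
    """Probability that all independent non-inferiority tests reject."""
    if n_guardrails < 1:
        raise ValueError("n_guardrails must be at least 1")
    return individual_power ** n_guardrails


def corrected_individual_power(beta: float, n_guardrails: int) -> float:
    """Bonferroni-style power correction: 1 - beta / G per test."""
    if n_guardrails < 1:
        raise ValueError("n_guardrails must be at least 1")
    return 1.0 - beta / n_guardrails


if __name__ == "__main__":
    beta = 0.2  # nominal type II error rate, i.e. 80 percent power
    for g in (1, 2, 5, 10):
        unadjusted = joint_power(1.0 - beta, g)
        adjusted = joint_power(corrected_individual_power(beta, g), g)
        print(f"G={g:2d}  unadjusted={unadjusted:.3f}  adjusted={adjusted:.3f}")
```

With five independent guardrail metrics each powered to 80 percent, the joint power is 0.8^5, roughly 0.33. With the corrected individual power, the joint power stays at or above the nominal 80 percent.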





Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.


This paper is available on arXiv under the CC BY 4.0 DEED license.

