Ensuring Reliable A/B Test Decisions with Guardrail Metrics

Table of Links

Abstract and 1 Introduction

1.1 Related literature

Types of Metrics and Their Hypothesis and 2.1 Types of metrics

2.2 Hypotheses for different types of metrics
Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests

3.1 The composite hypotheses of superiority and non-inferiority tests

3.2 Bounding the type I and type II error rates for UI and IU testing

3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

3.4 Power corrections for non-inferiority testing
Extending the Decision Rule with Deterioration and Quality Metrics
Monte Carlo Simulation Study

5.1 Results
Discussion and Conclusions

APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS

Acknowledgments and References

3.2 Bounding the type I and type II error rates for UI and IU testing

Because the at-least-one test for superiority is based on the union-intersection (UI) testing principle, the individual tests must be carried out at an adjusted significance level to ensure that the family-wise error rate is bounded by the nominal significance level. On the other hand, no adjustment is necessary to the nominal power level used when designing the experiment: the at-least-one testing principle implies that the composite test has a family-wise type II error rate that is bounded from above by the nominal level. For the all-or-none testing used to establish non-inferiority, it is not necessary to apply a multiplicity adjustment to the significance level used in the individual tests.

Contrary to the UI case, only the type II error rate must be adjusted for tests that rely on the IU principle. We formalize this in Lemma 3.2.

LEMMA 3.2 (Significance and power level adjustments for intersection-union testing). Consider an intersection-union testing problem of M hypotheses of the form

The results of the lemmas can be understood intuitively by considering in which situations multiple chances arise. In the UI testing setting of Lemma 3.1, it suffices that at least one individual null hypothesis is rejected for the overall hypothesis to be rejected. Since each individual hypothesis H_i constitutes one chance of rejecting the overall hypothesis, we must correct for these multiple possibilities. On the other hand, in the IU testing case of Lemma 3.2, all individual null hypotheses must be rejected for the overall hypothesis to be rejected. Equivalently stated, there are multiple chances of not rejecting, and this is what we must correct for. Therefore, to ensure that the power of the IU test is bounded from below by the desired nominal level, the type II error rate β used in each individual test is multiplicity adjusted using a standard Bonferroni correction.
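The intuition above can be checked numerically. The following is a minimal sketch, assuming the individual tests are independent and that a Bonferroni split (α/M for the UI test, β/M for the IU test) is used; the function names and the values of α, β and M are illustrative and do not come from the paper. Independence is assumed only so that the composite rates have a closed form; the Bonferroni bounds themselves do not require it.

```python
def ui_family_wise_type_i(alpha: float, m: int) -> float:
    """Family-wise type I error rate of an at-least-one (UI) test when all m
    nulls are true, the tests are independent, and each is run at alpha / m."""
    return 1.0 - (1.0 - alpha / m) ** m


def iu_power(beta_per_test: float, m: int) -> float:
    """Power of an all-or-none (IU) test when all m alternatives are true,
    the tests are independent, and each has type II error rate beta_per_test."""
    return (1.0 - beta_per_test) ** m


# Illustrative values (assumptions, not taken from the paper).
alpha, beta, m = 0.05, 0.20, 5

fwer = ui_family_wise_type_i(alpha, m)      # stays at or below alpha
power_adjusted = iu_power(beta / m, m)      # stays at or above 1 - beta
power_unadjusted = iu_power(beta, m)        # falls well below 1 - beta

print(f"UI family-wise type I error: {fwer:.4f}")
print(f"IU power with beta/M per test: {power_adjusted:.4f}")
print(f"IU power with beta per test:   {power_unadjusted:.4f}")
```

Without the β adjustment, each of the M tests is one more chance of failing to reject, so the composite power shrinks quickly as M grows.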

3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

The decision rule we use combines superiority and non-inferiority testing, and appropriate adjustments to individual-level tests follow immediately from the results of the previous section. Before we discuss these adjustments, we first describe the decision rule.

DECISION RULE 1. Ship the change if and only if:

at least one success metric is significantly superior, and
all guardrail metrics are significantly non-inferior.

That is, ship if and only if the global superiority/non-inferiority null hypothesis is rejected in favor of the alternative hypothesis.
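Decision Rule 1 can be written as a small predicate. This is a sketch under stated assumptions: each metric is summarized by a one-sided p-value (superiority tests for success metrics, non-inferiority tests for guardrail metrics), and the significance levels passed in have already been adjusted as required. The function `ship` and its parameter names are hypothetical.

```python
from typing import Sequence


def ship(
    success_p_values: Sequence[float],
    guardrail_p_values: Sequence[float],
    alpha_success: float,
    alpha_guardrail: float,
) -> bool:
    """Return True if and only if at least one success metric is significantly
    superior and every guardrail metric is significantly non-inferior."""
    # At-least-one (UI) part: a single rejection among success metrics suffices.
    any_superior = any(p <= alpha_success for p in success_p_values)
    # All-or-none (IU) part: every guardrail null must be rejected.
    all_non_inferior = all(p <= alpha_guardrail for p in guardrail_p_values)
    return any_superior and all_non_inferior


# Hypothetical p-values: two success metrics and two guardrail metrics.
print(ship([0.20, 0.01], [0.03, 0.04], alpha_success=0.025, alpha_guardrail=0.05))
print(ship([0.20, 0.01], [0.03, 0.20], alpha_success=0.025, alpha_guardrail=0.05))
```

In the second call one guardrail fails its non-inferiority test, so the change is not shipped even though a success metric is significant.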

We can now establish how to adjust Decision Rule 1 to ensure that family-wise type I or type II error rates are not inflated. The decision rule consists of two levels of testing and can be mathematically stated as:

The testing problem is thus an intersection-union test consisting of G + 1 hypotheses: G hypotheses for the guardrail metrics, and one additional composite hypothesis concerning the at-least-one test for superiority. By applying Lemmas 3.1 and 3.2, we can establish the necessary adjustments for the decision rule, formalized in Proposition 3.1.

By Lemmas 3.2 and 3.1, and the significance level used by each test given the conditions of the proposition, we know that the upper bound for both these probabilities is α, which means that

Second, we prove that

Proposition 3.1 assumes the existence of at least one success metric. If there is none, Lemma 3.2 can be used directly to see that the same result holds, but with individual power levels 1 − β/G.
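The adjustments discussed in this section can be collected into a small design helper. This is a sketch, not a restatement of Proposition 3.1: the per-test type II error rates β/(G + 1) and β/G follow from the text above, while the Bonferroni split α/S across S success metrics is an assumption here, since the statement of Lemma 3.1 is not reproduced in this section. The names `DesignLevels` and `design_levels` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DesignLevels:
    alpha_success: Optional[float]  # per-test level for success metrics, if any
    alpha_guardrail: float          # per-test level for guardrail metrics
    beta_per_test: float            # per-test type II error rate used for design


def design_levels(alpha: float, beta: float, n_success: int, n_guardrail: int) -> DesignLevels:
    """Per-test significance and type II error levels for the decision rule."""
    if n_success > 0:
        # Assumed Bonferroni split for the at-least-one (UI) superiority test.
        alpha_success = alpha / n_success
        # The IU test has G guardrail hypotheses plus one composite hypothesis.
        n_iu_hypotheses = n_guardrail + 1
    else:
        alpha_success = None
        n_iu_hypotheses = n_guardrail
    if n_iu_hypotheses == 0:
        raise ValueError("at least one success or guardrail metric is required")
    # Guardrail tests are run at the unadjusted alpha; only beta is split.
    return DesignLevels(alpha_success, alpha, beta / n_iu_hypotheses)


levels = design_levels(alpha=0.05, beta=0.20, n_success=2, n_guardrail=3)
print(levels)
```

With two success metrics and three guardrail metrics, each individual test in this sketch is designed for power 1 − 0.20/4 = 0.95 rather than the nominal 0.80.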

3.4 Power corrections for non-inferiority testing

Power corrections (or, equivalently, "beta corrections") have, to the best of our knowledge, not been discussed in the online experimentation literature. However, the importance of adjusting power in the presence of IU testing is known in the statistical literature (9; 16; 3, ch. 4).
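To give a sense of what a beta correction costs in practice, the sketch below compares required sample sizes with and without it. It assumes a one-sided z-test, for which the required sample size is proportional to (z_{1−α} + z_{1−β})² when the effect size and variance are held fixed; this test choice and the function name `sample_size_inflation` are illustrative assumptions, not part of the paper.

```python
from statistics import NormalDist


def sample_size_inflation(alpha: float, beta: float, n_iu_hypotheses: int) -> float:
    """Ratio of required sample sizes for a one-sided z-test designed with
    type II error rate beta / n_iu_hypotheses versus an unadjusted beta."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    unadjusted = (z(1 - alpha) + z(1 - beta)) ** 2
    adjusted = (z(1 - alpha) + z(1 - beta / n_iu_hypotheses)) ** 2
    return adjusted / unadjusted


# Illustrative values: alpha = 0.05, beta = 0.20, four IU hypotheses.
ratio = sample_size_inflation(alpha=0.05, beta=0.20, n_iu_hypotheses=4)
print(f"Sample size inflation factor: {ratio:.2f}")
```

The factor grows with the number of hypotheses in the IU test, which is why the number of guardrail metrics matters at the design stage.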

Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.

This paper is available on arXiv under the CC BY 4.0 DEED license.
