Table of Links
- Types of Metrics and Their Hypothesis and 2.1 Types of metrics
- Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests
  - 3.1 The composite hypotheses of superiority and non-inferiority tests
  - 3.2 Bounding the type I and type II error rates for UI and IU testing
  - 3.3 Bounding the error rates for a decision rule including both success and guardrail metrics
- Extending the Decision Rule with Deterioration and Quality Metrics
- APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS
- APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES
- APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION
- APPENDIX D: USING NYHOLT'S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS
- Acknowledgments and References
APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION
In the previous sections it is shown that if a fixed-horizon test is used, the type I error rates of the non-inferiority and superiority tests are not affected by adding deterioration tests, under mild conditions on β and the significance level of the deterioration tests. However, deterioration tests are typically added to ensure that the experiment can be aborted if some metrics start to deteriorate. To maintain the type I error rate of the deterioration tests, sequential tests must be used for them. Sequential deterioration tests add complexity to the error rate calculations: even when the rejection regions of the deterioration and superiority/non-inferiority tests do not overlap at the last time point, the deterioration test can be significant at an earlier intermittent analysis even in cases where the superiority or non-inferiority test is significant when the experiment is concluded.
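To make the mechanism explicit in notation of our own (not taken from the paper): let S be the event that the superiority or non-inferiority test rejects at the final analysis T, and let D_k be the event that the sequential deterioration test rejects at intermittent analysis k. Even if S and D_T are disjoint, P(S ∩ (D_1 ∪ … ∪ D_{T−1})) can be positive. If the experiment is aborted whenever some D_k occurs, the realized power is P(S) − P(S ∩ (D_1 ∪ … ∪ D_{T−1})), which is the quantity the simulations below get at.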
It is difficult to bound these risks without referring to a specific test. For example, in the Group Sequential Test (GST) literature, stopping when a treatment effect is going in the wrong direction is possible using so-called futility bounds. It is well known that incorporating such bounds decreases the type I error rate and the power of the superiority test. If one commits to always stopping when the futility bounds are crossed, so-called "binding" futility bounds, the bounds of the superiority test can be adjusted to achieve the intended type I error rate. As we show below, in practice, adding sequential deterioration tests only mildly affects the rates.
Here, we simply estimate the type I error rate and power of the superiority and non-inferiority tests with deterioration, respectively, by Monte Carlo simulation. One hundred intermittent analyses are performed for the deterioration test using a GST[2]. The alternatives and the non-inferiority margin (NIM) are selected to yield 80% power for the fixed-horizon tests without deterioration. One million replications were performed for each setting.
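As a rough illustration of this setup, the following sketch (our own, not the paper's supplementary code) estimates the rejection rate of a fixed-horizon superiority test at the final analysis together with that of a 100-look sequential deterioration test on the same stream of unit-variance outcomes. Uncalibrated O'Brien-Fleming-shaped bounds stand in for the paper's GST, and far fewer replications than the one million used in the paper are run:

```python
import numpy as np

def estimate_rates(delta, n=10_000, looks=100, reps=10_000, seed=1):
    """Monte Carlo sketch: a fixed-horizon superiority z-test at the final
    analysis plus a 100-look sequential deterioration test on the same metric."""
    rng = np.random.default_rng(seed)
    z_alpha = 1.6449                            # one-sided 5% critical value
    t = np.arange(1, looks + 1) / looks         # information fractions
    det_bounds = z_alpha / np.sqrt(t)           # crude O'Brien-Fleming shape (uncalibrated)
    idx = np.round(t * n).astype(int) - 1       # observation index at each look
    sup_hits = det_hits = 0
    for _ in range(reps):
        x = rng.normal(delta, 1.0, n)           # unit-variance outcomes
        z = np.cumsum(x)[idx] / np.sqrt(idx + 1.0)  # z-statistic at each look
        det_hits += np.any(z < -det_bounds)     # deterioration: wrong direction
        sup_hits += z[-1] > z_alpha             # superiority at the final analysis
    return sup_hits / reps, det_hits / reps

# Alternative chosen for ~80% fixed-horizon power: (z_alpha + z_beta) / sqrt(n).
delta_alt = (1.6449 + 0.8416) / np.sqrt(10_000)
print(estimate_rates(0.0))        # null: superiority type I error, false abort rate
print(estimate_rates(delta_alt))  # alternative: power alongside a running deterioration test
```

Under delta = 0 the first rate estimates the superiority test's type I error and the second how often the deterioration test would falsely abort; under delta_alt it shows how much power the running deterioration test removes from a true improvement.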
Table 9 displays the results. The results indicate that, for success metrics, adding deterioration tests to the superiority test imposes negligible conservativeness under both the null and the alternative.
Under the non-inferiority null hypothesis, the type I error rate decreases slightly due to the deterioration test, but to a degree without much practical relevance. The power of the deterioration test is 72%, lower than 80% because the deterioration test uses a GST with 100 intermittent analyses, which decreases the overall power compared to a fixed-horizon test. Under the non-inferiority alternative, the rejection rate of the deterioration test decreases to the expected level of around 5%, and the power of the non-inferiority test is not practically affected by the deterioration test.
It is of course natural to consider sequential tests also for quality tests, such as sample ratio mismatch tests. However, since these tests tend to be separate from the success and guardrail metrics, using sequential tests has no implications. As long as the quality tests that are used are valid, i.e., bound the type I error rate, the propositions apply also with sequential tests for the quality metrics. In the next section we formalize the inclusion of quality metrics in the decision rule.
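Concretely, one such valid quality test is the standard chi-square check for sample ratio mismatch; a minimal sketch (our illustration, not the paper's prescribed test, with a conventional alpha of 0.001):

```python
from scipy.stats import chisquare

def srm_test(n_treatment, n_control, expected_ratio=0.5, alpha=0.001):
    """Chi-square sample ratio mismatch (SRM) test: flags the experiment
    if the observed assignment split deviates from the design ratio."""
    total = n_treatment + n_control
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    stat, p = chisquare([n_treatment, n_control], f_exp=expected)
    return p < alpha  # True => data-quality failure, abort the analysis

# Example: a 50/50 experiment with a suspicious imbalance.
print(srm_test(50_800, 49_200))  # True: this split is very unlikely under 50/50
```

Because the SRM statistic depends only on the assignment counts and not on the success or guardrail metrics, running it, sequentially or not, does not disturb the error rates of the decision rule, consistent with the argument above.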
Authors:
(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;
(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;
(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.
[2] The simulation code is available in the online supplementary material, and it is easy to exchange the GST for any other sequential test.