Evaluating False Positives and Sequential Testing in Experimentation

by AB Test, April 1st, 2025

Too Long; Didn't Read

This appendix examines the effects of sequential deterioration tests on statistical accuracy. Monte Carlo simulations show minimal impact on error rates and power, ensuring reliable decision-making in experiments.


Abstract and 1 Introduction

1.1 Related literature

  2. Types of Metrics and Their Hypotheses and 2.1 Types of metrics

    2.2 Hypotheses for different types of metrics

  3. Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests

    3.1 The composite hypotheses of superiority and non-inferiority tests

    3.2 Bounding the type I and type II error rates for UI and IU testing

    3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

    3.4 Power corrections for non-inferiority testing

  4. Extending the Decision Rule with Deterioration and Quality Metrics

  5. Monte Carlo Simulation Study

    5.1 Results

  6. Discussion and Conclusions


APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS


Acknowledgments and References


APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES



APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

In the previous sections it is shown that, if a fixed-horizon test is used, the type I error rates of the non-inferiority and superiority tests are not affected by adding deterioration tests under mild conditions on β and α_. However, deterioration tests are typically added to make it possible to abort the experiment if some metrics start to deteriorate. To maintain the type I error rate for the deterioration tests (α_) under such monitoring, sequential tests must be used for the deterioration tests. Sequential deterioration tests add complexity to the error rate calculations: even when the rejection regions of the deterioration and the superiority/non-inferiority tests do not overlap at the last time point, the deterioration test may be significant at an earlier intermittent analysis even though the superiority or non-inferiority test is significant when the experiment is concluded.
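To make the mechanism concrete, the event can be written in symbols (notation introduced here for illustration only, not taken from the paper): let Z_k denote the cumulative z-statistic of the monitored metric at intermittent analysis k = 1, ..., K, let c_sup be the fixed-horizon superiority critical value, and let c_k be the sequential deterioration bound at look k. For a success metric, the decision counted as significant is then

P(sig. decision) = P( Z_K > c_sup and Z_k > -c_k for all k = 1, ..., K ) <= P( Z_K > c_sup ),

so the sequential deterioration test can only leave the superiority rejection rate unchanged or lower it, under the null as well as under the alternative; the simulations below quantify how large that reduction is in practice.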



It is difficult to bound these risks without referring to a specific test. For example, in the Group Sequential Test (GST) literature, stopping when a treatment effect is going in the wrong direction is possible using so-called futility bounds. It is well known that the incorporation of such bounds decreases the type I error rate and the power of the superiority test. If one commits to always stopping when the futility bounds are crossed, so-called "binding futility bounds", the bounds for the superiority test can be adjusted to achieve the intended type I error rate. As we show below, in practice, adding sequential deterioration tests only mildly affects these rates.


Here, we simply estimate the type I error rate and power of the superiority test and the non-inferiority test with deterioration, respectively, by Monte Carlo simulation. One hundred intermittent analyses are performed for the deterioration test using a GST[2]. The alternatives and the non-inferiority margin (NIM) are selected to yield 80% power for the fixed-horizon tests without deterioration. One million replications were performed for each setting.
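The footnote points to the authors' supplementary code; what follows is only a minimal, self-contained sketch of the same kind of simulation for the success-metric case. It assumes a single standardized metric, a one-sided fixed-horizon superiority test at the final analysis, and a sequential deterioration test monitored at 100 intermittent analyses. Instead of a proper GST boundary it uses a conservative Bonferroni-style per-look bound, so the exact numbers will differ from Table 9; all variable names and settings are illustrative assumptions.

import numpy as np
from scipy.stats import norm

# Minimal Monte Carlo sketch (illustrative, not the authors' supplementary code).
# One success metric: a fixed-horizon one-sided superiority test at the final
# analysis, combined with a sequential deterioration test in the "wrong" direction
# monitored at 100 intermittent analyses. A conservative Bonferroni-style
# per-look bound stands in for a proper GST boundary.

rng = np.random.default_rng(2024)

n_looks = 100                               # intermittent analyses for the deterioration test
alpha_sup = 0.05                            # one-sided level of the superiority test
alpha_det = 0.05                            # overall level of the deterioration test
n_reps = 100_000                            # increase to 1_000_000 to mimic the paper's precision

z_sup = norm.ppf(1 - alpha_sup)             # fixed-horizon superiority critical value
z_det = norm.ppf(1 - alpha_det / n_looks)   # per-look deterioration bound (Bonferroni split)
drift = norm.ppf(1 - alpha_sup) + norm.ppf(0.80)  # standardized effect giving ~80% fixed-horizon power

def simulate(theta):
    """Return (deterioration rejection rate, 'sig. decision' rate) for drift theta."""
    t = np.arange(1, n_looks + 1) / n_looks           # information fractions
    dt = np.diff(np.concatenate(([0.0], t)))
    det, sig = 0, 0
    for _ in range(n_reps):
        # Cumulative z-statistics: Z_k = (B(t_k) + theta * t_k) / sqrt(t_k),
        # where B is a standard Brownian motion on the information scale.
        b = np.cumsum(rng.normal(0.0, np.sqrt(dt))) + theta * t
        z = b / np.sqrt(t)
        deteriorated = np.any(z < -z_det)             # stop: metric significantly worse at some look
        superior = z[-1] > z_sup                      # fixed-horizon superiority test at the last look
        det += deteriorated
        sig += superior and not deteriorated
    return det / n_reps, sig / n_reps

print("null (theta = 0):        det rate, sig. decision =", simulate(0.0))
print("alternative (80% power): det rate, sig. decision =", simulate(drift))

Under the null, the deterioration rate should land near (somewhat below) alpha_det and the "sig. decision" rate just below alpha_sup, mirroring the mild conservativeness reported in Table 9; with a genuine GST boundary in place of the Bonferroni split, the figures move closer to the paper's.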


Table 9 displays the results. For success metrics, the results indicate that adding deterioration tests on top of the superiority test imposes only negligible conservativeness under both the null and the alternative.


Under the non-inferiority null hypothesis, the type I error rate decreases slightly due to the deterioration test, but to a degree without much practical relevance. The power of the deterioration test is 72%, which is lower than 80% because the deterioration test uses a GST with 100 intermittent analyses, which decreases the overall power compared to a fixed-horizon test. Under the non-inferiority alternative, the rejection rate of the deterioration test decreases to around the expected 5%, and the power of the non-inferiority test is not practically affected by the deterioration test.


It is of course natural to consider sequential tests also for quality tests, such as sample ratio mismatch. However, since these tests tend to be separate from the success and guardrail metrics, using sequential tests for them has no implications for the error rates of the decision rule. As long as the quality tests that are used are valid, i.e., bound their type I error rates, the propositions apply also with sequential tests for the quality metrics. In the next section, we formalize the inclusion of quality metrics in the decision rule.


TABLE 9: False and true positive rates for the deterioration tests, the superiority test, and the non-inferiority test under the null and alternative hypotheses of the superiority and non-inferiority tests. The "sig. decision" column displays the proportion of replications that had no significant deterioration and rejected the null hypothesis of the superiority or non-inferiority test, respectively.


Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.


This paper is available on arxiv under CC BY 4.0 DEED license.

[2] The simulation code is available in the online supplementary material, and it is easy to exchange the GST for any other sequential test.
