Table of Links
- Types of Metrics and Their Hypothesis and 2.1 Types of metrics
- Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests
  - 3.1 The composite hypotheses of superiority and non-inferiority tests
  - 3.2 Bounding the type I and type II error rates for UI and IU testing
  - 3.3 Bounding the error rates for a decision rule including both success and guardrail metrics
- Extending the Decision Rule with Deterioration and Quality Metrics
- APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS
- APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES
- APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION
- APPENDIX D: USING NYHOLT'S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS
- Acknowledgments and References
APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION
In the previous sections it is shown that if a fixed-horizon test is used, the type I error rates of the non-inferiority and superiority tests are not affected by adding deterioration tests, under mild conditions on β and the significance level of the deterioration tests. However, deterioration tests are typically added to ensure that the experiment can be aborted if some metrics start to deteriorate. To maintain the type I error rate of the deterioration tests, sequential tests must be used for them. Sequential deterioration tests add complexity to the error rate calculations: even when the rejection regions of the deterioration and superiority/non-inferiority tests do not overlap at the last time point, the deterioration test can be significant at an earlier intermittent analysis even in cases where the superiority or non-inferiority test is significant when the experiment is concluded.
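To make the mechanism explicit in notation of our own (not taken from the paper): let S be the event that the superiority or non-inferiority test rejects at the final analysis T, and let D_k be the event that the sequential deterioration test rejects at intermittent analysis k. Even if S and D_T are disjoint, P(S ∩ (D_1 ∪ … ∪ D_{T−1})) can be positive. If the experiment is aborted whenever some D_k occurs, the realized power is P(S) − P(S ∩ (D_1 ∪ … ∪ D_{T−1})), which is the quantity the simulations below get at.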
It is difficult to bound these risks without referring to a specific test. For example, in the Group Sequential Test (GST) literature, stopping when a treatment effect is going in the wrong direction is possible using so-called futility bounds. It is well known that incorporating such bounds decreases the type I error rate and the power of the superiority test. If one commits to always stopping when the futility bounds are crossed, so-called "binding" futility bounds, the bounds of the superiority test can be adjusted to achieve the intended type I error rate. As we show below, in practice, adding sequential deterioration tests only mildly affects the rates.
Here, we simply estimate the type I error rate and power of the superiority and non-inferiority tests with deterioration, respectively, by Monte Carlo simulation. One hundred intermittent analyses are performed for the deterioration test using a GST[2]. The alternatives and the non-inferiority margin (NIM) are selected to yield 80% power for the fixed-horizon tests without deterioration. One million replications were performed for each setting.
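As a rough illustration of this setup, the following sketch (our own, not the paper's supplementary code) estimates the rejection rate of a fixed-horizon superiority test at the final analysis together with that of a 100-look sequential deterioration test on the same stream of unit-variance outcomes. Uncalibrated O'Brien-Fleming-shaped bounds stand in for the paper's GST, and far fewer replications than the one million used in the paper are run:

```python
import numpy as np

def estimate_rates(delta, n=10_000, looks=100, reps=10_000, seed=1):
    """Monte Carlo sketch: a fixed-horizon superiority z-test at the final
    analysis plus a 100-look sequential deterioration test on the same metric."""
    rng = np.random.default_rng(seed)
    z_alpha = 1.6449                            # one-sided 5% critical value
    t = np.arange(1, looks + 1) / looks         # information fractions
    det_bounds = z_alpha / np.sqrt(t)           # crude O'Brien-Fleming shape (uncalibrated)
    idx = np.round(t * n).astype(int) - 1       # observation index at each look
    sup_hits = det_hits = 0
    for _ in range(reps):
        x = rng.normal(delta, 1.0, n)           # unit-variance outcomes
        z = np.cumsum(x)[idx] / np.sqrt(idx + 1.0)  # z-statistic at each look
        det_hits += np.any(z < -det_bounds)     # deterioration: wrong direction
        sup_hits += z[-1] > z_alpha             # superiority at the final analysis
    return sup_hits / reps, det_hits / reps

# Alternative chosen for ~80% fixed-horizon power: (z_alpha + z_beta) / sqrt(n).
delta_alt = (1.6449 + 0.8416) / np.sqrt(10_000)
print(estimate_rates(0.0))        # null: superiority type I error, false abort rate
print(estimate_rates(delta_alt))  # alternative: power alongside a running deterioration test
```

Under delta = 0 the first rate estimates the superiority test's type I error and the second how often the deterioration test would falsely abort; under delta_alt it shows how much power the running deterioration test removes from a true improvement.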
Table 9 displays the results. The results indicate that, for success metrics, adding deterioration tests to the superiority test imposes negligible conservativeness under both the null and the alternative.
Under the non-inferiority null hypothesis, the type I error rate decreases slightly due to the deterioration test, but to a degree without much practical relevance. The power of the deterioration test is 72%, lower than 80% because the deterioration test uses a GST with 100 intermittent analyses, which decreases the overall power compared to a fixed-horizon test. Under the non-inferiority alternative, the rejection rate of the deterioration test decreases to the expected level of around 5%, and the power of the non-inferiority test is not practically affected by the deterioration test.
It is of course natural to consider sequential tests also for quality tests, such as sample ratio mismatch tests. However, since these tests tend to be separate from the success and guardrail metrics, using sequential tests has no implications. As long as the quality tests that are used are valid, i.e., bound the type I error rate, the propositions apply also with sequential tests for the quality metrics. In the next section we formalize the inclusion of quality metrics in the decision rule.
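Concretely, one such valid quality test is the standard chi-square check for sample ratio mismatch; a minimal sketch (our illustration, not the paper's prescribed test, with a conventional alpha of 0.001):

```python
from scipy.stats import chisquare

def srm_test(n_treatment, n_control, expected_ratio=0.5, alpha=0.001):
    """Chi-square sample ratio mismatch (SRM) test: flags the experiment
    if the observed assignment split deviates from the design ratio."""
    total = n_treatment + n_control
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    stat, p = chisquare([n_treatment, n_control], f_exp=expected)
    return p < alpha  # True => data-quality failure, abort the analysis

# Example: a 50/50 experiment with a suspicious imbalance.
print(srm_test(50_800, 49_200))  # True: this split is very unlikely under 50/50
```

Because the SRM statistic depends only on the assignment counts and not on the success or guardrail metrics, running it, sequentially or not, does not disturb the error rates of the decision rule, consistent with the argument above.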
Authors:
(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;
(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;
(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.
[2] The simulation code is available in the online supplementary material, and it is easy to exchange the GST for any other sequential test.