What You Don't Know About AB Testing
The content of this article was originally written by Tom Hoddes on the AvidLifeMedia blog.
It’s true, what you don’t know about AB testing … could kill you! Seriously, if you’ve ever run multiple AB tests at the same time, adjusted proportions during a running test or staged the rollout of a new feature, you may have run into any number of complications without even knowing it. Read on to learn how to safely run complex, dynamic tests.
If you haven’t started AB testing by now, or if you’re still doing it the old-fashioned way, you may be missing out on a great deal of potential for increasing signups, conversions and other key metrics on your site or app.
For those of you who haven’t heard of AB tests, we recommend that you do a bit of reading before diving into this guide. In the meantime, here’s a quick and dirty description of the classic AB test:
You have a new feature and you want to know how it affects your users’ behaviour. You pick a metric that’s important to you (e.g. conversion). You roll out your feature to a subset of users with some fixed percentage (e.g. 50%). Once you have a statistically significant result, you either roll out the feature to all users or scrap it.
So you’ve been AB testing things like crazy with good results. Can this process be improved upon? You bet.
Many people have gone further than classic AB testing, and some newer methods have been shown to work better. The tradeoff is increased complexity, and if you’re not careful, you could be making mistakes without realizing it.
In this series of posts, we’ll go over some techniques to improve on AB testing as well as a number of easily overlooked problems that may arise as a result of them.
We’ll start with an introduction to dynamic testing. Dynamic testing means constantly changing the proportions of your test groups. There are several reasons for doing this:
Optimal Experimental Design and the Multi-Armed Bandit Problem
The classic AB test runs until you are extremely certain that one option is better than another. The standard for “extremely certain” is usually a confidence level of 95% or higher from a Z-test. Until this certainty is reached, the proportions remain constant. Once it is reached, the losing option is thrown out completely.
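To make the 95% criterion concrete, here is a minimal sketch of a two-proportion Z-test in Python. The function name and the sample counts are made up for illustration, and a pooled two-sided test is only one reasonable way to set this up.

```python
import math

def two_proportion_z_test(conv_a, total_a, conv_b, total_b):
    """Return (z, two-sided p-value) for H0: A and B convert at the same rate."""
    rate_a = conv_a / total_a
    rate_b = conv_b / total_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical counts: 500 of 5000 users convert on A, 575 of 5000 on B.
z, p = two_proportion_z_test(500, 5000, 575, 5000)
significant = p < 0.05  # 95% confidence threshold
print(f"z = {z:.2f}, p = {p:.4f}, significant = {significant}")
```

In the classic test you would keep the split fixed and only act once `significant` becomes true.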
Optimal experimental design is an alternative approach that aims to make the best decision you can with the information you have. The idea is that you constantly change the proportion of users who see option B based on how certain you are that it is better or worse than option A.
We won’t get into the details of this since the folks at Untyped have already written a great explanation of what optimal design is and why it is better than AB testing:
Stop AB Testing And Make Out Like A Bandit
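For a rough feel of how proportions can shift as evidence accumulates, here is a small simulation of Thompson sampling, one common bandit strategy. This is our own illustrative sketch, not necessarily the method described in the linked article, and the conversion rates are deliberately exaggerated.

```python
import random

def thompson_sampling(true_rates, rounds, rng):
    """Simulate Thompson sampling; return how many users each option received."""
    n = len(true_rates)
    successes = [0] * n
    failures = [0] * n
    shown = [0] * n
    for _ in range(rounds):
        # Draw a plausible conversion rate for each option from its
        # Beta(successes + 1, failures + 1) posterior and show the best draw.
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                 for i in range(n)]
        arm = draws.index(max(draws))
        shown[arm] += 1
        # Simulate whether this user converts (true rates are hypothetical).
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return shown

# Option A converts at 5%, option B at 50% (exaggerated for illustration).
shown = thompson_sampling([0.05, 0.50], rounds=2000, rng=random.Random(0))
print(f"users shown A: {shown[0]}, users shown B: {shown[1]}")
```

The point to notice is that the share of traffic going to B is never fixed: it drifts upward as B keeps winning, which is exactly the kind of changing proportion discussed in the rest of this post.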
Staged rollouts are another form of dynamic testing. Rolling out a new feature can be risky for a number of reasons. Perhaps your hardware won’t be able to handle some extra load caused by the feature. Perhaps an overly negative response is a concern, so you don’t want to risk testing on a high percentage of users. Whatever the risk, the answer is a staged rollout. Start the feature at a small percentage like 5% and if nothing blows up, slowly ramp it up to 10%, 15% and so on.
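One way to implement such a ramp, sketched here under our own assumptions, is to hash each user into one of 100 buckets and enable the feature for every bucket below the current percentage. The function name, feature key and user IDs are hypothetical.

```python
import hashlib

def in_rollout(user_id, feature, percent):
    """Return True if the user's bucket (0-99) is below the rollout percentage."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Hypothetical user IDs and feature name.
users = [f"user-{i}" for i in range(10000)]
at_5 = {u for u in users if in_rollout(u, "new-feature", 5)}
at_15 = {u for u in users if in_rollout(u, "new-feature", 15)}

print(f"enabled at 5%: {len(at_5)}, enabled at 15%: {len(at_15)}")
print("everyone from the 5% stage is still enabled:", at_5 <= at_15)
```

Because the bucket depends only on the user and the feature, raising the percentage only ever adds users, so nobody flips back and forth between variants during the ramp.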
So these two testing techniques are both very useful and involve changing proportions over time. What’s the problem? Well let’s say something else on your site is changing your conversion rate while you’re changing the proportions of the test. Combine that with a naive test evaluation and BLAMO! You just skewed your results. How? This is best illustrated with an example:
Let’s say we’re running an AB test with 95% of users in Group A and 5% in Group B. We’ll call this Period 1. Nothing blows up and we get no angry emails, so we bump B up to 15% and let it run a little longer. We’ll call this Period 2. Then we evaluate by calculating the conversion rates of test groups A and B.
We’ll cook up some fake numbers to illustrate the problem.
Assume that B always converts 10% better than A in relative terms (for example, 11% versus 10%).
Assume that in Period 2 the overall conversion drops heavily by 35% due to some independent change on the site, like another new feature or an influx of bad traffic, etc.
So it would look something like this:
| Period | A (Converted / Total) | B (Converted / Total) |
|--------|-----------------------|-----------------------|
| P1     | 950 / 9500 (10%)      | 55 / 500 (11%)        |
| P2     | 553 / 8500 (6.5%)     | 107 / 1500 (7.13%)    |
| Total  | 1503 / 18000 (8.35%)  | 162 / 2000 (8.1%)     |
The bottom row is the naive evaluation mentioned before. We take:
(Converted of P1 + Converted of P2) / (Total P1 + Total P2) for both A and B.
And there you have it, skewed results. If you look at the naive evaluation it shows that A has a higher conversion than B overall, but if we look at the periods individually it’s clear that B is always better than A.
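The naive evaluation is easy to reproduce. The following sketch plugs in the fake numbers from the table; the variable and function names are our own.

```python
# Fake numbers from the table: (converted, total) per group and period.
data = {
    "A": {"P1": (950, 9500), "P2": (553, 8500)},
    "B": {"P1": (55, 500), "P2": (107, 1500)},
}

def naive_conversion(periods):
    """Pool all periods together: sum of conversions over sum of users."""
    converted = sum(c for c, _ in periods.values())
    total = sum(t for _, t in periods.values())
    return converted / total

naive = {group: naive_conversion(periods) for group, periods in data.items()}
per_period = {
    group: {period: c / t for period, (c, t) in periods.items()}
    for group, periods in data.items()
}

for group in data:
    print(f"{group}: naive {naive[group]:.2%}, "
          f"P1 {per_period[group]['P1']:.2%}, P2 {per_period[group]['P2']:.2%}")
```

B wins in each period taken separately, yet A comes out ahead once the periods are pooled.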
Before diving into a solution, let’s give a little insight into how we skewed the results. The problem here is that the two test groups drew different proportions of their users from the “good time” (Period 1) and the “bad time” (Period 2). This is illustrated below.
From these graphs, it is easy to see that Group A has a much higher proportion of “good time” users than Group B. This resulted in Group A appearing better than it really is when compared to Group B.
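The same imbalance can be read straight off the example’s user counts. This short sketch (names are ours) computes the share of each group’s users that came from Period 1.

```python
# Users per group and period, taken from the example's fake numbers.
totals = {
    "A": {"P1": 9500, "P2": 8500},
    "B": {"P1": 500, "P2": 1500},
}

# Fraction of each group's users that came from the "good time" (Period 1).
good_time_share = {
    group: t["P1"] / (t["P1"] + t["P2"]) for group, t in totals.items()
}

for group, share in good_time_share.items():
    print(f"Group {group}: {share:.1%} of users from Period 1")
```

Roughly 53% of Group A’s users come from the good period, compared with 25% of Group B’s.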
So a naive evaluation fails here. What would a non-naive evaluation look like?
We can take the conversion rate of Group A in Period 1 and treat it as if Group A had been the entire user population of Period 1. Do the same for Period 2, then add the two together, weighting each period by its share of all users.
This equates to:

$$E'(C \mid A) \triangleq E(C \mid A, P_1) \cdot P(P_1) + E(C \mid A, P_2) \cdot P(P_2)$$

where

$E(C \mid A, P_1)$ = “Expected Value of Conversion of users in Group A and Period 1”

$P(P_1)$ = “Probability of user being in Period 1”

$E'(C \mid A)$ ≜ “The Normalized Expected Conversion of users in Group A”

Which results in:

$$E'(C \mid A) = 0.10 \cdot 0.5 + 0.065 \cdot 0.5 = 0.0825 = 8.25\%$$

Using the same equation for B gives:

$$E'(C \mid B) = 0.0907 = 9.07\%$$
Success! This gives us the answer that B is better than A. More importantly, it shows that B is about 10% better than A (9.07% / 8.25% ≈ 1.10), which matches the assumption we started with.
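The normalized evaluation can be sketched in a few lines as well. The code below uses the example’s fake numbers and our own function name; it weights each period’s conversion rate by that period’s share of all users, which is 50% for both periods here.

```python
# Fake numbers from the example: (converted, total) per group and period.
data = {
    "A": {"P1": (950, 9500), "P2": (553, 8500)},
    "B": {"P1": (55, 500), "P2": (107, 1500)},
}

def normalized_conversion(group, data):
    """Weight each period's conversion rate by that period's share of all users."""
    period_totals = {}
    for periods in data.values():
        for period, (_, total) in periods.items():
            period_totals[period] = period_totals.get(period, 0) + total
    all_users = sum(period_totals.values())
    return sum(
        (converted / total) * (period_totals[period] / all_users)
        for period, (converted, total) in data[group].items()
    )

norm_a = normalized_conversion("A", data)
norm_b = normalized_conversion("B", data)
print(f"A: {norm_a:.2%}, B: {norm_b:.2%}, B/A: {norm_b / norm_a:.3f}")
```

With unrounded per-period rates the results land at about 8.25% for A and 9.07% for B, a ratio of roughly 1.10.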
In conclusion, we’ve presented a potential problem with dynamic testing and a solution without much explanation or intuition regarding why it works. We also haven’t mentioned statistical validity yet. In fact, we assumed no statistical noise in this example. In the next few posts, we will show how statistical validity plays a part in all this. We will also go more in-depth into dynamic testing, user segmentation and try to give some insight as to where this equation comes from and how it can be generalized.