Equal or unequal cell sizes in A/B testing?

Tom Haxton
Senior Data Scientist, Chegg
August 30, 2016
At Chegg we often run A/B tests to measure differences in conversion rates when
we change a webpage design. Most often, we split traffic evenly into control and
experimental cells. However, for a variety of reasons we sometimes allocate only
a small fraction of our incoming traffic (e.g. 5%) to the experimental cell. In
these cases, we need to decide which control group to compare to the small
experimental group. In a truly randomized experiment, our results should not
depend on our choice of control group, because sample means are unbiased
estimators of population means. Thus, to reach a desired level of statistical
certainty fastest, we would want to use the entire remaining 95% of traffic as
the control group. However, at Chegg we have found that our test results (e.g.
differences in conversion rates) can vary if we use an imbalanced control group
(95%) vs an equal control group (5%, with a 90% “holdout cell” removed from
analysis).
I looked on the web for any discussion on why A/B test results could depend on
the size of the control cell. I found conflicting advice on whether to use equal
or unequal control cell size and no explanation of why, except for those pointing
out that confidence intervals calculated assuming normal distributions will be
less accurate when cell sizes are smaller. So I constructed a minimal theoretical
model for A/B tests measuring conversion and solved the model to find out if
measured conversion rates depend on cell sizes.
For those interested in the details, read on to the next section. If you just
want the punchline, it turns out that the dependence of test results on cell size
comes from a combination of two effects: (1) unconverted (anonymous) visitors
coming back to a website with an identity that cannot be linked to their first
visit (e.g. on a new device or with cookies turned off) and (2) return rates
and/or second-visit conversion rates varying between control and experiment
experiences. The first effect reminds us that A/B tests on anonymous web
traffic are not truly randomized experiments, because the anonymous visitors
we treat as independent may in fact be the same people.

So what do we do? In the model I found that test results will usually be most
accurate when we use equal-size experimental and control cells, so I recommend
using equal-size cells with a holdout cell whenever a 50/50 split is not
appropriate. However, I found that even in this case results will not in general agree
with what we would measure if we could track visitors perfectly. This is another
reminder that A/B test results on anonymous web traffic must be taken with a
grain of salt.
In the following sections, I will discuss (1) the model and math leading to the
results, (2) trends, and (3) the case of equal cell sizes.
1 Model and math
For simplicity, assume we have only one experimental cell. We want to know
whether the difference in conversion rates that we measure depends on the
size of the control cell and, if so, why. This approach should generalize to
multiple experimental cells and to metrics other than conversion that are driven
by conversion.
We have an experimental cell of size $f$ and a control cell of size $1 - f$. Assume
that visitors convert on their first visit to the control (experimental) cell with
probability $p^c_1$ ($p^e_1$). Assume that they do not convert but return with
probability $r^c_1$ ($r^e_1$). Assume that some fraction $d$ of those who return do so
with an identity that cannot be linked with their initial identity, and assume that
there is no interaction between the likelihood of coming back with a new identity
and the other probabilities.
The probability to convert on a second visit can depend on the experience in
both the first and second visits, so there may be four distinct probabilities
to convert on the second visit, $p^{cc}_2$, $p^{ce}_2$, $p^{ec}_2$, and $p^{ee}_2$,
where the first (second) superscript index refers to the first (second) visit.
For simplicity, let’s assume that no one returns for a third visit, but these results
could be generalized to multiple return visits.
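The model described above can be sketched as a small Monte Carlo simulation. This is only an illustration: the function name `simulate`, the cell labels `"c"` and `"e"`, and all numeric parameter values are assumptions made here, not part of the original analysis. The sketch also assumes that a visitor who returns with a new identity is re-randomized into the experimental cell with probability `f`, while a visitor who returns with the same identity stays in the original cell.

```python
import random


def simulate(n_visitors, f, d, p1, r1, p2, seed=0):
    """Simulate the two-visit model; return apparent conversion rate per cell.

    f  : share of traffic sent to the experimental cell
    d  : fraction of returning visitors whose identity cannot be linked
    p1 : first-visit conversion probability per cell ("c" or "e")
    r1 : probability of not converting but returning, per cell
    p2 : second-visit conversion probability keyed by (first_cell, second_cell)
    """
    rng = random.Random(seed)
    identities = {"c": 0, "e": 0}
    conversions = {"c": 0, "e": 0}
    for _ in range(n_visitors):
        first = "e" if rng.random() < f else "c"
        identities[first] += 1
        u = rng.random()
        if u < p1[first]:
            conversions[first] += 1  # converts on the first visit
            continue
        if u >= p1[first] + r1[first]:
            continue  # neither converts nor returns
        if rng.random() < d:
            # Lost identity: counted as a brand-new visitor and re-randomized.
            second = "e" if rng.random() < f else "c"
            identities[second] += 1
        else:
            second = first  # same identity, same cell
        if rng.random() < p2[(first, second)]:
            conversions[second] += 1
    return {cell: conversions[cell] / identities[cell] for cell in ("c", "e")}


# Hypothetical parameter values, chosen only for illustration.
P1 = {"c": 0.10, "e": 0.12}
R1 = {"c": 0.20, "e": 0.30}
P2 = {("c", "c"): 0.05, ("e", "c"): 0.05, ("c", "e"): 0.06, ("e", "e"): 0.06}

for share in (0.05, 0.5):
    rates = simulate(100_000, share, 0.3, P1, R1, P2, seed=1)
    print(f"f={share:.2f}  control={rates['c']:.4f}  experiment={rates['e']:.4f}")
```

Because the simulation is stochastic, small differences between allocations are hard to see without very large samples.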
The number of conversions in the control cell (relative to the total number of
visitors) is
$$
(1 - f)\,p^c_1 + (1 - f)\,r^c_1 (1 - d)\,p^{cc}_2 + (1 - f)\,r^c_1 d\,(1 - f)\,p^{cc}_2 + f\,r^e_1 d\,(1 - f)\,p^{ec}_2. \tag{1}
$$
The first term in Eq. 1 represents visitors who arrive in the control cell and
convert on the first visit. The second term represents visitors who arrive in
the control cell, do not convert but return, return with the same identity, and
convert on the second visit. The third term represents visitors who arrive in the
control cell, do not convert but return, return with a different identity, arrive in
the control cell on their second visit, and convert. The fourth term represents
visitors who arrive in the experimental cell on their first visit, do not convert
but return, return with a different identity, arrive in the control cell on their
second visit, and convert.
Similarly, the number of conversions in the experimental cell (relative to the
total number of visitors) is
$$
f\,p^e_1 + f\,r^e_1 (1 - d)\,p^{ee}_2 + f\,r^e_1 d\,f\,p^{ee}_2 + (1 - f)\,r^c_1 d\,f\,p^{ce}_2. \tag{2}
$$
The number of unique identities counted in the control cell (relative to the total
number of visitors) is
$$
(1 - f) + (1 - f)\,r^c_1 d\,(1 - f) + f\,r^e_1 d\,(1 - f). \tag{3}
$$
The first term in Eq. 3 represents visitors who arrive first in the control cell.
The second term represents visitors who arrive in the control cell, do not convert
but return, return with a different identity, and arrive in the control cell the
second time. The third term represents visitors who arrive in the experimental
cell, do not convert but return, return with a different identity, and arrive in
the control cell the second time.
Similarly, the number of unique identities counted in the experimental cell
(relative to the total number of visitors) is
$$
f + f\,r^e_1 d\,f + (1 - f)\,r^c_1 d\,f. \tag{4}
$$
The apparent conversion rates $p^c$ and $p^e$ are obtained by dividing Eq. 1 by Eq. 3
and Eq. 2 by Eq. 4. We get
$$
p^c = \frac{p^c_1 + r^c_1 (1 - d)\,p^{cc}_2 + r^c_1 d\,(1 - f)\,p^{cc}_2 + f\,r^e_1 d\,p^{ec}_2}{1 + r^c_1 d\,(1 - f) + f\,r^e_1 d} \tag{5}
$$
and
$$
p^e = \frac{p^e_1 + r^e_1 (1 - d)\,p^{ee}_2 + r^e_1 d\,f\,p^{ee}_2 + (1 - f)\,r^c_1 d\,p^{ce}_2}{1 + r^e_1 d\,f + (1 - f)\,r^c_1 d} \tag{6}
$$
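As a quick numerical companion to Eqs. 1 through 6, the sketch below evaluates the conversion and identity counts term by term and divides them. The function name `apparent_rates`, the dictionary layout, and the parameter values are hypothetical choices made for illustration only.

```python
def apparent_rates(f, d, p1, r1, p2):
    """Return (control, experiment) apparent conversion rates (Eqs. 5 and 6).

    f  : share of traffic sent to the experimental cell, 0 < f < 1
    d  : fraction of returning visitors whose identity is lost
    p1 : first-visit conversion probability per cell ("c" or "e")
    r1 : probability of not converting but returning, per cell
    p2 : second-visit conversion probability keyed by (first_cell, second_cell)
    """
    c, e = "c", "e"
    # Eq. 1: conversions credited to the control cell.
    conv_c = ((1 - f) * p1[c]
              + (1 - f) * r1[c] * (1 - d) * p2[c, c]
              + (1 - f) * r1[c] * d * (1 - f) * p2[c, c]
              + f * r1[e] * d * (1 - f) * p2[e, c])
    # Eq. 2: conversions credited to the experimental cell.
    conv_e = (f * p1[e]
              + f * r1[e] * (1 - d) * p2[e, e]
              + f * r1[e] * d * f * p2[e, e]
              + (1 - f) * r1[c] * d * f * p2[c, e])
    # Eqs. 3 and 4: unique identities counted in each cell.
    ids_c = (1 - f) + (1 - f) * r1[c] * d * (1 - f) + f * r1[e] * d * (1 - f)
    ids_e = f + f * r1[e] * d * f + (1 - f) * r1[c] * d * f
    return conv_c / ids_c, conv_e / ids_e


# Hypothetical parameter values, chosen only for illustration.
P1 = {"c": 0.10, "e": 0.12}
R1 = {"c": 0.20, "e": 0.30}
P2 = {("c", "c"): 0.05, ("e", "c"): 0.05, ("c", "e"): 0.06, ("e", "e"): 0.06}

for share in (0.05, 0.5):
    pc, pe = apparent_rates(share, 0.3, P1, R1, P2)
    print(f"f={share:.2f}  p_c={pc:.5f}  p_e={pe:.5f}  diff={pe - pc:.5f}")
```

Setting `d` to 0 in the call removes the dependence on `f`, which is a convenient sanity check on the bookkeeping.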
From Eqs. 5 and 6 we see that if we can always identify visitors perfectly ($d = 0$)
there should be no dependence of apparent conversion rates on allocation size.
In that case
$$
p^c = p^c_1 + r^c_1 p^{cc}_2 \tag{7}
$$
and
$$
p^e = p^e_1 + r^e_1 p^{ee}_2. \tag{8}
$$
However, if we lose some identities ($d > 0$), then the apparent conversion rates
will depend on allocation size unless both the return rates are the same,
$r^e_1 = r^c_1$, and the second-visit conversion rates do not depend on the experience
in the first visit, $p^{cc}_2 = p^{ec}_2$ and $p^{ce}_2 = p^{ee}_2$. If neither
condition holds, the dependence on allocation is complicated (Eqs. 5 and 6). Usually,
we would expect the return rate to differ between cells more than second-visit
conversion depends on first-visit experience, so to get the dominant behavior we
assume that $p^{cc}_2 = p^{ec}_2 \equiv p^c_2$ and $p^{ce}_2 = p^{ee}_2 \equiv p^e_2$. Then
$$
p^c = \frac{p^c_1 + r^c_1 p^c_2 + (r^e_1 - r^c_1)\,d\,f\,p^c_2}{1 + r^c_1 d + f\,(r^e_1 - r^c_1)\,d} \tag{9}
$$
and
$$
p^e = \frac{p^e_1 + r^e_1 p^e_2 + (r^c_1 - r^e_1)\,d\,(1 - f)\,p^e_2}{1 + r^e_1 d + (1 - f)\,(r^c_1 - r^e_1)\,d}. \tag{10}
$$
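The reduction from Eqs. 5 and 6 to Eqs. 9 and 10 can be checked numerically. The sketch below is an illustration under assumed parameter values; the function names `general_rates` and `simplified_rates` are hypothetical. It also shows that `f` drops out of the simplified rates once the two return rates are equal.

```python
def general_rates(f, d, p1c, p1e, r1c, r1e, p2cc, p2ce, p2ec, p2ee):
    """Apparent conversion rates from Eqs. 5 and 6."""
    pc = ((p1c + r1c * (1 - d) * p2cc + r1c * d * (1 - f) * p2cc + f * r1e * d * p2ec)
          / (1 + r1c * d * (1 - f) + f * r1e * d))
    pe = ((p1e + r1e * (1 - d) * p2ee + r1e * d * f * p2ee + (1 - f) * r1c * d * p2ce)
          / (1 + r1e * d * f + (1 - f) * r1c * d))
    return pc, pe


def simplified_rates(f, d, p1c, p1e, r1c, r1e, p2c, p2e):
    """Apparent conversion rates from Eqs. 9 and 10."""
    pc = ((p1c + r1c * p2c + (r1e - r1c) * d * f * p2c)
          / (1 + r1c * d + f * (r1e - r1c) * d))
    pe = ((p1e + r1e * p2e + (r1c - r1e) * d * (1 - f) * p2e)
          / (1 + r1e * d + (1 - f) * (r1c - r1e) * d))
    return pc, pe


# Hypothetical parameter values, chosen only for illustration.
D = 0.3
P1C, P1E = 0.10, 0.12
R1C, R1E = 0.20, 0.30
P2C, P2E = 0.05, 0.06

for share in (0.05, 0.25, 0.5):
    # Second-visit conversion depends only on the second-visit cell.
    full = general_rates(share, D, P1C, P1E, R1C, R1E,
                         p2cc=P2C, p2ce=P2E, p2ec=P2C, p2ee=P2E)
    short = simplified_rates(share, D, P1C, P1E, R1C, R1E, P2C, P2E)
    print(share, full, short)

# With equal return rates the allocation no longer matters.
print(simplified_rates(0.05, D, P1C, P1E, 0.2, 0.2, P2C, P2E))
print(simplified_rates(0.50, D, P1C, P1E, 0.2, 0.2, P2C, P2E))
```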
Expanding in $d$,
$$
p^c = p^c_1 + r^c_1 p^c_2 + \left[(r^e_1 - r^c_1)\,f\,(p^c_2 - p^c_1 - r^c_1 p^c_2) - (p^c_1 + r^c_1 p^c_2)\,r^c_1\right] d + O(d^2) \tag{11}
$$
and
$$
p^e = p^e_1 + r^e_1 p^e_2 + \left[(r^c_1 - r^e_1)(1 - f)(p^e_2 - p^e_1 - r^e_1 p^e_2) - (p^e_1 + r^e_1 p^e_2)\,r^e_1\right] d + O(d^2). \tag{12}
$$
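For readers who want the intermediate step, the expansion follows from writing each apparent rate as a ratio that is linear in $d$ in both numerator and denominator. The shorthand symbols $A$, $B$, and $C$ below are introduced here only for illustration and are shown for the control cell; the experimental cell works the same way with $c \leftrightarrow e$ and $f \to 1 - f$.

```latex
\frac{A + B d}{1 + C d} = (A + B d)\left(1 - C d + O(d^2)\right) = A + (B - A C)\, d + O(d^2)

% Control cell: read A, B, C off the numerator and denominator of p^c
A = p^c_1 + r^c_1 p^c_2, \qquad
B = (r^e_1 - r^c_1)\, f\, p^c_2, \qquad
C = r^c_1 + f\,(r^e_1 - r^c_1)

B - A C = (r^e_1 - r^c_1)\, f\, (p^c_2 - p^c_1 - r^c_1 p^c_2) - (p^c_1 + r^c_1 p^c_2)\, r^c_1
```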
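The apparent conversion rates change with allocation size according to (writing
$\mathrm{d}/\mathrm{d}f$ for the derivative with respect to $f$, to keep it distinct
from the lost-identity fraction $d$)
$$
\frac{\mathrm{d}p^c}{\mathrm{d}f} = d\,(r^e_1 - r^c_1)\,(p^c_2 - p^c_1 - r^c_1 p^c_2) + O(d^2) \tag{13}
$$
$$
\frac{\mathrm{d}p^e}{\mathrm{d}f} = d\,(r^e_1 - r^c_1)\,(p^e_2 - p^e_1 - r^e_1 p^e_2) + O(d^2), \tag{14}
$$
so that the change in the difference between conversion rates is
$$
\frac{\mathrm{d}(p^e - p^c)}{\mathrm{d}f} = d\,(r^e_1 - r^c_1)\left((p^e_2 - p^c_2) - (p^e_1 - p^c_1) - (r^e_1 p^e_2 - r^c_1 p^c_2)\right) + O(d^2). \tag{15}
$$
Dropping higher-order terms in the return rates (assuming these rates are
substantially less than 1), this simplifies to
$$
\frac{\mathrm{d}(p^e - p^c)}{\mathrm{d}f} = d\,(r^e_1 - r^c_1)\left((p^e_2 - p^c_2) - (p^e_1 - p^c_1)\right) + O\!\left(d\,(r_1)^2 p_2\right) + O(d^2). \tag{16}
$$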
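One way to build confidence in the first-order slope is to compare Eq. 15 against a finite-difference derivative of Eqs. 9 and 10. The sketch below does this for a few values of the lost-identity fraction; the function names, the step size `h`, and the parameter values are assumptions made for illustration.

```python
def simplified_rates(f, d, p1c, p1e, r1c, r1e, p2c, p2e):
    """Apparent conversion rates from Eqs. 9 and 10."""
    pc = ((p1c + r1c * p2c + (r1e - r1c) * d * f * p2c)
          / (1 + r1c * d + f * (r1e - r1c) * d))
    pe = ((p1e + r1e * p2e + (r1c - r1e) * d * (1 - f) * p2e)
          / (1 + r1e * d + (1 - f) * (r1c - r1e) * d))
    return pc, pe


def slope_eq15(d, p1c, p1e, r1c, r1e, p2c, p2e):
    """First-order slope of (p_e - p_c) with respect to f (Eq. 15, no remainder)."""
    return d * (r1e - r1c) * ((p2e - p2c) - (p1e - p1c) - (r1e * p2e - r1c * p2c))


def slope_numeric(f, d, h=1e-5, **rates):
    """Central finite difference of (p_e - p_c) with respect to f."""
    pc_hi, pe_hi = simplified_rates(f + h, d, **rates)
    pc_lo, pe_lo = simplified_rates(f - h, d, **rates)
    return ((pe_hi - pc_hi) - (pe_lo - pc_lo)) / (2 * h)


# Hypothetical parameter values, chosen only for illustration.
RATES = dict(p1c=0.10, p1e=0.12, r1c=0.20, r1e=0.30, p2c=0.05, p2e=0.06)

for lost in (0.01, 0.1, 0.3):
    approx = slope_eq15(lost, **RATES)
    exact = slope_numeric(0.5, lost, **RATES)
    print(f"d={lost:.2f}  Eq.15={approx:.3e}  finite-diff={exact:.3e}")
```

The two values agree closely for small `d` and drift apart as `d` grows, as expected from the neglected $O(d^2)$ terms.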
2 Trends
Depending on the values on the right side of Eq. 16, this effect could go either
way. In general, we expect that second-visit conversion rates are lower than
first-visit conversion rates, so differences between second-visit conversion rates
will also usually be smaller than differences between first-visit conversion rates,
$|p^e_2 - p^c_2| < |p^e_1 - p^c_1|$. This means that to estimate the direction of the
effect we can consider the simpler approximation
$$
\frac{\mathrm{d}(p^e - p^c)}{\mathrm{d}f} \sim -d\,(r^e_1 - r^c_1)(p^e_1 - p^c_1). \tag{17}
$$
Additionally, when return rates and second-visit conversion rates are small, we
expect the sign of $p^e - p^c$ to be the same as the sign of $p^e_1 - p^c_1$, so the
direction of the effect is given by
$$
\operatorname{Sign}\!\left[\frac{\mathrm{d}(p^e - p^c)}{\mathrm{d}f}\right] = -\operatorname{Sign}(r^e_1 - r^c_1)\,\operatorname{Sign}(p^e - p^c). \tag{18}
$$
This means that when the return rate is larger for unconverted visitors from
the experimental cell, the difference in conversion rates (whichever way it goes)
is increasingly overestimated as the control cell gets bigger ($f$ decreases).
Conversely, when the return rate is smaller for unconverted visitors from the
experimental cell, the difference in conversion rates is increasingly underestimated
as the control cell gets bigger. The effect is not likely to switch the sign of the
difference in conversion rates (which would lead to qualitatively wrong results)
because $f$, $d$, and $|r^e_1 - r^c_1|$ in Eq. 17 all must be less than 1.
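The trend described above can be illustrated by sweeping the experimental share in Eqs. 9 and 10 for two scenarios, one where the experimental cell has the higher return rate and one where it has the lower. The function name `measured_difference` and all parameter values are hypothetical, chosen only to make the direction of the effect visible.

```python
def measured_difference(f, d, p1c, p1e, r1c, r1e, p2c, p2e):
    """Apparent p_e - p_c computed from Eqs. 9 and 10."""
    pc = ((p1c + r1c * p2c + (r1e - r1c) * d * f * p2c)
          / (1 + r1c * d + f * (r1e - r1c) * d))
    pe = ((p1e + r1e * p2e + (r1c - r1e) * d * (1 - f) * p2e)
          / (1 + r1e * d + (1 - f) * (r1c - r1e) * d))
    return pe - pc


# Hypothetical parameter values, chosen only for illustration.
COMMON = dict(d=0.3, p1c=0.10, p1e=0.12, r1c=0.20, p2c=0.05, p2e=0.06)
SHARES = (0.5, 0.25, 0.1, 0.05)  # smaller share means a bigger control cell

for label, r1e in (("higher return rate in experiment", 0.30),
                   ("lower return rate in experiment", 0.10)):
    print(label)
    for share in SHARES:
        diff = measured_difference(share, r1e=r1e, **COMMON)
        print(f"  f={share:.2f}  measured p_e - p_c = {diff:.5f}")
```

In the first scenario the measured difference grows as `f` shrinks, and in the second it shrinks, matching the sign rule in Eq. 18 for these inputs.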
3 Should we trust same-size cells?
Given that our results depend on cell size, our intuition has been to trust the
results of A/B tests with equal-size cells, since this seems to compare the
variations on more equal footing. But should we fully trust these results? That is,
are the results with equal-size cells the same as what we would find in the ideal
experimental design where we perfectly track all visitors' identities?
If the cells are the same size ($f = 1 - f = 1/2$), the difference in apparent
conversion rates turns out to be
$$
p^e - p^c = \frac{p^e_1 - p^c_1 + r^e_1 (1 - d/2)\,p^e_2 - r^c_1 (1 - d/2)\,p^c_2 + r^c_1 d\,p^e_2/2 - r^e_1 d\,p^c_2/2}{1 + (r^e_1 + r^c_1)\,d/2}. \tag{19}
$$
Comparing this to the difference in conversion rates we would measure if we lost
no identities ($d = 0$),
$$
(p^e - p^c)_0 = p^e_1 - p^c_1 + r^e_1 p^e_2 - r^c_1 p^c_2, \tag{20}
$$
we find that the lost identities change the apparent difference in conversion rates
by
$$
p^e - p^c - (p^e - p^c)_0 = \frac{r^c_1 p^e_2 - r^e_1 p^c_2 + (r^c_1 + r^e_1)(r^c_1 p^c_2 - r^e_1 p^e_2)}{2/d + (r^e_1 + r^c_1)}. \tag{21}
$$
The right side of Eq. 21 does not equal 0 in general, so even if we use equal-size
cells, the difference in conversion rates that we measure is not the same as what
we would measure if we could perfectly keep track of everyone's identity. To
lowest order in return rates, Eq. 21 can be written
$$
p^e - p^c - (p^e - p^c)_0 = \frac{d}{2}\,(r^c_1 p^e_2 - r^e_1 p^c_2) + O(r_1^2 p_2). \tag{22}
$$
This effect is small whenever second-visit conversion rates are small (or whenever
their particular combination with return rates in Eq. 22 is small). In general we
expect that first-visit conversion rates are larger than second-visit conversion
rates, so the discrepancy from having unequal cells will usually be larger than
the discrepancy purely from losing visitors' identities. As with the discrepancy
from unequal cells, the direction of this effect can go either way.
4 Bottom line
Whenever we cannot perfectly track visitors’ identities, we must take A/B tests
with a grain of salt: measured conversion rates will be different from what we
would measure if we could perfectly track identities. Although part of this
effect—and usually the larger part—can be avoided by using same-size cells,
even A/B tests with same-size cells will not in general give accurate results
unless we can perfectly track visitors’ identities.