This is part of the Alpine ML Talk Series:
The talk is called “Frequentist inference only seems easy” and is about the theory of simple statistical inference (based on material from this article http://www.win-vector.com/blog/2014/07/frequenstist-inference-only-seems-easy/ ). The talk includes some simple dice games (I bring dice!) that really break the rote methods commonly taught as statistics. This is actually a good thing, as it gives you time and permission to work out how common statistical methods are properly derived from basic principles. This takes a little math (which I develop in the talk), but it changes some statistics from "do this" to "here is why you calculate like this.” It should appeal to people interested in the statistical and machine learning parts of data science.
Frequentist inference only seems easy By John Mount
1. Frequentist estimation only
seems easy
John Mount
Win-Vector LLC
1
Outline
First example problem: estimating the success rate of coin flips.
Second example problem: estimating the success rate of a dice
game.
Interspersed in both: an entomologist’s view of lots of heavy
calculation.
Image from “HOW TO PIN AND LABEL ADULT INSECTS”
Bambara, Blinn, http://www.ces.ncsu.edu/depts/ent/notes/
4H/insect_pinning4a.html
2 This talk is going to alternate between simple probability games
(like rolling dice) and the detailed calculations needed to bring
the reasoning forward. If you come away with two points from
this talk, remember: classic frequentist statistics is not as cut and
dried as teachers claim (so it is okay to ask questions), and
Bayesian statistics is not nearly as complicated as people make it
appear.
The point of this talk
Statistics is a polished field where many of the foundations are no
longer discussed.
A lot of the “math anxiety” felt in learning statistics is from uncertainty
about these foundations, and how they actually lead to common
practices.
We are going to discuss common simple statistical goals (correct
models, unbiasedness, low error) and how they lead to common simple
statistical procedures.
The surprises (at least for me) are:
There is more than one way to do things.
The calculations needed to justify how even simple procedures
are derived from the goals are in fact pretty involved.
3 A lot of the pain of learning is being told there is only “one
way” (when there is more than one) and that a hard step (linking
goals to procedures) is easy (when in fact it is hard). Statistics
would be easier to teach if those two things were true, but they
are not. However, not addressing these issues makes learning
statistics harder than it has to be. We are going to spend some
time on what are appropriate statistical goals, and how they lead
to common statistical procedures (instead of claiming everything
is obvious). You won't be expected to invent the math, but you
need to accept that it is in fact hard to justify common statistical
procedures without somebody having already done the math.
And I’ll be honest I am a math for math’s sake guy.
2. What you will get from this
presentation
Simple puzzles that present problems for the common rules of estimating rates.
Good for countering somebody who says “everything is easy and you just
don’t get it.”
Examples that expose strong consequences of the seemingly subtle differences
in common statistical estimation methods.
Makes understanding seemingly esoteric distinctions like Bayesianism and
frequentism much easier.
A taste of some of the really neat math used to establish common statistics.
A revival of Wald's game-theoretic style of inference (as described in Savage, "The
Foundations of Statistics").
4 You will get to roll the die, and we won't make you do the heavy
math. Aside: we have been telling people that one of the things
that makes data science easy is that large data sets allow you to avoid
some of the hard math in small sample size problems. Here we
work through some of the math. In practice you do get small
sample size issues even in large data sets due to heavy-tail-like
phenomena and when you introduce conditioning and
segmentation (themselves typical modeling steps).
First example: coin flip game
5
Why do we even care?
The coin problem is a stand-in for something that is probably
important to us: such as estimating the probability of a sale given
features and past experience: P[ sale | features,evidence ].
Being able to efficiently form good estimates that combine domain
knowledge, current features and past data is the ultimate goal of
analytics/data-science.
6
3. The coin problem
You are watching flips of a coin and want to estimate the probability
p that the coin comes up heads.
For example: "T" "T" "H" "T" "H" "T" "H" "H" "T" “T"
Easy to apply!
Sufficient statistic: 4 heads, 6 tails
Frequentist estimate of p: p ~ heads/(heads+tails) = 0.4
Done. Thanks for your time.
7 # R code
set.seed(2014)
sample = rbinom(10,1,0.5)               # ten fair coin flips: 1 = heads, 0 = tails
print(ifelse(sample>0.5,'H','T'))       # prints the sequence shown above
Wait, how did we know to do
that?
Why is it obvious h/(h+t) is the best estimate of the unknown true
value of p?
8 Fundamental problem: a mid-range probability prediction (say a
number in the range 1/6 to 5/6) is not falsifiable by a single
experiment. So: how do we know such statements actually have
empirical content? The usual answers are performance on long
sequences (frequentist), appeals to axioms of probability
(essentially additivity of disjoint events), and subjective
interpretations. Each view has some assumptions and takes
some work.
The standard easy estimate
comes from frequentism
The standard answer (this example from http://en.wikipedia.org/
wiki/Checking_whether_a_coin_is_fair ):
Estimator of true probability
The best estimator for the actual value is the estimator p = h/(h + t).
This estimator has a margin of error (E) at a particular confidence level.
Answer is correct and simple, but not good (as it lacks context,
assumptions, goals, motivation and explanation).
Stumper: without an appeal to authority how do we know to use the
estimate heads/(heads+tails)? What problem is such an estimate
solving (what criterion is it optimizing)?
Excerpt from the quoted Wikipedia page (accessed 7/21/14):
"... is small when compared with the alternative hypothesis (a biased coin). However, it is not small enough to
cause us to believe that the coin has a significant bias. Notice that this probability is slightly higher than our
presupposition of the probability that the coin was fair corresponding to the uniform prior distribution, which
was 10%. Using a prior distribution that reflects our prior knowledge of what a coin is and how it acts, the
posterior distribution would not favor the hypothesis of bias. However the number of trials in this example (10
tosses) is very small, and with more trials the choice of prior distribution would be somewhat less relevant.
Note that, with the uniform prior, the posterior probability distribution f(r | H = 7, T = 3) achieves its peak at
r = h/(h + t) = 0.7; this value is called the maximum a posteriori (MAP) estimate of r. Also with the uniform prior,
the expected value of r under the posterior distribution is (h + 1)/(h + t + 2) = 8/12 ≈ 0.667.
Using this approach, to decide the number of times the coin should be tossed, two parameters are required:
1. The confidence level, denoted by Z and given by the Z-value of a standard normal distribution
2. The maximum (acceptable) error (E) ..."
[Figure: plot of the probability density f(x | H = 7, T = 3) = 1320 x^7 (1 - x)^3 with x ranging from 0 to 1.]
9 Notation is a bit different: here tau is the unknown true value and
p is the estimate. Throughout this talk by "coin" we mean an
abstract device that always returns one of two states. Gelman
and Nolan have an interesting article "You Can Load a Die, But
You Can't Bias a Coin" http://www.stat.columbia.edu/~gelman/
research/published/diceRev2.pdf about how hard it would be to
bias an actual coin that you allow somebody else to flip (and how
useless articles testing the fairness of the new Euro were).
4. Also, there are other common
estimates
Examples:
A priori belief: p ~ 0.5 regardless of evidence.
Bayesian (Jeffreys prior) estimate: p ~ (heads+0.5)/(heads+tails
+1) = 0.4090909
Laplace smoothed estimate: p ~ (heads+1)/(heads+tails+2) =
0.4166667
Game theory minimax estimates (more on this later in this talk).
The classic frequentist estimate is not the only acceptable estimate.
10 Each of these has its merits. A prior belief has the least sampling
noise (as it ignores the data). Bayesian with Jeffreys prior very
roughly tries to maximize the amount of information captured in
the first observation. Laplace smoothing minimizes expected
square error under a uniform prior.
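As a quick check, here is a small R sketch (mine, using the 4-heads/6-tails sample from the coin example above) computing these alternative estimates side by side:
# R code (sketch)
heads <- 4; tails <- 6
c(frequentist = heads/(heads + tails),              # 0.4
  jeffreys    = (heads + 0.5)/(heads + tails + 1),  # 0.4090909
  laplace     = (heads + 1)/(heads + tails + 2))    # 0.4166667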
Each different estimate has its
own characteristic justification
From “The Cartoon Guide to Statistics”
Gonick and Smith.
11 If all of the estimates were "fully compatible" with each other
then they would all be identical. Which they clearly are not.
Notice we are discussing differences in estimates here, not
differences in significances or hypothesis tests. Also Bayesian
priors are not always subjective beliefs (Wald in particular used
an operational definition).
The standard story
There are 1 to 2 ways to do statistics: frequentism and maybe
Bayesianism.
In frequentist estimation the unknown quantity to be estimated is fixed
at a single value and the experiment is considered a repeatable event
(with different possible measurements on each possible repetition).
All probabilities are over possible repetitions of experiment with
observations changing.
In Bayesian estimation the unknown quantity to be estimated is
assumed to have non-trivial distribution and the experimental results
are considered fixed.
All probabilities are over possible values of the quantity to be
estimated. Priors talk about the assumed distribution before
measurement, posteriors talk about the distribution conditioned on
the measurements.
12 There are other differences, such as a preference for point
estimates versus full descriptions of the distribution. And these are
not the only possible models.
5. Our coin example again
I flip a coin a single time and it comes up heads- what is my best
estimate of the probability the coin comes up heads in repeated
flips?
“Classic”/naive probability: 0.5 (independent of observations/
data)
Frequentist: 1.0
Bayesian (Jeffreys prior): 0.75
13 Laws that are correct are correct in the extreme cases. (If we
have distributed 6-sided dice) Let's try this. Everybody roll your
die. If it comes up odd you win and even you lose. Okay,
somebody who won raise your hand. Each one of you, if purely
frequentist, estimates a 100% chance of winning this game (if you
stick only to data from your die). Now please put your hands
down. Everybody who did not win, how do you feel about the
estimate of 100% chance of winning?
What is the frequentist estimate
optimizing?
"Bayesian Data Analysis" 3rd Edition,Gelman, Carlin, Stern,
Dunson, Vehtari, Rubin p. 92 states that frequentist estimates are
designed to be consistent (as the sample size increases they
converge to the unknown value), efficient (they tend to minimize
loss or expected square-error), or even have asymptotic
unbiasedness (the difference in the estimate from the true value
converges to zero as the experiment size increases, even when re-scaled
by the shrinking standard error of the estimate).
If we think about it: frequentism is interpreting probabilities as limits
of rates of repeated experiments. In this form bias is an especially
bad form of error as it doesn’t average out.
14 Why not minimize L1 error? Because this doesn't always turn out
to be unbiased (or isn't always a regression).
Bayesians can allow bias. The saving idea is: don't average
estimators, but aggregate data and form a new estimate.
Frequentist concerns: bias and
efficiency (variance)
From:“The Cambridge Dictionary of Statistics” 2nd Edition, B.S. Everitt.
Bias:
An estimator \hat{\theta} for which E[\hat{\theta}] = \theta is said to be unbiased.
Efficiency:
A term applied in the context of comparing different methods of estimating
the same parameter; the estimate with the lowest variance being regarded as
the most efficient.
15 There is more than one unbiased estimate. For example a grand
average (unconditioned by features) is an unbiased estimate.
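A small R sketch (my own illustration, not from the talk) of the unbiasedness/efficiency distinction: the sample mean and the sample median are both unbiased for the center of a symmetric distribution, but the mean has the smaller variance, so it is the more efficient estimator of the two.
# R code (sketch)
set.seed(1)
est <- replicate(10000, {
  x <- rnorm(20, mean = 3)                 # samples of size 20, true mean 3
  c(mean = mean(x), median = median(x))
})
rowMeans(est)       # both are approximately 3 (unbiased)
apply(est, 1, var)  # the median has the larger variance (less efficient)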
6. A good motivation of the
frequentist estimate
Adapted from “Schaum’s Outlines Statistics” 4th Edition, Spiegel, Stephens,
pp. 204-205.
SAMPLING DISTRIBUTIONS OF MEANS
Suppose that all possible samples of size N are drawn without replacement
from a finite population of size N_p > N. If we denote the mean and standard
deviation of the sampling distribution of means by E[\hat{\mu}] and E[\hat{\sigma}] and the
population mean and standard deviation by \mu and \sigma respectively, then

E[\hat{\mu}] = \mu   and   E[\hat{\sigma}] = \frac{\sigma}{\sqrt{N}} \sqrt{\frac{N_p - N}{N_p - 1}}     (1)

If the population is infinite or if sampling is with replacement, the above
results reduce to

E[\hat{\mu}] = \mu   and   E[\hat{\sigma}] = \frac{\sigma}{\sqrt{N}}     (2)

SAMPLING DISTRIBUTIONS OF PROPORTIONS
Suppose that a population is infinite and the probability of occurrence of
an event (called its success) is p. ... We thus obtain a sampling distribution of
proportions whose mean E[\hat{p}] and standard deviation E[\hat{\sigma}] are given by

E[\hat{p}] = p   and   E[\hat{\sigma}] = \sqrt{\frac{p(1-p)}{N}}     (3)
16 A very good explanation. Unbiased views of the unknown
parameter and its variance are directly observable in the
sampling distribution. So you copy the observed values as your
estimates. But to our point: frequentism no longer seems so
simple. Also note how close this is to the Bayesian justification: build a complete
generative model with complete priors, and then you can copy
averages of what you observe.
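A small R simulation sketch (mine) of equation (3): over repeated samples the observed proportion averages out to p and has standard deviation near sqrt(p(1-p)/N), which is why copying the observed proportion is the natural frequentist estimate.
# R code (sketch)
set.seed(1)
p <- 0.4; N <- 10
phat <- replicate(100000, mean(rbinom(N, 1, p)))
mean(phat)           # close to p = 0.4
sd(phat)             # close to sqrt(p*(1-p)/N)
sqrt(p*(1 - p)/N)    # 0.1549193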
Why is the frequentist forced to
use the estimates 0 and 1?
If the frequentist estimate is to be unbiased for any unknown value
of p (in the range 0 through 1) then we must have for each such p:
\sum_{h=0}^{n} \mathrm{P}[h \mid n, p]\, e_{n,h} \;=\; \sum_{h=0}^{n} \binom{n}{h} p^h (1-p)^{n-h}\, e_{n,h} \;=\; p
The frequentist estimate for each possible outcome of seeing h-heads
in n-flips is a simultaneous planned panel of estimates e(n,h)
that must satisfy the above bias-check equations for all p.
These check conditions tend to be independent linear equations
over our planned estimates e(n,h). So the system has at most one
solution, and it turns out the solution e(n,h) = h/n works.
Insisting on unbiasedness completely determines the solution.
17 Estimates like 0 and 1 are wasteful in the sense they allow only
one-sided errors. Laplace “add one smoothing” puts estimates
between likely values (lowering expected l2 error under uniform
priors).
The check equations tend to be full-rank linear equations in
e(n,h) as the p-s generate something very much like the moment
curve (which itself is a parameterized curve generating sets of
points in general position).
The reason I am showing this is: usually frequentist inference is
described as canned procedures (avoiding triggering math anxiety)
and Bayesian methods are presented as complicated formulas. In
fact you should be as uncomfortable with frequentist methods as
you are with Bayesian methods.
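A small R sketch (my own check, not from the slides) of these check equations: the panel e(n,h) = h/n satisfies the bias condition for every p, while a smoothed panel such as (h+1)/(n+2) does not.
# R code (sketch)
bias_check <- function(e, n, p) sum(dbinom(0:n, n, p) * e)  # sum_h P[h|n,p] e(n,h)
n <- 10; ps <- c(0.1, 0.3, 0.4, 0.7, 0.9)
sapply(ps, function(p) bias_check((0:n)/n, n, p)) - ps              # all (numerically) zero: unbiased
sapply(ps, function(p) bias_check(((0:n) + 1)/(n + 2), n, p)) - ps  # all nonzero: biased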
Argh! That is a lot of painful math.
The math (turning reasonable desiderata to reasonable procedures) has
always been hiding there.
You never need to re-do the math to use the classic frequentist
inference procedures (just to derive them).
18 We really worked to get to h/(h+t) the hard way. The frequentist can't
generate an estimate for a single outcome; they must submit a panel
of estimates for every possible outcome and then check that the
panel represents a schedule of estimates that is simultaneously
unbiased for any possible p.
7. Is the frequentist solution
optimal?
It is the only unbiased solution. So it is certainly the most efficient unbiased
solution.
What if we relaxed unbiasedness? Are there more efficient solutions?
Yes: consider estimates e(1,h) = (0,1) and b(1,h) = (1/4,3/4)
Suppose loss is: loss(f,n) = E[ E[(f(n,h)-p)^2 | h ~ p,n] | p ~ P[p] ]
P[p] is an assumed prior probability on p, such as P[p] = 1/3 if p=0,1/2,1
and 0 otherwise.
Then: loss(1,b) = 0.0625 and loss(1,e) = 0.25. So loss(1,b) < loss(1,e), and
you can think of the Bayesian procedures as being more efficient.
But that isn’t fair. Insisting on a prior is adopting the Bayesian’s
assumptions as truth. Of course that makes them look better.
19
Frequentist response: you can’t
just wish-away bias
Let’s try this lower-loss Bayesian estimate b(1,h) = (0.25,0.75)
Suppose we roll 50 dice and we record wins as 1, losses as 0.
Suppose in the above experiment there were 50 of us and 8 people won.
Averaging the frequentist estimates: (8*1.0 + 42*0.0)/50 = 0.16 (not
too far from the true value 1/6 = 0.1666667).
Averaging the “improved” Bayesian estimates: (8*0.75 + 42*0.25)/50
= 0.33. Way off and most of the error is bias (not mere sampling
error).
Bayesian response: you don’t average estimates, you aggregate data
and re-estimate. So you treat the group as a single experiment with 8
wins and 42 losses. Estimate is then (8+0.5)/(50+1) = 0.1666667 (no
reason for estimate to be dead on, Bayesians got lucky that time).
20 (If they have dice they can run with this: all roll, count, and compute.)
Bayesian response: you don't average individual estimators, you collect
the data together and re-estimate.
# R code
set.seed(2014)
sample = rbinom(50,1,1/6)                        # 50 plays with true win rate 1/6
sum(sample)/length(sample)                       # average of the frequentist estimates
# [1] 0.16
sum(ifelse(sample>0.5,0.75,0.25))/length(sample) # average of the "improved" estimates
# [1] 0.33
(0.5+sum(sample))/(1+length(sample))             # aggregate the data, then re-estimate
# [1] 0.1666667
Second example: dice game
21
8. Dice are a fun example
22 Dice are pretty much designed to obey the axioms of naive/
classical probability theory (indivisible events having equal
probability). Also once you have a lot of dice it is easy to think in
terms of exchangeable repetitions of experiments (frequentist).
Given that, you will forgive us if we tilt the game towards the
Bayesians by adding some hidden state.
The dice game
A control die numbered 1 through 5 is either rolled or placed on one
of its sides.
The game die is a fair die numbered 1 through 6. When the game
die is rolled the game is a win if the number shown on the game die
is greater than the number shown on the control die.
The control die is held at the same value even when we re-roll the
game die.
Neither of the dice is ever seen by the player.
23
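To make the rules concrete, here is a small R simulation sketch of the game (the function and variable names are mine): the control value stays hidden and fixed, and the player only ever sees win/lose results from repeated rolls of the game die.
# R code (sketch)
set.seed(2014)
play_round <- function(control) sample(1:6, 1) > control  # game die must exceed the held control die
control <- 3                                   # hidden: control die held at 3
wins <- replicate(20, play_round(control))     # the player only sees these TRUE/FALSE results
mean(wins)                                     # empirical win rate; true rate is (6 - control)/6 = 1/2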
You only see the win/lose state
not the control die or the game die
24 (If we have distributed 6-sided dice) Let's play a round of this.
I'll hold the control die at 3. You all roll your 6-sided die. Okay,
everybody whose die exceeded 3 raise their hand. This time we
will group our observations to estimate the "unknown"
probability p of winning. What we are looking for is that close to
half the room (assuming we have enough people to build a large
sample, and that we don't get incredibly unlucky) have raised
their hands. From this you should be able to surmise there are
good odds the control die is set at 3, even if you don't remember
what you saw on the control die or what was on your game die.
9. Multiple plays
The control die is held at a single value and you try to learn the
odds by observing the wins/losses reported by repeated rolls of the
game die (but not ever seeing either of the dice).
25
The empirical frequentist
procedure seems off
After the first roll you are forced (by the bias check conditions) to
estimate a win-rate of 0 or 1. The win rate is always one of 1/6,
2/6, 3/6, 4/6, or 5/6. So your first estimate is always out of range.
After 5 rolls the bias equations no longer determine a unique
solution. So you can try to decrease variance without adding any
bias. But since your solution is no longer unique, you should have
less faith it is the one true solution.
26 Could try Winsorising and using 1/6 as our estimate if we lose
and 5/6 as our estimate if we win. But we saw earlier that
“tucking in” estimates doesn’t always help (it introduces a bad
bias).
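A small R sketch (my own check) of both claims: the bias-check system has one equation per possible control setting (5 of them) and one unknown per possible outcome (n+1 of them), so its rank tops out at 5; up to n = 4 unbiasedness pins down a unique panel of estimates, and from n = 5 on it no longer does.
# R code (sketch)
ps <- (6 - (1:5))/6                                   # possible win rates 5/6, 4/6, ..., 1/6
bias_matrix <- function(n) t(sapply(ps, function(p) dbinom(0:n, n, p)))
sapply(1:6, function(n) qr(bias_matrix(n))$rank)      # 2 3 4 5 5 5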
How about other estimates?
Can we find an estimator that uses criteria other than unbiasedness
without the strong assumption of knowing a favorable prior
distribution?
Remember: if we assume a prior distribution (even a so-called
uninformative prior) and the assumption turns out to be very far
off, then our estimate may be very far off (at least until we have
enough data to dominate the priors).
How about a solution that does well for the worst possible selection
of the unknown probability p?
We are not assuming a distribution on p, just that it is picked to
be worst possible for our strategy.
27
10. Leads us to a game theory
minimax solution
We want an estimate f(n,h) such that:

f(n,h) = \mathrm{argmin}_{f(n,h)}\; \max_{p \in \mathbb{R}^5,\; p \ge 0,\; \mathbf{1} \cdot p = 1}\; \sum_{c=1}^{5} p_c \sum_{k=0}^{n} \mathrm{P}[k \text{ wins} \mid n, p_c] \times \mathrm{loss}\!\left(f(n,k), \frac{6-c}{6}\right)

where loss(u,v) = (u-v)^2 or loss(u,v) = |u-v|. Here the opponent is
submitting a vector p of probabilities of setting the control die to
each of its 5 marks. The standard game-theory way to solve this is
to find an f(n,h) that works well against the opponent picking a single
state of the control die (c) after they see our complete set of
estimates. That is:

f(n,h) = \mathrm{argmin}_{f(n,h)}\; \max_{c \in \{1,\dots,5\}}\; \sum_{k=0}^{n} \mathrm{P}[k \text{ wins} \mid n, p_c] \times \mathrm{loss}\!\left(f(n,k), \frac{6-c}{6}\right)
28 In practice one would just use Bayesian methods with reasonable
priors. The reduction of one very hard form to another slightly
less-hard problem is the core theorem of game theory. Even if
you have been taught not to fear long equations, these should
look nasty (as they have a lot of quantifiers in them and
quantifiers can rapidly increase complexity).
f(n,h) is just a panel or vector of n+1 estimate choices for each n.
Also once you have things down to simple minimization you
essentially have a problem of designing numerical integration or
optimal quadrature.
Wald already systematized this
29 If you believe the control die is set by a fair roll, then we again
have a game designed to exactly match a specific generative
model (i.e. designed for Bayesian methods to win). If you believe
the die is set by an adversary, you again have a game theory
problem. Player 1 is trying to maximize risk/loss/error and
player 2 is trying to minimize risk. We model the game as both
players submitting their strategies at the same time. The
standard game theory solution is you pick a strategy so strong
that you would do no worse if your opponent peeked at it and
then altered their strategy. This is part of a minimax setup.
Wald, A. (1949). Statistical Decision Functions. Ann. Math.
Statist., 20(2):165–205.
Wald was very smart
One of his WWII ideas: armor sections of combat planes that you
never saw damaged on returning planes. Classical thinking: put
armor where you see bullet holes. Wald: put armor where you
have never seen a bullet hole (hence never seen a hit survived).
30 Wald could bring a lot of deep math to the table. Wald’s solution
allows for many different choices of loss (not just variance or L2)
and for probabilistic estimates (i.e. don’t have to return the same
estimate every time you see the same evidence, though that isn’t
really an advantage).
11. Our game
In both cases the loss function is convex, so we expect a unique
connected set of globally optimal solutions (no isolated local
minima).
For the l1-loss case where loss(u,v) = |u-v| we can solve for the
optimal f(n,k) by a linear program.
1-round l1 solution [0.3, 0.7]
2-round l1 solution [0.24, 0.5, 0.76]
For the l2-loss case where loss(u,v) = (u-v)^2 we can solve for
the optimal f(n,k) using Newton’s method.
1-round l2 solution [0.25, 0.75]
2-round l2 solution [0.21, 0.5, 0.79]
31 These solutions are profitably exploiting both the boundedness
of p (in the range 1/6 through 5/6) and the fact that p only takes
one of 5 possible values (though we obviously don’t know which).
How do we pick between l1 and l2 loss? l2 is traditional as it is
the next natural moment after the first moment (which becomes
the bias conditions). Without the bias conditions l1 loss is plausible
(and leads to things like quantile regression). l2 has some
advantages (such as the gradient structure tending to get
expectations right, hence helping enforce regression conditions
and reduce bias).
Another game
Suppose the opponent can pick any probability for a coin (they are
not limited to 1/6,2/6,3/6,4/6,5/6).
In this case we want to pick f(n,h) minimizing:
32 M(n, f(n,h)) = \max_{p \in [0,1]} \sum_{k=0}^{n} \mathrm{P}[k \text{ wins} \mid n, p] \times \mathrm{loss}(f(n,k), p)
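Here is a small R sketch (a brute-force grid search of my own, not the talk's solver) of this criterion for a single play: approximate the max over p in [0,1] on a fine grid of p values and search over estimate pairs (f(1,0), f(1,1)); it lands at (0.25, 0.75), matching the closed-form solution on the next slide.
# R code (sketch)
p_grid <- seq(0, 1, by = 0.01)
worst_loss <- function(f)   # worst case over p of the expected squared error, n = 1
  max(sapply(p_grid, function(p) (1 - p)*(f[1] - p)^2 + p*(f[2] - p)^2))
cand <- expand.grid(f0 = seq(0, 1, by = 0.01), f1 = seq(0, 1, by = 0.01))
cand[which.min(apply(cand, 1, worst_loss)), ]   # f0 = 0.25, f1 = 0.75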
The general p l2 minimax
solutions
For the l1-loss case where loss(u,v) = |u-v| we have a convex
program with a different linear constraint for each possible p. A
column-generation strategy over an LP solver handles this quite
nicely.
For the l2-loss case where loss(u,v) = (u-v)^2 the solution is:

\frac{\text{heads} + \sqrt{\text{heads}+\text{tails}}\,/\,2}{\text{heads} + \text{tails} + \sqrt{\text{heads}+\text{tails}}}
33 Savage, L. J. (1972). The Foundations of Statistics. Dover, cites
this solution as coming from Hodges, J. L., Jr. and Lehmann, E. L.
(1950). Some problems in minimax point estimation. The Annals
of Mathematical Statistics, 21(2):182–197.
See http://winvector.github.io/freq/minimax.pdf for details
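A small R sketch of this closed-form solution (the function name is mine); it reproduces the l2 minimax table shown near the end of the talk, for example (0.25, 0.75) for n = 1 and (0.207, 0.5, 0.793) for n = 2.
# R code (sketch)
l2_minimax <- function(h, n) (h + sqrt(n)/2)/(n + sqrt(n))  # h heads out of n flips
l2_minimax(0:1, 1)   # 0.2500000 0.7500000
l2_minimax(0:2, 2)   # 0.2071068 0.5000000 0.7928932
l2_minimax(0:5, 5)   # 0.1545085 0.2927051 0.4309017 0.5690983 0.7072949 0.8454915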
12. How can you solve the l2
minimax problem?
Define:

L(n, f(n,h), p) = \sum_{k=0}^{n} \mathrm{P}[k \text{ wins} \mid n, p] \times (f(n,k) - p)^2

For every n there is an f(n,h) (essentially a table of n+1 estimates) such
that L(n, f(n,h), p) = g(n) where g(n) is free of p. And further: the partial
derivatives of L(n,·,·) with respect to any of the entries of f(n,h) evaluated at
this f(n,h) are not p-free. In fact there are always p-s that allow us to
freely choose the sign of this gradient.
Enough to claim:

\mathrm{argmin}_{f(n,h)} \max_p L(n, f(n,h), p) = \mathrm{root}_{f(n,h)}\left[ L(n, f(n,h), p) - f(n,0)^2 \right]

Examples:
L(1, (1/4, 3/4), p) = 1/16
L(2, (-1/2 + \sqrt{2}/2,\; 1/2,\; -\sqrt{2}/2 + 3/2), p) = -\sqrt{2}/2 + 3/4
34 We know L(n, f(n,h), p) is convex in f(n,h), so \max_p L(n, f(n,h), p) is
also convex in f(n,h). We are not looking at the usual Karush–
Kuhn–Tucker conditions of optimality. What I think is going on is
M(n,f(n,h)) = max_p L(n,f(n,h),p) is majorized by L(,,), so we are
collecting evidence of the optimal point through p. What is
exciting is we get rid of quantifiers, making the problem much
easier.
See http://winvector.github.io/freq/explicitSolution.html and
https://github.com/WinVector/Examples/blob/master/freq/python/explicitSolution.rst
for more details.
The l2 minimax solution in a
graph
Solution of the form
L(1,(lambda,1-lambda),p).
Notice best minimax solution is
at f(1,h) = (0.25,0.75).
Notice all p-curves cross there.
Also notice if you move from
0.25, you can always find a p
that makes things worse.
This proves the solution is a
local minimum, so by convexity it
is also the global optimum.
35 So it is just a matter of checking the stated solution clears the p’s
out of L(k,,p). Leonard J. Savage gives this example on page 203
of the 1972 edition of “The Foundations of Statistics.” He
attributes it to: “Some Problems in Minimax Point Estimation” J L
Hodges and E L Lehmann, The Annals of Mathematical Statistics,
1950 vol. 21 (2) pp. 182-197.
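A quick R check (mine) that the stated solutions really do clear the p's out of L, i.e. that their risk is constant in p:
# R code (sketch)
L <- function(f, n, p) sum(dbinom(0:n, n, p) * (f - p)^2)
sapply(seq(0, 1, by = 0.1), function(p) L(c(1/4, 3/4), 1, p))
# all equal 1/16
sapply(seq(0, 1, by = 0.1), function(p) L(c(-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2), 2, p))
# all equal 3/4 - sqrt(2)/2 = 0.04289322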
A few exact l1/l2 solutions
1-round l2 solution: (1/4, 3/4) (the same as the 1-round l2 solution of the dice game)
2-round l2 solution: (-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2)
~ (0.207, 0.5, 0.793)
Not the same as the 2-round l1 solution: (0.192, 0.5, 0.808)
36 Again this game is to build a best l1 or l2 estimate for any p in
the range 0 through 1. Each estimate is biased (as they don't
agree with the traditional empirical frequentist estimate), but the
bias goes down as n goes up. Also these estimates are not
the traditional Bayesian ones, as they don't agree with anything
coming from traditional priors (notice the non-rational values).
These are related to what Wald called "logical Bayes", where the
Bayesian method is used but we don't insist on priors (instead
we solve a minimax problem, where we try to do well under the
worst-possible initial distributions).
13. Table of estimates
[Figure: for each method (Bayes (Jeffreys), Frequentist, l1 minimax, l2 minimax) the chosen estimate phi for each possible outcome h/n, plotted for n = 1 through 10 with phi running from 0 to 1; each point is labeled with its h/n.]
37 For each of the four major estimates we discussed we show the
chosen estimate phi for h heads out of n flips. In general the
frequentist estimate is outside Bayes, which is outside l1 minimax,
which is outside l2 minimax. l1 and l2 interior solutions are very close.
This is a graph of a ready-to-go decision table (a user could
forget everything up until here and just pick their phis off the
graph). Notice the frequentist solution crosses l2 minimax around
n=8. Also all solutions except l1 minimax are equally spaced
when n is held fixed. For more details see: https://github.com/
WinVector/Examples/blob/master/freq/python/freqMin.rst
Or: consider this table no easier
to use …
Frequentist
h
n 0 1 2 3 4 5
1 0.0000000 1.0000000
2 0.0000000 0.5000000 1.0000000
3 0.0000000 0.3333333 0.6666667 1.0000000
4 0.0000000 0.2500000 0.5000000 0.7500000 1.0000000
5 0.0000000 0.2000000 0.4000000 0.6000000 0.8000000 1.0000000
38 Obviously you don’t need the table for frequentist as h/(h+t) is
easy to remember.
than to use:
l2 minimax
h
n 0 1 2 3 4 5
1 0.2500000 0.7500000
2 0.2071068 0.5000000 0.7928932
3 0.1830127 0.3943376 0.6056624 0.8169873
4 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333
5 0.1545085 0.2927051 0.4309017 0.5690983 0.7072949 0.8454915
39 And the point is: depending on your goals this table might be the
one you want. However, be warned the l2 minimax adding of
sqrt(n) pseudo-observations is an uncommon procedure. You
want to check if you really want that.
14. And that is it
40
What to take away
Deriving or justifying optimal inference techniques on even simple dice games
can bring in a lot of heavy calculation. If you don’t find that worrying, then you
aren’t paying attention.
For standard situations statisticians did the heavy calculations a long time ago
and packaged up good and simple procedures (the justifications are difficult, but
you don’t have to repeat the justifications each time you apply the methods).
Unbiasedness is just one desirable property among many. If you accept it as
required, you are often forced to accept the traditional empirical frequentist
estimates as the only possible and best possible ones (not always a good thing).
Differences in Bayesian and frequentist assumptions lead not only to different
hypothesis testing paradigms (confidence intervals versus credible intervals);
they also pick different "optimal" estimates. The best answer depends on your use
case (not your sense of style).
41
Thank you
42
15. Links
! iPython notebook of most of these results/graphs:
https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst
!
More on this topic:
http://www.win-vector.com/blog/2014/07/frequenstist-inference-only-seems-easy/
http://www.win-vector.com/blog/2014/07/automatic-bias-correction-doesnt-fix-omitted-variable-bias/
!
For more information please try our blog:
http://www.win-vector.com/blog/
and our book
“Practical Data Science with R”
http://practicaldatascience.com .
!
Please contact us with comments, questions,
ideas, projects at:
jmount@win-vector.com
43 ipython notebook working through all these examples https://
github.com/WinVector/Examples/blob/master/freq/python/
freqMin.rst