Frequentist estimation only 
seems easy 
John Mount 
Win-Vector LLC 
1 
Outline 
First example problem: estimating the success rate of coin flips. 
Second example problem: estimating the success rate of a dice 
game. 
Interspersed in both: an entomologist’s view of lots of heavy 
calculation. 
Image from “HOW TO PIN AND LABEL ADULT INSECTS” 
Bambara, Blinn, http://www.ces.ncsu.edu/depts/ent/notes/ 
4H/insect_pinning4a.html 
2 This talk is going to alternate between simple probability games 
(like rolling dice) and the detailed calculations needed to bring 
the reasoning forward. If you come away with two points from 
this talk remember: classic frequentist statistics is not as cut and 
dried as teachers claim (so it is okay to ask questions), and 
Bayesian statistics is not nearly as complicated as people make it 
appear. 
The point of this talk 
Statistics is a polished field where many of the foundations are no 
longer discussed. 
A lot of the “math anxiety” felt in learning statistics is from uncertainty 
about these foundations, and how they actually lead to common 
practices. 
We are going to discuss common simple statistical goals (correct 
models, unbiasedness, low error) and how they lead to common simple 
statistical procedures. 
The surprises (at least for me) are: 
There is more than one way to do things. 
The calculations needed to justify how even simple procedures 
are derived from the goals are in fact pretty involved. 
3 A lot of the pain of learning is being told there is only “one 
way” (when there is more than one) and that a hard step (linking 
goals to procedures) is easy (when in fact it is hard). Statistics 
would be easier to teach if those two things were true, but they 
are not. However, not addressing these issues makes learning 
statistics harder than it has to be. We are going to spend some 
time on what are appropriate statistical goals, and how they lead 
to common statistical procedures (instead of claiming everything 
is obvious). You won't be expected to invent the math, but you 
need to accept that it is in fact hard to justify common statistical 
procedures without somebody having already done the math. 
And I’ll be honest I am a math for math’s sake guy.
What you will get from this 
presentation 
Simple puzzles that present problems for the common rules of estimating rates. 
Good for countering somebody who says “everything is easy and you just 
don’t get it.” 
Examples that expose strong consequences of the seemingly subtle differences 
in common statistical estimation methods. 
Makes understanding seemingly esoteric distinctions like Bayesianism and 
frequentism much easier. 
A taste of some of the really neat math used to establish common statistics. 
A revival of Wald game-theoretic style inference (as described in Savage “The 
Foundations of Statistics”). 
4 You will get to roll the die, and we won’t make you do the heavy 
math. Aside: we have been telling people that one of the things 
that makes data science easy is that large data sets allow you to avoid 
some of the hard math in small sample size problems. Here we 
work through some of that math. In practice you do get small 
sample size issues even in large data sets, due to heavy-tail-like 
phenomena and when you introduce conditioning and 
segmentation (themselves typical modeling steps). 
First example: coin flip game 
5 
Why do we even care? 
The coin problem is a stand-in for something that is probably 
important to us: such as estimating the probability of a sale given 
features and past experience: P[ sale | features,evidence ]. 
Being able to efficiently form good estimates that combine domain 
knowledge, current features and past data is the ultimate goal of 
analytics/data-science. 
6
The coin problem 
You are watching flips of a coin and want to estimate the probability 
p that the coin comes up heads. 
For example: "T" "T" "H" "T" "H" "T" "H" "H" "T" “T" 
Easy to apply! 
Sufficient statistic: 4 heads, 6 tails 
Frequentist estimate of p: p ~ heads/(heads+tails) = 0.4 
Done. Thanks for your time. 
7 # R code 
set.seed(2014)                          # make the example reproducible
sample = rbinom(10, 1, 0.5)             # 10 flips of a fair coin (1 = heads)
print(ifelse(sample > 0.5, 'H', 'T'))   # "T" "T" "H" "T" "H" "T" "H" "H" "T" "T"
Wait, how did we know to do 
that? 
Why is it obvious h/(h+t) is the best estimate of the unknown true 
value of p? 
8 Fundamental problem: a mid-range probability prediction (say a 
number in the range 1/6 to 5/6) is not falsifiable by a single 
experiment. So: how do we know such statements actually have 
empirical content? The usual answers are performance on long 
sequences (frequentist), appeals to axioms of probability 
(essentially additivity of disjoint events), and subjective 
interpretations. Each view has some assumptions and takes 
some work. 
The standard easy estimate comes from frequentism 
The standard answer (this example from http://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair ): 
... is small when compared with the alternative hypothesis (a biased coin). However, it is not small enough to 
cause us to believe that the coin has a significant bias. Notice that this probability is slightly higher than our 
presupposition of the probability that the coin was fair corresponding to the uniform prior distribution, which 
was 10%. Using a prior distribution that reflects our prior knowledge of what a coin is and how it acts, the 
posterior distribution would not favor the hypothesis of bias. However the number of trials in this example (10 
tosses) is very small, and with more trials the choice of prior distribution would be somewhat less relevant. 
Note that, with the uniform prior, the posterior probability distribution f(r | H = 7, T = 3) achieves its peak at 
r = h / (h + t) = 0.7; this value is called the maximum a posteriori (MAP) estimate of r. Also with the uniform prior, 
the expected value of r under the posterior distribution is ... 
Estimator of true probability: The best estimator for the actual value is the estimator heads/(heads+tails). 
This estimator has a margin of error (E) at a particular confidence level. 
Using this approach, to decide the number of times the coin should be tossed, two parameters are required: 
1. The confidence level, which is denoted by the confidence interval (Z) 
2. The maximum (acceptable) error (E) 
The confidence level is denoted by Z and is given by the Z-value of a standard normal distribution. ... 
(The slide also shows a plot of the probability density f(x | H = 7, T = 3) = 1320 x^7 (1 - x)^3 with x ranging from 0 to 1.) 
Answer is correct and simple, but not good (as it lacks context, 
assumptions, goals, motivation and explanation). 
Stumper: without an appeal to authority how do we know to use the 
estimate heads/(heads+tails)? What problem is such an estimate 
solving (what criterion is it optimizing)? 
9 Notation is a bit different: here tau is the unknown true value and 
p is the estimate. Throughout this talk by “coin” we mean an 
abstract device that always returns one of two states. Gelman 
and Nolan have an interesting article “You Can Load a Die, But 
You Can’t Bias a Coin” http://www.stat.columbia.edu/~gelman/research/published/diceRev2.pdf 
about how hard it would be to bias an actual coin that you allow 
somebody else to flip (and how useless articles testing the 
fairness of the new Euro were).
Also, there are other common 
estimates 
Examples: 
A priori belief: p ~ 0.5 regardless of evidence. 
Bayesian (Jeffreys prior) estimate: p ~ (heads+0.5)/(heads+tails 
+1) = 0.4090909 
Laplace smoothed estimate: p ~ (heads+1)/(heads+tails+2) = 
0.4166667 
Game theory minimax estimates (more on this later in this talk). 
The classic frequentist estimate is not the only acceptable estimate. 
10 Each of these has its merits. The a priori belief has the least sampling 
noise (as it ignores the data). The Bayesian estimate with Jeffreys prior very 
roughly tries to maximize the amount of information captured in 
the first observation. Laplace smoothing minimizes expected 
square error under a uniform prior. 
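As a quick illustration (my own sketch, not part of the original deck), here is how these three data-driven estimates work out in R for the 4 heads and 6 tails example: 
# R code (illustrative sketch) 
heads = 4 
tails = 6 
c(frequentist = heads/(heads + tails),             # 0.4000000 
  jeffreys    = (heads + 0.5)/(heads + tails + 1), # 0.4090909 
  laplace     = (heads + 1)/(heads + tails + 2))   # 0.4166667 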
Each different estimate has its 
own characteristic justification 
From “The Cartoon Guide to Statistics” 
Gonick and Smith. 
11 If all of the estimates were “fully compatible” with each other 
then they would all be identical, which they clearly are not. 
Notice we are discussing differences in estimates here, not 
differences in significances or hypothesis tests. Also Bayesian 
priors are not always subjective beliefs (Wald in particular used 
an operational definition). 
The standard story 
There are 1 to 2 ways to do statistics: frequentism and maybe 
Bayesianism. 
In frequentist estimation the unknown quantity to be estimated is fixed 
at a single value and the experiment is considered a repeatable event 
(with different possible measurements on each repetition). 
All probabilities are over possible repetitions of experiment with 
observations changing. 
In Bayesian estimation the unknown quantity to be estimated is 
assumed to have a non-trivial distribution and the experimental results 
are considered fixed. 
All probabilities are over possible values of the quantity to be 
estimated. Priors talk about the assumed distribution before 
measurement, posteriors talk about the distribution conditioned on 
the measurements. 
12 There are other differences: such as preference of point-wise 
estimates versus full descriptions of distribution. And these are 
not the only possible models.
Our coin example again 
I flip a coin a single time and it comes up heads- what is my best 
estimate of the probability the coin comes up heads in repeated 
flips? 
“Classic”/naive probability: 0.5 (independent of observations/ 
data) 
Frequentist: 1.0 
Bayesian (Jeffreys prior): 0.75 
13 Laws that are correct are correct in the extreme cases. (If we 
have distributed 6-sided dice) Let's try this. Everybody roll your 
die. If it comes up odd you win and if it comes up even you lose. Okay, 
somebody who won raise your hand. Each one of you, if purely 
frequentist, estimates a 100% chance of winning this game (if you 
stick only to data from your die). Now please put your hands 
down. Everybody who did not win, how do you feel about the 
estimate of 100% chance of winning? 
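For concreteness, the same arithmetic for the single observed head (again my own illustrative sketch): 
# R code (illustrative sketch) 
h = 1; t = 0 
c(classic     = 0.5,                     # ignores the data 
  frequentist = h/(h + t),               # 1.0 
  jeffreys    = (h + 0.5)/(h + t + 1))   # 0.75 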
What is the frequentist estimate 
optimizing? 
"Bayesian Data Analysis" 3rd Edition,Gelman, Carlin, Stern, 
Dunson, Vehtari, Rubin p. 92 states that frequentist estimates are 
designed to be consistent (as the sample size increases they 
converge to the unknown value), efficient (they tend to minimize 
loss or expected square-error), or even have asymptotic 
unbiasedness (the difference in the estimate from the true value 
converges to zero as the experiment size increases, even when re-scaled 
by the shrinking standard error of the estimate). 
If we think about it: frequentism is interpreting probabilities as limits 
of rates of repeated experiments. In this form bias is an especially 
bad form of error as it doesn’t average out. 
14 Why not minimize L1 error? Because this doesn’t always turn out 
to be unbiased (or isn’t always a regression). 
Bayesians can allow bias. The saving idea is: don't average 
estimators, but aggregate data and form a new estimate. 
Frequentist concerns: bias and 
efficiency (variance) 
From: “The Cambridge Dictionary of Statistics” 2nd Edition, B.S. Everitt. 
Bias: 
An estimator \hat{\theta} for which E[\hat{\theta}] = \theta is said to be unbiased. 
Efficiency: 
A term applied in the context of comparing different methods of estimating 
the same parameter; the estimate with the lowest variance being regarded as 
the most efficient. 
15 There is more than one unbiased estimate. For example a grand 
average (unconditioned by features) is an unbiased estimate.
A good motivation of the 
frequentist estimate 
Adapted from “Schaum’s Outlines Statistics” 4th Edition, Spiegel, Stephens, 
pp. 204-205. 
SAMPLING DISTRIBUTIONS OF MEANS 
Suppose that all possible samples of size N are drawn without replacement 
from a finite population of size Np > N. If we denote the mean and standard 
deviation of the sampling distribution of means by E[\hat{\mu}] and E[\hat{\sigma}] and the 
population mean and standard deviation by \mu and \sigma respectively, then 

E[\hat{\mu}] = \mu and E[\hat{\sigma}] = \frac{\sigma}{\sqrt{N}} \sqrt{\frac{N_p - N}{N_p - 1}}   (1) 

If the population is infinite or if sampling is with replacement, the above 
results reduce to 

E[\hat{\mu}] = \mu and E[\hat{\sigma}] = \frac{\sigma}{\sqrt{N}}   (2) 

SAMPLING DISTRIBUTIONS OF PROPORTIONS 
Suppose that a population is infinite and the probability of occurrence of 
an event (called its success) is p. ... We thus obtain a sampling distribution of 
proportions whose mean E[\hat{p}] and standard deviation E[\hat{\sigma}] are given by 

E[\hat{p}] = p and E[\hat{\sigma}] = \sqrt{\frac{p(1-p)}{N}}   (3) 
16 A very good explanation. Unbiased views of the unknown 
parameter and its variance are directly observable in the 
sampling distribution. So you copy the observed values as your 
estimates. But to our point: frequentism no longer seems so 
simple. This is also close to the Bayesian justification: build a complete 
generative model with complete priors, and then you can copy 
averages of what you observe. 
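A small simulation sketch (my own, not from the deck) of the proportion result quoted above: the sample proportions average out to p, with standard deviation close to sqrt(p(1-p)/N). 
# R code (illustrative sketch) 
set.seed(2014) 
p = 1/6                                            # true success probability 
N = 50                                             # sample size 
phat = replicate(100000, mean(rbinom(N, 1, p)))    # many sample proportions 
c(mean_phat = mean(phat), p = p)                   # unbiased: mean of phat is near p 
c(sd_phat = sd(phat), theory = sqrt(p*(1 - p)/N))  # spread is near sqrt(p(1-p)/N) 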
Why is the frequentist forced to 
use the estimates 0 and 1? 
If the frequentist estimate is to be unbiased for any unknown value 
of p (in the range 0 through 1) then we must have for each such p: 
\sum_{h=0}^{n} \text{P}[h \mid n, p] \, e_{n,h} \;=\; \sum_{h=0}^{n} \binom{n}{h} p^h (1-p)^{n-h} \, e_{n,h} \;=\; p 
The frequentist estimate for each possible outcome of seeing h-heads 
in n-flips is a simultaneous planned panel of estimates e(n,h) 
that must satisfy the above bias-check equations for all p. 
These check conditions tend to be independent linear equations 
over our planned estimates e(n,h). So the system has at most one 
solution, and it turns out that e(n,h) = h/n works. 
Insisting on unbiasedness completely determines the solution. 
17 Estimates like 0 and 1 are wasteful in the sense they allow only 
one-sided errors. Laplace “add one smoothing” puts estimates 
between likely values (lowering expected l2 error under uniform 
priors). 
The check equations tend to be full rank linear equations in 
e(n,h) as the p-s generate something very much like the moment 
curve (which itself is a parameterized curve generating sets of 
points in general position). 
The reason I am showing this is: usually frequentist inference is 
described as canned procedures (avoiding triggering math anxiety) 
and Bayesian methods are presented as complicated formulas. In 
fact you should be as uncomfortable with frequentist methods as 
you are with Bayesian methods. 
Argh! That is a lot of painful math. 
The math (turning reasonable desiderata to reasonable procedures) has 
always been hiding there. 
You never need to re-do the math to use the classic frequentist 
inference procedures (just to derive them). 
18 We really worked to get to h/(h+t) the hard way. The frequentist can't 
generate an estimate for a single outcome in isolation; they must submit a panel 
of estimates for every possible outcome and then check that the 
panel represents a schedule of estimates that are simultaneously 
unbiased for any possible p.
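A quick numerical check (my own sketch) of the bias-check equation above: the panel e(n,h) = h/n hits any p exactly. 
# R code (illustrative sketch) 
n = 10 
e = (0:n)/n                                      # the planned panel of estimates e(n,h) = h/n 
sapply(c(0.1, 1/6, 0.5, 0.9), 
       function(p) sum(dbinom(0:n, n, p) * e))   # returns 0.1, 0.1666667, 0.5, 0.9 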
Is the frequentist solution 
optimal? 
It is the only unbiased solution. So it is certainly the most efficient unbiased 
solution. 
What if we relaxed unbiasedness? Are there more efficient solutions? 
Yes: consider estimates e(1,h) = (0,1) and b(1,h) = (1/4,3/4) 
Suppose loss is: loss(f,n) = E[ E[(f(n,h)-p)^2 | h ~ p,n] | p ~ P[p] ] 
P[p] is an assumed prior probability on p, such as P[p] = 1/3 if p=0,1/2,1 
and 0 otherwise. 
Then: loss(1,b) = 0.0625 and loss(1,e) = 0.25. So loss(1,b) < loss(1,e); 
you can think of the Bayesian procedures as being more efficient. 
But that isn’t fair. Insisting on a prior is adopting the Bayesian’s 
assumptions as truth. Of course that makes them look better. 
19 
Frequentist response: you can’t 
just wish-away bias 
Let’s try this lower-loss Bayesian estimate b(1,h) = (0.25,0.75) 
Suppose we have 50 dice and we record wins as 1, losses as 0. 
Suppose in the above experiment there were 50 of us and 8 people won. 
Averaging the frequentist estimates: (8*1.0 + 42*0.0)/50 = 0.16 (not 
too far from the true value 1/6 = 0.1666667). 
Averaging the “improved” Bayesian estimates: (8*0.75 + 42*0.25)/50 
= 0.33. Way off and most of the error is bias (not mere sampling 
error). 
Bayesian response: you don’t average estimates, you aggregate data 
and re-estimate. So you treat the group as a single experiment with 8 
wins and 42 losses. Estimate is then (8+0.5)/(50+1) = 0.1666667 (no 
reason for estimate to be dead on, Bayesians got lucky that time). 
20 (If they have dice they can run with this: all roll, count, and compute.) 
Bayesian response: you don't average individual estimators, you collect 
the data together and re-estimate. 
# R code 
> set.seed(2014) 
> sample = rbinom(50, 1, 1/6)                         # 50 plays, true win probability 1/6 
> sum(sample)/length(sample)                          # frequentist estimate 
[1] 0.16 
> sum(ifelse(sample > 0.5, 0.75, 0.25))/length(sample)  # averaging the (0.25, 0.75) estimates 
[1] 0.33 
> (0.5 + sum(sample))/(1 + length(sample))              # Jeffreys estimate on the pooled data 
[1] 0.1666667 
Second example: dice game 
21
Dice are a fun example 
22 Dice are pretty much designed to obey the axioms of naive/ 
classical probability theory (indivisible events having equal 
probability). Also once you have a lot of dice it is easy to think in 
terms of exchangeable repetitions of experiments (frequentist). 
Given that, you will forgive us if we tilt the game towards the 
Bayesians by adding some hidden state. 
The dice game 
A control die numbered 1 through 5 is either rolled or placed on one 
of its sides. 
The game die is a fair die numbered 1 through 6. When the game 
die is rolled the game is a win if the number shown on the game die 
is greater than the number shown on the control die. 
The control die is held at the same value even when we re-roll the 
game die. 
Neither of the dice is ever seen by the player. 
23 
You only see the win/lose state 
not the control die or the game die 
24 (If we have distributed 6-sided dice) Let's play a round of this. 
I'll hold the control die at 3. You all roll your 6-sided die. Okay, 
everybody whose die exceeded 3 raise their hand. This time we 
will group our observations to estimate the “unknown” 
probability p of winning. What we are looking for is that close to 
half the room (assuming we have enough people to build a large 
sample, and that we don't get incredibly unlucky) have raised 
their hands. From this you should be able to surmise there are 
good odds the control die is set at 3, even if you don't remember 
what you saw on the control die or what was on your game die.
Multiple plays 
The control die is held at a single value and you try to learn the 
odds by observing the wins/losses reported by repeated rolls of the 
game die (but not ever seeing either of the dice). 
25 
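A simulation sketch of repeated plays (my own illustration; holding the control die at 3 is just an example): the observed win rate estimates the hidden win probability (6-c)/6. 
# R code (illustrative sketch) 
set.seed(2014) 
cDie = 3                                       # hidden control-die setting (1 through 5) 
gameDie = sample(1:6, 1000, replace = TRUE)    # repeated rolls of the fair game die 
wins = gameDie > cDie                          # the player only sees win/lose 
c(observed = mean(wins), true = (6 - cDie)/6)  # observed rate is near (6-c)/6 = 0.5 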
The empirical frequentist 
procedure seems off 
After the first flip you are forced (by the bias check conditions) to 
estimate a win rate of 0 or 1. The win rate is always one of 1/6, 
2/6, 3/6, 4/6, or 5/6. So your first estimate is always out of range. 
After 5 flips the bias equations no longer determine a unique 
solution. So you can try to decrease variance without adding any 
bias. But since your solution is no longer unique, you should have 
less faith it is the one true solution. 
26 Could try Winsorising and using 1/6 as our estimate if we lose 
and 5/6 as our estimate if we win. But we saw earlier that 
“tucking in” estimates doesn’t always help (it introduces a bad 
bias). 
How about other estimates? 
Can we find an estimator that uses criteria other than unbiasedness 
without the strong assumption of knowing a favorable prior 
distribution? 
Remember: if we assume a prior distribution (even a so-called 
uninformative prior) and the assumption turns out to be very far 
off, then our estimate may be very far off (at least until we have 
enough data to dominate the priors). 
How about a solution that does well for the worst possible selection 
of the unknown probability p? 
We are not assuming a distribution on p, just that it is picked to 
be worst possible for our strategy. 
27
Leads us to a game theory 
minimax solution 
We want an estimate f(n,h) such that: 

f(n,h) = \text{argmin}_{f(n,h)} \; \max_{p \in \mathbb{R}^{5},\, p \ge 0,\, 1 \cdot p = 1} \; \sum_{c=1}^{5} p_c \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p_c] \times \text{loss}\!\left( f(n,k), \frac{6-c}{6} \right) 

Where loss(u,v) = (u-v)^2 or loss(u,v) = |u-v|. Here the opponent is 
submitting a vector p of probabilities of setting the control die to 
each of its 5 marks (and p_c = (6-c)/6 is the win probability when the 
control die shows c). The standard game-theory way to solve this is 
to find an f(n,h) that works well against the opponent picking a single 
state of the control die (c) after they see our complete set of 
estimates. That is: 

f(n,h) = \text{argmin}_{f(n,h)} \; \max_{c \in \{1,\cdots,5\}} \; \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p_c] \times \text{loss}\!\left( f(n,k), \frac{6-c}{6} \right) 

28 In practice we would just use Bayesian methods with reasonable 
priors. The reduction of one very hard form to another slightly 
less-hard problem is the core theorem of game theory. Even if 
you have been taught not to fear long equations, these should 
look nasty (as they have a lot of quantifiers in them and 
quantifiers can rapidly increase complexity). 
f(n,h) is just a panel or vector of n+1 estimate choices for each n. 
Also once you have things down to a simple minimization you 
essentially have a problem of designing numerical integration or 
optimal quadrature. 
Wald already systematized this 
29 If you believe the control die is set by a fair roll, then we again 
have a game designed to exactly match a specific generative 
model (i.e. designed for Bayesian methods to win). If you believe 
the die is set by an adversary, you again have a game theory 
problem. Player 1 is trying to maximize risk/loss/error and 
player 2 is trying to minimize risk. We model the game as both 
players submitting their strategies at the same time. The 
standard game theory solution is you pick a strategy so strong 
that you would do no worse if your opponent peeked at it and 
then altered their strategy. This is part of a minimax setup. 
Wald, A. (1949). Statistical Decision Functions. Ann. Math. 
Statist., 20(2):165–205. 
Wald was very smart 
One of his WWII ideas: armor sections of combat planes that you 
never saw damaged on returning planes. Classical thinking: put 
armor where you see bullet holes. Wald: put armor where you 
have never seen a bullet hole (hence never seen a hit survived). 
30 Wald could bring a lot of deep math to the table. Wald’s solution 
allows for many different choices of loss (not just variance or L2) 
and for probabilistic estimates (i.e. don’t have to return the same 
estimate every time you see the same evidence, though that isn't 
really an advantage).
Our game 
In both cases the loss function is convex, so we expect a unique 
connected set of globally optimal solutions (no isolated local 
minima). 
For the l1-loss case where loss(u,v) = |u-v| we can solve for the 
optimal f(n,k) by a linear program. 
1-round l1 solution [0.3, 0.7] 
2-round l1 solution [0.24, 0.5, 0.76] 
For the l2-loss case where loss(u,v) = (u-v)^2 we can solve for 
the optimal f(n,k) using Newton’s method. 
1-round l2 solution [0.25, 0.75] 
2-round l2 solution [0.21, 0.5, 0.79] 
31 These solutions are profitably exploiting both the boundedness 
of p (in the range 1/6 through 5/6) and the fact that p only takes 
one of 5 possible values (though we obviously don’t know which). 
How do we pick between l1 and l2 loss? l2 is traditional as it is 
the next natural moment after the first moment (which becomes 
the bias conditions). Without the bias conditions l1 loss is plausible 
(and leads to things like quantile regression). l2 has some 
advantages (such as the gradient structure tending to get 
expectations right, hence helping enforce regression conditions 
and reduce bias). 
Another game 
Suppose the opponent can pick any probability for a coin (they are 
not limited to 1/6,2/6,3/6,4/6,5/6). 
In this case we want to pick f(n,h) minimizing: 

M(n, f(n,h)) = \max_{p \in [0,1]} \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p] \times \text{loss}( f(n,k), p ) 

32 
The general p l2 minimax 
solutions 
For the l1-loss case where loss(u,v) = |u-v| we have a convex 
program with a different linear constraint for each possible p. A 
column generating strategy over an LP solver handles this quite 
nicely. 
For the l2-loss case where loss(u,v) = (u-v)^2 the solution is: 

\frac{\text{heads} + \sqrt{\text{heads}+\text{tails}}/2}{\text{heads} + \text{tails} + \sqrt{\text{heads}+\text{tails}}} 
33 Savage, L. J. (1972). The Foundations of Statistics. Dover, cites 
this solution as coming from Hodges, J. L., Jr. and Lehmann, E. L. 
(1950). Some problems in minimax point estimation. The Annals 
of Mathematical Statistics, 21(2):182–197. 
See http://winvector.github.io/freq/minimax.pdf for details.
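A quick check (my own sketch) that this closed form reproduces the 1- and 2-round l2 minimax panels quoted elsewhere in the deck (0.25/0.75 and roughly 0.207/0.5/0.793): 
# R code (illustrative sketch) 
l2minimax = function(heads, tails) { 
  n = heads + tails 
  (heads + sqrt(n)/2)/(n + sqrt(n))   # the quoted closed-form solution 
} 
l2minimax(0:1, 1:0)   # 0.2500000 0.7500000 
l2minimax(0:2, 2:0)   # 0.2071068 0.5000000 0.7928932 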
How can you solve the l2 
minimax problem? 
Define: 

L(n, f(n,h), p) = \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p] \times (f(n,k) - p)^2 

For every n there is an f(n,h) (essentially a table of n+1 estimates) such 
that L(n, f(n,h), p) = g(n) where g(n) is free of p. And further: the partial 
derivatives of L(n, \cdot, \cdot) with respect to any of the entries of f(n,h) evaluated at 
this f(n,h) are not p-free. In fact there are always p-s that allow us to 
freely choose the sign of this gradient. 
Enough to claim: 

\text{argmin}_{f(n,h)} \max_p L(n, f(n,h), p) = \text{root}_{f(n,h)} \left[ L(n, f(n,h), p) - f(n,0)^2 \right] 

Examples: 
L(1, (1/4, 3/4), p) = 1/16 
L(2, (-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2), p) = -sqrt(2)/2 + 3/4 
34 We know L(n, f(n,h), p) is convex in f(n,h), so max_p L(n, f(n,h), p) is 
also convex in f(n,h). We are not looking at the usual Karush– 
Kuhn–Tucker conditions of optimality. What I think is going on is 
M(n, f(n,h)) = max_p L(n, f(n,h), p) is majorized by L(\cdot,\cdot,\cdot), so we are 
collecting evidence of the optimal point through p. What is 
exciting is we get rid of quantifiers, making the problem much 
easier. 
See http://winvector.github.io/freq/explicitSolution.html and 
https://github.com/WinVector/Examples/blob/master/freq/python/explicitSolution.rst for more details. 
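A numerical sketch (my own) of the equalizer property behind these examples: with the stated panels the expected square error L(n, f(n,h), p) comes out the same for every p. 
# R code (illustrative sketch) 
L = function(f, p) {                  # expected square error of panel f at true p 
  n = length(f) - 1 
  sum(dbinom(0:n, n, p) * (f - p)^2) 
} 
sapply(c(0, 0.25, 0.5, 0.9), function(p) L(c(1/4, 3/4), p)) 
# all 0.0625 = 1/16 
sapply(c(0, 0.25, 0.5, 0.9), function(p) L(c(-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2), p)) 
# all 0.04289322 = -sqrt(2)/2 + 3/4 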
The l2 minimax solution in a 
graph 
Solution of the form 
L(1,(lambda,1-lambda),p). 
Notice best minimax solution is 
at f(1,h) = (0.25,0.75). 
Notice all p-curves cross there. 
Also notice if you move from 
0.25, you can always find a p 
that makes things worse. 
This proves the solution is a 
local minimum, so by convexity it 
is also the global optimum. 
35 So it is just a matter of checking that the stated solution clears the p's 
out of L(n, f(n,h), p). Leonard J. Savage gives this example on page 203 
of the 1972 edition of “The Foundations of Statistics.” He 
attributes it to: “Some Problems in Minimax Point Estimation” J L 
Hodges and E L Lehmann, The Annals of Mathematical Statistics, 
1950 vol. 21 (2) pp. 182-197. 
A few exact l1/l2 solutions 
1-round l2 solution: (1/4, 3/4) (also the 1-round l1 solution) 
2-round l2 solution: (-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2) 
~ (0.207, 0.5, 0.793) 
Not the same as the 2-round l1 solution: (0.192, 0.5, 0.808) 
36 Again this game is to build a best l1 or l2 estimate for any p in 
the range 0 through 1. Each estimate is biased (as they don’t 
agree with the traditional empirical frequentist estimate), but the 
bias is going down as n goes up. Also these estimates are not 
the traditional Bayesian ones as they don't agree with anything 
coming from traditional priors (notice the non-rational values). 
These are related to what Wald called “logical Bayes” where the 
Bayesian method is used, but we don't insist on priors (but 
instead solve a minimax problem, where we try to do well under 
worst-possible initial distributions).
Table of estimates 
[Figure: scatter plot of the chosen estimate phi (vertical axis, 0 to 1) against n (horizontal axis, 1 to 10), with one point per possible outcome h out of n, for each of the four estimates: Bayes (Jeffreys), Frequentist, l1 minimax, and l2 minimax.] 
37 For each of the four major estimates we discussed we show the 
chosen estimate phi for h heads out of n flips. In general the 
frequentist estimate is outside Bayes, which is outside the l1 minimax, 
which is outside the l2 minimax. The l1 and l2 interior solutions are very close. 
This is a graph of a ready-to-go decision table (a user could 
forget everything up until here and just pick their phis off the 
graph). Notice the frequentist solution crosses the l2 minimax around 
n=8. Also all solutions except the l1 minimax are equally spaced 
when n is held fixed. For more details see: 
https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst 
Or: consider this table no easier 
to use … 
Frequentist (rows: n, columns: h) 
n         0         1         2         3         4         5 
1 0.0000000 1.0000000 
2 0.0000000 0.5000000 1.0000000 
3 0.0000000 0.3333333 0.6666667 1.0000000 
4 0.0000000 0.2500000 0.5000000 0.7500000 1.0000000 
5 0.0000000 0.2000000 0.4000000 0.6000000 0.8000000 1.0000000 
38 Obviously you don’t need the table for frequentist as h/(h+t) is 
easy to remember. 
than to use: 
l2 minimax (rows: n, columns: h) 
n         0         1         2         3         4         5 
1 0.2500000 0.7500000 
2 0.2071068 0.5000000 0.7928932 
3 0.1830127 0.3943376 0.6056624 0.8169873 
4 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 
5 0.1545085 0.2927051 0.4309017 0.5690983 0.7072949 0.8454915 
39 And the point is: depending on your goals this table might be the 
one you want. However, be warned the l2 minimax adding of 
sqrt(n) pseudo-observations is an uncommon procedure. You 
want to check if you really want that.
And that is it 
40 
What to take away 
Deriving or justifying optimal inference techniques on even simple dice games 
can bring in a lot of heavy calculation. If you don’t find that worrying, then you 
aren’t paying attention. 
For standard situations statisticians did the heavy calculations a long time ago 
and packaged up good and simple procedures (the justifications are difficult, but 
you don’t have to repeat the justifications each time you apply the methods). 
Unbiasedness is just one desirable property among many. If you accept it as 
required you are often forced to accept the traditional empirical frequentist 
estimates as the only possible and best possible ones (not always a good thing). 
Differences in Bayesian and frequentist assumptions lead not only to different 
hypothesis testing paradigms (confidence intervals versus credible intervals); 
they also pick different “optimal” estimates. The best answer depends on your use 
case (not your sense of style). 
41 
Thank you 
42
Links 
iPython notebook of most of these results/graphs: 
https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst 
More on this topic: 
http://www.win-vector.com/blog/2014/07/frequenstist-inference-only-seems-easy/ 
http://www.win-vector.com/blog/2014/07/automatic-bias-correction-doesnt-fix-omitted-variable-bias/ 
For more information please try our blog: 
http://www.win-vector.com/blog/ 
and our book 
“Practical Data Science with R” 
http://practicaldatascience.com . 
Please contact us with comments, questions, 
ideas, projects at: 
jmount@win-vector.com 
43 iPython notebook working through all these examples: https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst

 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

Frequentist inference only seems easy, by John Mount

  • 2. Why do we even care? The coin problem is a stand-in for something that is probably important to us, such as estimating the probability of a sale given features and past experience: P[ sale | features, evidence ]. Being able to efficiently form good estimates that combine domain knowledge, current features, and past data is the ultimate goal of analytics/data science. 6
  • 3. The coin problem
You are watching flips of a coin and want to estimate the probability p that the coin comes up heads. For example:
"T" "T" "H" "T" "H" "T" "H" "H" "T" "T"
Easy to apply! Sufficient statistic: 4 heads, 6 tails. Frequentist estimate of p: p ~ heads/(heads+tails) = 0.4. Done. Thanks for your time. 7
# R code
set.seed(2014)
sample = rbinom(10,1,0.5)
print(ifelse(sample>0.5,'H','T'))
Wait, how did we know to do that?
Why is it obvious h/(h+t) is the best estimate of the unknown true value of p? 8
Fundamental problem: a mid-range probability prediction (say a number in the range 1/6 to 5/6) is not falsifiable by a single experiment. So: how do we know such statements actually have empirical content? The usual answers are performance on long sequences (frequentist), appeals to the axioms of probability (essentially additivity of disjoint events), and subjective interpretations. Each view has some assumptions and takes some work.
The standard easy estimate comes from frequentism
The standard answer (this example is from http://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair, which plots the posterior density f(x | H = 7, T = 3) = 1320 x^7 (1 - x)^3 for x from 0 to 1): the best estimator of the actual value is heads/(heads+tails), and this estimator has a margin of error E at a particular confidence level. The answer is correct and simple, but not good, as it lacks context, assumptions, goals, motivation and explanation. Stumper: without an appeal to authority, how do we know to use the estimate heads/(heads+tails)? What problem is such an estimate solving (what criterion is it optimizing)? 9
Notation in the Wikipedia article is a bit different: there tau is the unknown true value and p is the estimate. Throughout this talk by “coin” we mean an abstract device that always returns one of two states.
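Not from the original deck: a minimal R sketch of the point estimate and the normal-approximation margin of error the Wikipedia-style answer refers to, using the 10 flips shown above and z = 1.96 for an approximate 95% confidence level.
# R code (sketch): frequentist point estimate and normal-approximation margin of error
flips <- c("T","T","H","T","H","T","H","H","T","T")
heads <- sum(flips == "H")
n     <- length(flips)
p_hat <- heads / n                              # 0.4, the h/(h+t) estimate
E     <- 1.96 * sqrt(p_hat * (1 - p_hat) / n)   # margin of error at ~95% confidence
c(estimate = p_hat, margin_of_error = E)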
Gelman and Nolan have an interesting article, “You Can Load a Die, But You Can’t Bias a Coin” (http://www.stat.columbia.edu/~gelman/research/published/diceRev2.pdf), about how hard it would be to bias an actual coin that you allow somebody else to flip (and how useless articles testing the fairness of the new Euro coin were).
  • 4. Also, there are other common estimates
Examples:
A priori belief: p ~ 0.5 regardless of evidence.
Bayesian (Jeffreys prior) estimate: p ~ (heads+0.5)/(heads+tails+1) = 0.4090909
Laplace smoothed estimate: p ~ (heads+1)/(heads+tails+2) = 0.4166667
Game theory minimax estimates (more on this later in this talk).
The classic frequentist estimate is not the only acceptable estimate (a short R sketch computing these values appears at the end of this page). 10
Each of these has its merits. The a priori belief has the least sampling noise (as it ignores the data). Bayesian estimation with the Jeffreys prior very roughly tries to maximize the amount of information captured in the first observation. Laplace smoothing minimizes expected square error under a uniform prior.
Each different estimate has its own characteristic justification
From “The Cartoon Guide to Statistics”, Gonick and Smith. 11
If all of the estimates were “fully compatible” with each other then they would all be identical, which they clearly are not. Notice we are discussing differences in estimates here, not differences in significances or hypothesis tests. Also, Bayesian priors are not always subjective beliefs (Wald in particular used an operational definition).
The standard story
There are 1 to 2 ways to do statistics: frequentism and maybe Bayesianism.
In frequentist estimation the unknown quantity to be estimated is fixed at a single value and the experiment is considered a repeatable event (with different possible measurements on each repetition). All probabilities are over possible repetitions of the experiment, with the observations changing.
In Bayesian estimation the unknown quantity to be estimated is assumed to have a non-trivial distribution and the experimental results are considered fixed. All probabilities are over possible values of the quantity to be estimated. Priors talk about the assumed distribution before measurement; posteriors talk about the distribution conditioned on the measurements. 12
There are other differences, such as preference for point-wise estimates versus full descriptions of the distribution. And these are not the only possible models.
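A minimal R sketch (mine, not the deck's) computing the competing point estimates listed at the top of this page for the 4-heads/6-tails example, using only the closed-form formulas quoted on the slide.
# R code (sketch): the competing point estimates for 4 heads, 6 tails
heads <- 4; tails <- 6
c(a_priori    = 0.5,
  frequentist = heads / (heads + tails),              # 0.4
  jeffreys    = (heads + 0.5) / (heads + tails + 1),  # 0.4090909
  laplace     = (heads + 1) / (heads + tails + 2))    # 0.4166667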
  • 5. Our coin example again
I flip a coin a single time and it comes up heads. What is my best estimate of the probability the coin comes up heads in repeated flips?
“Classic”/naive probability: 0.5 (independent of observations/data)
Frequentist: 1.0
Bayesian (Jeffreys prior): 0.75 13
Laws that are correct are correct in the extreme cases. (If we have distributed 6-sided dice:) Let's try this. Everybody roll your die. If it comes up odd you win and if it comes up even you lose. Okay, somebody who won raise your hand. Each one of you, if purely frequentist, estimates a 100% chance of winning this game (if you stick only to data from your own die). Now please put your hands down. Everybody who did not win: how do you feel about the estimate of a 100% chance of winning?
What is the frequentist estimate optimizing?
“Bayesian Data Analysis” 3rd Edition, Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, p. 92 states that frequentist estimates are designed to be consistent (as the sample size increases they converge to the unknown value), efficient (they tend to minimize loss or expected square error), or even asymptotically unbiased (the difference between the estimate and the true value converges to zero as the experiment size increases, even when re-scaled by the shrinking standard error of the estimate). If we think about it: frequentism interprets probabilities as limits of rates of repeated experiments. In this form bias is an especially bad form of error, as it doesn't average out. 14
Why not minimize L1 error? Because this doesn't always turn out to be unbiased (or isn't always a regression). Bayesians can allow bias. The saving idea is: don't average estimators, but aggregate data and form a new estimate.
Frequentist concerns: bias and efficiency (variance)
From “The Cambridge Dictionary of Statistics” 2nd Edition, B.S. Everitt.
Bias: an estimator \hat{\theta} for which E[\hat{\theta}] = \theta is said to be unbiased.
Efficiency: a term applied in the context of comparing different methods of estimating the same parameter; the estimate with the lowest variance is regarded as the most efficient. 15
There is more than one unbiased estimate. For example, a grand average (unconditioned by features) is an unbiased estimate.
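A small simulation sketch, not from the deck, of the bias/efficiency trade-off just defined: over repeated samples at a fixed true p, h/n comes out unbiased but with higher variance than the Laplace-smoothed estimate, which is biased toward 1/2 but has lower variance. The sample size and repetition count are arbitrary illustration choices.
# R code (sketch): bias versus variance of two estimators at a fixed true p
set.seed(2014)
p_true <- 1/6
n      <- 10
reps   <- 100000
h       <- rbinom(reps, n, p_true)       # heads counts from many repeated experiments
freq    <- h / n                         # unbiased estimator
laplace <- (h + 1) / (n + 2)             # biased, lower-variance estimator
c(freq_bias = mean(freq) - p_true,    freq_var = var(freq),
  lap_bias  = mean(laplace) - p_true, lap_var  = var(laplace))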
  • 6. A good motivation of the frequentist estimate
Adapted from “Schaum's Outlines: Statistics” 4th Edition, Spiegel, Stephens, pp. 204-205.
SAMPLING DISTRIBUTIONS OF MEANS. Suppose that all possible samples of size N are drawn without replacement from a finite population of size N_p > N. If we denote the mean and standard deviation of the sampling distribution of means by E[\hat{\mu}] and \sigma_{\hat{\mu}}, and the population mean and standard deviation by \mu and \sigma respectively, then
E[\hat{\mu}] = \mu and \sigma_{\hat{\mu}} = \frac{\sigma}{\sqrt{N}} \sqrt{\frac{N_p - N}{N_p - 1}}   (1)
If the population is infinite or if sampling is with replacement, the above results reduce to
E[\hat{\mu}] = \mu and \sigma_{\hat{\mu}} = \frac{\sigma}{\sqrt{N}}   (2)
SAMPLING DISTRIBUTIONS OF PROPORTIONS. Suppose that a population is infinite and the probability of occurrence of an event (called its success) is p. ... We thus obtain a sampling distribution of proportions whose mean E[\hat{p}] and standard deviation \sigma_{\hat{p}} are given by
E[\hat{p}] = p and \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{N}}   (3)  16
A very good explanation. Unbiased views of the unknown parameter and its variance are directly observable in the sampling distribution, so you copy the observed values as your estimates. But to our point: frequentism no longer seems so simple. Also, this is close to the Bayesian justification: build a complete generative model with complete priors, and then you can copy averages of what you observe.
Why is the frequentist forced to use the estimates 0 and 1?
If the frequentist estimate is to be unbiased for any unknown value of p (in the range 0 through 1) then we must have, for each such p:
\sum_{h=0}^{n} \text{P}[h \mid n, p] \, e_{n,h} = \sum_{h=0}^{n} \binom{n}{h} p^h (1-p)^{n-h} \, e_{n,h} = p
The frequentist estimate for each possible outcome of seeing h heads in n flips is a simultaneously planned panel of estimates e(n,h) that must satisfy the above bias-check equations for all p. These check conditions tend to be independent linear equations over our planned estimates e(n,h). So the system has at most one solution, and it turns out the solution e(n,h) = h/n works. Insisting on unbiasedness completely determines the solution. 17
Estimates like 0 and 1 are wasteful in the sense that they allow only one-sided errors. Laplace “add one” smoothing puts estimates between likely values (lowering expected l2 error under uniform priors). The check equations tend to be full-rank linear equations in e(n,h), as the p's generate something very much like the moment curve (which itself is a parameterized curve generating sets of points in general position). The reason I am showing this is: usually frequentist inference is described as canned procedures (avoiding triggering math anxiety) and Bayesian methods are presented as complicated formulas. In fact you should be as uncomfortable with frequentist methods as you are with Bayesian methods.
Argh! That is a lot of painful math.
The math (turning reasonable desiderata into reasonable procedures) has always been hiding there. You never need to re-do the math to use the classic frequentist inference procedures (just to derive them). 18
We really worked to get h/(h+t) the hard way. The frequentist can't generate an estimate for a single outcome; they must submit a panel of estimates for every possible outcome and then check that the panel represents a schedule of estimates that are simultaneously unbiased for any possible p.
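A numeric sketch, mine, of the bias-check argument above for n = 3: the unbiasedness conditions at a few distinct p values form a full-rank linear system in the panel e(n,h), and its unique solution is h/n.
# R code (sketch): unbiasedness conditions pin down e(n,h) = h/n (shown for n = 3)
n  <- 3
ps <- c(0.2, 0.4, 0.6, 0.8)                         # any n+1 distinct p values in (0,1)
A  <- t(sapply(ps, function(p) dbinom(0:n, n, p)))  # A[i,j] = P[h = j-1 | n, p_i]
e  <- solve(A, ps)                                  # solve sum_h P[h|n,p] e_h = p for all p_i
round(e, 10)                                        # 0, 1/3, 2/3, 1  (i.e., h/n)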
  • 7. Is the frequentist solution optimal?
It is the only unbiased solution, so it is certainly the most efficient unbiased solution. What if we relaxed unbiasedness? Are there more efficient solutions?
Yes: consider the estimates e(1,h) = (0,1) and b(1,h) = (1/4,3/4).
Suppose loss is: loss(f,n) = E[ E[(f(n,h)-p)^2 | h ~ p,n] | p ~ P[p] ], where P[p] is an assumed prior probability on p, such as P[p] = 1/3 if p = 0, 1/2, 1 and 0 otherwise.
Then: loss(1,b) = 0.0625 and loss(1,e) = 0.25. Since loss(1,b) < loss(1,e), you can think of the Bayesian procedures as being more efficient. But that isn't fair: insisting on a prior is adopting the Bayesian's assumptions as truth. Of course that makes them look better. 19
Frequentist response: you can't just wish away bias
Let's try this lower-loss Bayesian estimate b(1,h) = (0.25,0.75). Suppose we have 50 dice and we record wins as 1 and losses as 0, and suppose in the above experiment there were 50 of us and 8 people won.
Averaging the frequentist estimates: (8*1.0 + 42*0.0)/50 = 0.16 (not too far from the true value 1/6 = 0.1666667).
Averaging the “improved” Bayesian estimates: (8*0.75 + 42*0.25)/50 = 0.33. Way off, and most of the error is bias (not mere sampling error).
Bayesian response: you don't average estimates, you aggregate data and re-estimate. So you treat the group as a single experiment with 8 wins and 42 losses. The estimate is then (8+0.5)/(50+1) = 0.1666667 (there is no reason for the estimate to be dead on; the Bayesians got lucky that time). 20
(If they have dice they can run with this: all roll, count and compute.) Bayesian response: you don't average individual estimators, you collect the data.
# R code
set.seed(2014)
sample = rbinom(50,1,1/6)
sum(sample)/length(sample)
# [1] 0.16
sum(ifelse(sample>0.5,0.75,0.25))/length(sample)
# [1] 0.33
(0.5+sum(sample))/(1+length(sample))
# [1] 0.1666667
Second example: dice game 21
  • 8. Dice are a fun example 22
Dice are pretty much designed to obey the axioms of naive/classical probability theory (indivisible events having equal probability). Also, once you have a lot of dice it is easy to think in terms of exchangeable repetitions of experiments (frequentist). Given that, you will forgive us if we tilt the game towards the Bayesians by adding some hidden state.
The dice game
A control die numbered 1 through 5 is either rolled or placed on one of its sides. The game die is a fair die numbered 1 through 6. When the game die is rolled, the game is a win if the number shown on the game die is greater than the number shown on the control die. The control die is held at the same value even when we re-roll the game die. Neither of the dice is ever seen by the player. 23
You only see the win/lose state, not the control die or the game die. 24
(If we have distributed 6-sided dice:) Let's play a round of this. I'll hold the control die at 3. You all roll your 6-sided die. Okay, everybody whose die exceeded 3 raise their hands. This time we will group our observations to estimate the “unknown” probability p of winning. What we are looking for is that close to half the room (assuming we have enough people to build a large sample, and that we don't get incredibly unlucky) have raised their hands. From this you should be able to surmise there are good odds the control die is set at 3, even if you don't remember what you saw on the control die or what was on your game die.
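A simulation sketch, not from the deck, of the game as described: with the control die held at c the win probability is (6 - c)/6, so holding it at 3 gives about half the room a win. The number of simulated players is an arbitrary choice; the helper name play is mine.
# R code (sketch): simulate the dice game with the control die held at a fixed value
set.seed(2014)
play <- function(ctrl, n_players = 10000) {
  game <- sample(1:6, n_players, replace = TRUE)  # each player rolls the game die once
  mean(game > ctrl)                               # observed win rate
}
sapply(1:5, play)    # close to (6 - c)/6 = 5/6, 4/6, 3/6, 2/6, 1/6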
  • 9. Multiple plays
The control die is held at a single value and you try to learn the odds by observing the wins/losses reported by repeated rolls of the game die (but never seeing either of the dice). 25
The empirical frequentist procedure seems off
After the first play you are forced (by the bias-check conditions) to estimate a win rate of 0 or 1. The win rate is always one of 1/6, 2/6, 3/6, 4/6, or 5/6. So your first estimate is always out of range. After 5 plays the bias equations no longer determine a unique solution. So you can try to decrease variance without adding any bias. But since your solution is no longer unique, you should have less faith that it is the one true solution. 26
We could try Winsorising: using 1/6 as our estimate if we lose and 5/6 as our estimate if we win. But we saw earlier that “tucking in” estimates doesn't always help (it introduces a bad bias).
How about other estimates?
Can we find an estimator that uses criteria other than unbiasedness, without the strong assumption of knowing a favorable prior distribution? Remember: if we assume a prior distribution (even a so-called uninformative prior) and the assumption turns out to be very far off, then our estimate may be very far off (at least until we have enough data to dominate the prior). How about a solution that does well for the worst possible selection of the unknown probability p? We are not assuming a distribution on p, just that it is picked to be worst possible for our strategy. 27
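A numeric check, mine, of the underdetermination claim above: with p restricted to the 5 dice-game win rates there are only 5 bias equations, so for n = 5 plays the 6 unknowns e(5,h) are not pinned down (the constraint matrix has rank 5).
# R code (sketch): rank of the bias-check equations when p is one of the 5 dice-game rates
n  <- 5
ps <- (6 - (1:5)) / 6                                # possible win rates 5/6, 4/6, 3/6, 2/6, 1/6
A  <- t(sapply(ps, function(p) dbinom(0:n, n, p)))   # 5 equations in n+1 = 6 unknowns
qr(A)$rank                                           # 5 < 6: the unbiased panel e(5,h) is not unique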
  • 10. Leads us to a game theory minimax solution
We want an estimate f(n,h) such that:
f(n,h) = \text{argmin}_{f(n,\cdot)} \max_{p \in \mathbb{R}^{5},\ p \ge 0,\ 1 \cdot p = 1} \sum_{c=1}^{5} p_c \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p_c] \times \text{loss}\left( f(n,k), \frac{6-c}{6} \right)
where loss(u,v) = (u-v)^2 or loss(u,v) = |u-v|. Here the opponent is submitting a vector p of probabilities of setting the control die to each of its 5 marks. The standard game-theory way to solve this is to find an f(n,h) that works well against the opponent picking a single state of the control die (c) after they see our complete set of estimates. That is:
f(n,h) = \text{argmin}_{f(n,\cdot)} \max_{c \in \{1,\cdots,5\}} \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p_c] \times \text{loss}\left( f(n,k), \frac{6-c}{6} \right)  28
In practice we would just use Bayesian methods with reasonable priors. The reduction of one very hard form to another slightly less-hard problem is the core theorem of game theory. Even if you have been taught not to fear long equations, these should look nasty (they have a lot of quantifiers in them, and quantifiers can rapidly increase complexity). f(n,h) is just a panel or vector of n+1 estimate choices for each n. Also, once you have things down to a simple minimization you essentially have a problem of designing numerical integration or optimal quadrature.
Wald already systematized this 29
If you believe the control die is set by a fair roll, then we again have a game designed to exactly match a specific generative model (i.e. designed for Bayesian methods to win). If you believe the die is set by an adversary, you again have a game theory problem. Player 1 is trying to maximize risk/loss/error and player 2 is trying to minimize risk. We model the game as both players submitting their strategies at the same time. The standard game theory solution is to pick a strategy so strong that you would do no worse if your opponent peeked at it and then altered their strategy. This is part of a minimax setup.
Wald, A. (1949). Statistical Decision Functions. Ann. Math. Statist., 20(2):165–205.
Wald was very smart
One of his WWII ideas: armor the sections of combat planes that you never saw damaged on returning planes. Classical thinking: put armor where you see bullet holes. Wald: put armor where you have never seen a bullet hole (hence never seen a hit survived). 30
Wald could bring a lot of deep math to the table. Wald's solution allows for many different choices of loss (not just variance or L2) and for probabilistic estimates (i.e. you don't have to return the same estimate every time you see the same evidence, though that isn't really an advantage).
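A brute-force sketch, mine, of the second formulation above for a single play (n = 1) under squared-error loss: grid-search the estimate pair (f(1,0), f(1,1)) against the worst of the 5 possible control-die settings. It should land near the (0.25, 0.75) solution quoted on the next page.
# R code (sketch): grid search for the 1-play l2 minimax estimate in the dice game
ps <- (6 - (1:5)) / 6                 # win rates for control die c = 1..5
worst_loss <- function(f0, f1) {
  max(sapply(ps, function(p) (1 - p) * (f0 - p)^2 + p * (f1 - p)^2))
}
grid <- expand.grid(f0 = seq(0, 1, by = 0.005), f1 = seq(0, 1, by = 0.005))
grid$risk <- mapply(worst_loss, grid$f0, grid$f1)
grid[which.min(grid$risk), ]          # approximately f0 = 0.25, f1 = 0.75, risk = 1/16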
  • 11. Our game
In both cases the loss function is convex, so we expect a unique connected set of globally optimal solutions (no isolated local minima).
For the l1-loss case, where loss(u,v) = |u-v|, we can solve for the optimal f(n,k) by a linear program. 1-round l1 solution: [0.3, 0.7]. 2-round l1 solution: [0.24, 0.5, 0.76].
For the l2-loss case, where loss(u,v) = (u-v)^2, we can solve for the optimal f(n,k) using Newton's method. 1-round l2 solution: [0.25, 0.75]. 2-round l2 solution: [0.21, 0.5, 0.79]. 31
These solutions are profitably exploiting both the boundedness of p (in the range 1/6 through 5/6) and the fact that p only takes one of 5 possible values (though we obviously don't know which). How do we pick between l1 and l2 loss? l2 is traditional, as it is the next natural moment after the first moment (which becomes the bias conditions). Without the bias conditions, l1 loss is plausible (and leads to things like quantile regression). l2 has some advantages (such as the gradient structure tending to get expectations right, hence helping enforce regression conditions and reduce bias).
Another game
Suppose the opponent can pick any probability for a coin (they are not limited to 1/6, 2/6, 3/6, 4/6, 5/6). In this case we want to pick f(n,h) minimizing:
M(n, f(n,h)) = \max_{p \in [0,1]} \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p] \times \text{loss}( f(n,k), p )  32
The general-p l2 minimax solutions
For the l1-loss case, where loss(u,v) = |u-v|, we have a convex program with a different linear constraint for each possible p. A column-generating strategy over an LP solver handles this quite nicely.
For the l2-loss case, where loss(u,v) = (u-v)^2, the solution is:
\frac{\text{heads} + \sqrt{\text{heads} + \text{tails}}/2}{\text{heads} + \text{tails} + \sqrt{\text{heads} + \text{tails}}}  33
Savage, L. J. (1972). The Foundations of Statistics. Dover, cites this solution as coming from Hodges, J. L., Jr. and Lehmann, E. L. (1950). Some problems in minimax point estimation. The Annals of Mathematical Statistics, 21(2):182–197. See http://winvector.github.io/freq/minimax.pdf for details.
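A quick numeric check, mine rather than the deck's, of the closed-form general-p l2 minimax estimator just quoted: its expected squared error is constant over p, which is the equalizer property a minimax rule should have here. The constant-risk formula in the last line is my algebra, shown only for comparison.
# R code (sketch): the sqrt(n) pseudo-observation estimator and its (constant) risk over p
minimax_est <- function(heads, n) (heads + sqrt(n)/2) / (n + sqrt(n))
risk <- function(p, n) {
  h <- 0:n
  sum(dbinom(h, n, p) * (minimax_est(h, n) - p)^2)
}
n <- 5
round(sapply(seq(0, 1, by = 0.1), risk, n = n), 8)   # the same value for every p
1 / (4 * (sqrt(n) + 1)^2)                            # algebra gives this constant risk level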
  • 12. How can you solve the l2 minimax problem?
Define:
L(n, f(n,h), p) = \sum_{k=0}^{n} \text{P}[k \text{ wins} \mid n, p] \times ( f(n,k) - p )^2
For every n there is an f(n,h) (essentially a table of n+1 estimates) such that L(n, f(n,h), p) = g(n), where g(n) is free of p. And further: the partial derivatives of L(n,·,·) with respect to the entries of f(n,h), evaluated at this f(n,h), are not p-free. In fact there are always p's that allow us to freely choose the sign of this gradient. Enough to claim:
\text{argmin}_{f(n,h)} \max_p L(n, f(n,h), p) = \text{root}_{f(n,h)} \left( L(n, f(n,h), p) - f(n,0)^2 \right)
Examples: L(1, (1/4, 3/4), p) = 1/16 and L(2, (-1/2 + \sqrt{2}/2,\ 1/2,\ -\sqrt{2}/2 + 3/2), p) = 3/4 - \sqrt{2}/2. 34
We know L(n, f(n,h), p) is convex in f(n,h), so max_p L(n, f(n,h), p) is also convex in f(n,h). We are not looking at the usual Karush–Kuhn–Tucker conditions of optimality. What I think is going on is that M(n, f(n,h)) = max_p L(n, f(n,h), p) is majorized by L(·,·,·), so we are collecting evidence of the optimal point through p. What is exciting is that we get rid of quantifiers, making the problem much easier. See http://winvector.github.io/freq/explicitSolution.html and https://github.com/WinVector/Examples/blob/master/freq/python/explicitSolution.rst for more details.
The l2 minimax solution in a graph
Plot of L(1, (lambda, 1-lambda), p) as a function of lambda for several p. Notice the best minimax solution is at f(1,h) = (0.25, 0.75). Notice all p-curves cross there. Also notice that if you move away from 0.25, you can always find a p that makes things worse. This proves the solution is a local minimum, so by convexity it is also the global optimum. 35
So it is just a matter of checking that the stated solution clears the p's out of L(k,·,p). Leonard J. Savage gives this example on page 203 of the 1972 edition of “The Foundations of Statistics.” He attributes it to: “Some Problems in Minimax Point Estimation”, J. L. Hodges and E. L. Lehmann, The Annals of Mathematical Statistics, 1950, vol. 21 (2), pp. 182-197.
A few exact l1/l2 solutions
1-round l2 solution: (1/4, 3/4) (the same as the 1-round l2 solution of the dice game above).
2-round l2 solution: (-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2) ~ (0.207, 0.5, 0.793). Not the same as the 2-round l1 solution: (0.192, 0.5, 0.808). 36
Again, this game is to build the best l1 or l2 estimate for any p in the range 0 through 1. Each estimate is biased (as they don't agree with the traditional empirical frequentist estimate), but the bias goes down as n goes up. Also, these estimates are not the traditional Bayesian ones, as they don't agree with anything coming from traditional priors (notice the non-rational values). These are related to what Wald called “logical Bayes”, where the Bayesian method is used but we don't insist on priors (instead we solve a minimax problem, where we try to do well under the worst possible initial distributions).
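A short check, mine, of the two worked examples above: evaluating L numerically at the stated solutions shows the p-dependence really does cancel.
# R code (sketch): confirm the stated solutions make L(n, f, p) constant in p
L <- function(n, f, p) sum(dbinom(0:n, n, p) * (f - p)^2)
p_grid <- seq(0, 1, by = 0.1)
round(sapply(p_grid, function(p) L(1, c(1/4, 3/4), p)), 8)   # all equal to 1/16
f2 <- c(-1/2 + sqrt(2)/2, 1/2, -sqrt(2)/2 + 3/2)
round(sapply(p_grid, function(p) L(2, f2, p)), 8)            # all equal to 3/4 - sqrt(2)/2
3/4 - sqrt(2)/2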
  • 13. Table of estimates
[Graph: for each of the four estimates (Bayes (Jeffreys), Frequentist, l1 minimax, l2 minimax) the chosen estimate phi is plotted against n = 1..10 for each possible number of heads h.] 37
For each of the four major estimates we discussed we show the chosen estimate phi for h heads out of n flips. In general the frequentist estimate is outside Bayes, which is outside l1 minimax, which is outside l2 minimax; the l1 and l2 interior solutions are very close. This is a graph of a ready-to-go decision table (a user could forget everything up until here and just pick their phis off the graph). Notice the frequentist solution crosses l2 minimax around n=8. Also, all solutions except l1 minimax are equally spaced when n is held fixed. For more details see: https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst
Or: consider this table no easier to use ...
Frequentist (rows n, columns h = 0..n):
n=1: 0.0000000 1.0000000
n=2: 0.0000000 0.5000000 1.0000000
n=3: 0.0000000 0.3333333 0.6666667 1.0000000
n=4: 0.0000000 0.2500000 0.5000000 0.7500000 1.0000000
n=5: 0.0000000 0.2000000 0.4000000 0.6000000 0.8000000 1.0000000  38
Obviously you don't need the table for the frequentist estimate, as h/(h+t) is easy to remember.
... than to use:
l2 minimax (rows n, columns h = 0..n):
n=1: 0.2500000 0.7500000
n=2: 0.2071068 0.5000000 0.7928932
n=3: 0.1830127 0.3943376 0.6056624 0.8169873
n=4: 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333
n=5: 0.1545085 0.2927051 0.4309017 0.5690983 0.7072949 0.8454915  39
And the point is: depending on your goals this table might be the one you want. However, be warned: the l2 minimax adding of sqrt(n) pseudo-observations is an uncommon procedure. You want to check that you really want that.
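A tiny sketch, mine, that regenerates the two tables above from their formulas (h/n and the sqrt(n) pseudo-observation rule); the helper name est_table is an invention for this sketch.
# R code (sketch): regenerate the frequentist and l2 minimax decision tables for n = 1..5
est_table <- function(est) {
  for (n in 1:5) {
    h <- 0:n
    cat("n=", n, ": ", paste(sprintf("%.7f", est(h, n)), collapse = " "), "\n", sep = "")
  }
}
est_table(function(h, n) h / n)                             # frequentist table
est_table(function(h, n) (h + sqrt(n)/2) / (n + sqrt(n)))   # l2 minimax table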
  • 14. And that is it 40
What to take away
Deriving or justifying optimal inference techniques on even simple dice games can bring in a lot of heavy calculation. If you don't find that worrying, then you aren't paying attention.
For standard situations statisticians did the heavy calculations a long time ago and packaged up good and simple procedures (the justifications are difficult, but you don't have to repeat the justifications each time you apply the methods).
Unbiasedness is just one desirable property among many. If you accept that it is required, you are often forced to accept the traditional empirical frequentist estimates as the only possible and best possible estimates (not always a good thing).
Differences in Bayesian and frequentist assumptions lead not only to different hypothesis testing paradigms (confidence intervals versus credible intervals); they also pick different “optimal” estimates. The best answer depends on your use case (not your sense of style). 41
Thank you 42
  • 15. Links
iPython notebook of most of these results/graphs: https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst
More on this topic:
http://www.win-vector.com/blog/2014/07/frequenstist-inference-only-seems-easy/
http://www.win-vector.com/blog/2014/07/automatic-bias-correction-doesnt-fix-omitted-variable-bias/
For more information please try our blog: http://www.win-vector.com/blog/ and our book “Practical Data Science with R”: http://practicaldatascience.com
Please contact us with comments, questions, ideas, and projects at: jmount@win-vector.com 43
iPython notebook working through all these examples: https://github.com/WinVector/Examples/blob/master/freq/python/freqMin.rst