Statistical sampling has established itself in all facets of our lives, from physics to medical research to presidential elections. Still, when it comes to Big Data we most frequently favor the brute-force approach and attempt to process our entire data set: it's all or nothing. However, we don't really need to count every single grain of sand on the beach to conclude that it will be a great holiday destination. When we analyze our business performance, do we compare every digit of last week's 365,514,134 visitors to this week's 366,364,615, or do we want to know that one is 0.2% bigger than the other? Or maybe we can say there is no difference? Properly posing questions to Big Data is the key to reducing the overall cost of data systems and getting information faster, while reserving brute-force crunching for tasks that really have to count every penny and every drop in the ocean. We will present sampling methodologies useful for Hadoop environments, show how to structure the data properly for export to non-Hadoop systems, and discuss how to establish the proper sampling rate for different tasks, with emphasis on digital marketing and on variable-rate sampling for tracking valuable needles in unimportant haystacks.
Big Data Sampling
1. Big Data Sampling
How to make all of your data useful again
Mikhail Petrenko, Sr. Data Architect, Adobe
mikhail193@gmail.com
2. Agenda
What is sampling?
Why don’t we use Big Data sampling more?
Why sampling is a good idea
When sampling is a bad idea
Accuracy of sampled reports
Variable rate sampling
5. Why we don’t sample
Results are not accurate
It takes time and effort to implement
It is hard to maintain
We can perform all the analysis we want, just give us more hardware.
12. How Accurate are We?
Profits +/- 30%
EPS + 40%
Sales forecast +/- 15%-20% considered pretty accurate.
13. How big of a sample?
1000 EPS Analysts
30% accuracy
How many do we need to pay to get the same accuracy?
Just 18
14. How big of a sample?
100,000 site visitors
How many do we need to analyze to get a yes/no answer accurate to +/- 1%?
At 99% confidence: just 14,267 (1/7)
At 95% confidence: 8,763 (1/12)
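The slide does not say how these numbers were derived. As a hedged sketch, the standard sample-size formula for a proportion with a finite population correction reproduces both figures, assuming the worst case p = 0.5 and z-scores of 2.58 (99%) and 1.96 (95%). The function name `sample_size` and its parameters are illustrative, not from the talk.

```python
import math

def sample_size(population, margin, z, p=0.5):
    """Sample size for estimating a proportion, with finite population correction.

    population: total number of items (e.g. site visitors)
    margin:     acceptable error as a fraction (0.01 means +/- 1%)
    z:          z-score for the desired confidence level
    p:          assumed proportion; 0.5 is the most conservative choice
    """
    # Size needed for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it because the population is finite.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

n99 = sample_size(100_000, 0.01, z=2.58)  # 99% confidence
n95 = sample_size(100_000, 0.01, z=1.96)  # 95% confidence
print(n99, n95)
```

Because of the correction term, the required sample grows much more slowly than the population: the same +/- 1% target on a far larger population needs only a few thousand more visitors.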
15. Sample of the big picture
10,000,000 buyers; 10% are your visitors
What price to set for SummitSneaker 2013 (€200 +/- €98)?
[Figure: distribution over a 0 to 400 axis; legend: excluded, included]
21. Results
[Chart: avg loss, cost, and loss of profit, on a € 0 to € 80,000 axis; x-axis categories: 107,394; 10,739; 1,074; 107]
22. What makes a good sampling algorithm?
Uniform
Unbiased
Consistent
Can be repeatable or non-repeatable
In Big Data we mostly use Systematic Sampling
23. How
Unique ID
Modulo (remainder of a division)
Hash
Time
Every N-th minute
Every X-th visitor
Location
Use only 1 server out of 6
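The selection rules listed above can be sketched as small predicates. This is a minimal illustration, not the speaker's implementation: the function names, the use of MD5 as the hash, and the choice to keep remainder 0 are all assumptions.

```python
import hashlib
from datetime import datetime

def by_modulo(user_id: int, n: int) -> bool:
    """Keep 1 of every n numeric IDs (remainder of a division)."""
    return user_id % n == 0

def by_hash(key: str, n: int) -> bool:
    """Keep roughly 1 of every n keys, for IDs that are not plain integers."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n == 0

def by_minute(ts: datetime, n: int) -> bool:
    """Keep events that fall in every n-th minute of the hour."""
    return ts.minute % n == 0

def by_server(server_index: int, chosen: int = 0) -> bool:
    """Keep events from a single server, e.g. 1 server out of 6."""
    return server_index == chosen

kept_ids = [uid for uid in range(1, 61) if by_modulo(uid, 6)]
print(kept_ids)
```

The ID-based predicates are deterministic, so the same visitor is kept or dropped on every run; the time and location predicates depend on when and where the event happened rather than on who produced it.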
25. Beware of buckets
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;
Clustering depends on data type
Clustering of INT is different from BIGINT
Strings are even more complicated
Preserve ability of all systems to sample
Use INT or make it an INT
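To see why the data type matters, the sketch below buckets the same ID under three hashing rules. It borrows Java's documented `hashCode` definitions for `Integer`, `Long`, and `String` purely as an illustration; whether a particular engine buckets exactly this way is an assumption to verify against that engine. All function names are hypothetical.

```python
def _to_int32(x: int) -> int:
    """Wrap an integer to a signed 32-bit value, as a Java int cast would."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def int_hash(v: int) -> int:
    """Java Integer.hashCode: the value itself."""
    return _to_int32(v)

def long_hash(v: int) -> int:
    """Java Long.hashCode: fold the high 32 bits into the low 32 bits."""
    v64 = v & 0xFFFFFFFFFFFFFFFF
    return _to_int32(v64 ^ (v64 >> 32))

def string_hash(s: str) -> int:
    """Java String.hashCode for ASCII input: h = 31 * h + char."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return _to_int32(h)

def bucket(hash_value: int, buckets: int = 256) -> int:
    """Map a signed hash to a non-negative bucket number."""
    return (hash_value & 0x7FFFFFFF) % buckets

print(bucket(int_hash(42)), bucket(long_hash(42)), bucket(string_hash("42")))
print(5_000_000_000 % 256, bucket(long_hash(5_000_000_000)))
```

Under these rules the string "42" lands in a different bucket than the number 42, and a 64-bit ID above the 32-bit range no longer matches a plain `id % 256` computed in another system, which is the reason for the advice to keep the sampling key an INT.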
26. Repeatable vs. Non-Repeatable
Repeatable: UserID % 3
Non-repeatable: 1st visitor of 3
[Figure: grids of Y/N selection marks for Yesterday and Today under each method]
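The difference between the two methods can be sketched with the same six visitors arriving in a different order on two days. The visitor IDs and function names are made up for illustration.

```python
def repeatable_sample(visitors, n):
    """Select by a property of the visitor: UserID % n."""
    return [v for v in visitors if v % n == 0]

def non_repeatable_sample(visitors, n):
    """Select by arrival position: the 1st visitor of every n."""
    return [v for i, v in enumerate(visitors) if i % n == 0]

# Hypothetical visitor IDs; the same people show up in a different order.
yesterday = [101, 102, 103, 104, 105, 106]
today = [104, 101, 106, 102, 105, 103]

rep_y, rep_t = repeatable_sample(yesterday, 3), repeatable_sample(today, 3)
non_y, non_t = non_repeatable_sample(yesterday, 3), non_repeatable_sample(today, 3)

print(rep_y, rep_t)  # same visitors on both days
print(non_y, non_t)  # selection depends on arrival order
```

A repeatable rule lets you follow the same visitors across days, while a position-based rule gives a fresh selection each time.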
27. Don’t forget the weights
We estimate the whole by adding weights to the sample
If you sampled 1/10 of the whole data set, multiply the appropriate metrics by 10
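A minimal sketch of the weighting step, using made-up page-view counts. The data and the name `estimate_total` are hypothetical; only the multiply-by-10 rule comes from the slide.

```python
def estimate_total(sampled_values, weight):
    """Scale an additive metric from the sample up to the whole data set."""
    return sum(sampled_values) * weight

# Hypothetical page views per user for 1,000 users.
page_views = {uid: (uid % 7) + 1 for uid in range(1, 1001)}

# Sample 1/10 of the users, then weight the total by 10.
sample = [views for uid, views in page_views.items() if uid % 10 == 0]
estimate = estimate_total(sample, weight=10)
actual = sum(page_views.values())

print(estimate, actual)
```

Only additive metrics such as counts and sums are multiplied by the weight. Averages and ratios computed on the sample already estimate the whole and should not be scaled.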
28. What can go wrong
Unique ID
IDs assigned by some rule
Time
Grab the 1st hour of the day: midnight traffic won't match day traffic
Monday won’t match Sunday
Different servers may have different schedules
Location
Servers allocated based on region or storefront
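The first failure mode, IDs assigned by some rule, can be demonstrated with an invented assignment rule in which mobile users get even IDs and desktop users get odd IDs. Hashing the ID before taking the remainder is one common mitigation; the use of MD5 here is an assumption, and hashing does nothing for the time and location problems.

```python
import hashlib

def stable_hash(user_id: int) -> int:
    """Deterministic hash of the ID, used to break up assignment patterns."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Hypothetical rule: even IDs are mobile users, odd IDs are desktop users.
users = {uid: ("mobile" if uid % 2 == 0 else "desktop") for uid in range(1, 10_001)}

modulo_sample = [uid for uid in users if uid % 10 == 0]
hashed_sample = [uid for uid in users if stable_hash(uid) % 10 == 0]

def desktop_share(sample):
    """Fraction of desktop users in a sample."""
    return sum(users[uid] == "desktop" for uid in sample) / len(sample)

print(desktop_share(modulo_sample))  # no desktop users at all
print(desktop_share(hashed_sample))  # close to the true share of one half
```

With the plain modulo, every selected ID is even, so the sample silently excludes an entire segment even though it has the right size.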
38. Shoes Data
Sample                                          Take              Avg loss       Cost        Loss of profit
All market                                      $ 1,325,994,929
All data (sample 1/10 of market)                $ 1,325,993,312   $ 1,616        $ 420,000   $ 421,616.08
Sample 1/100 of market (or 1/10 of all data)    $ 1,325,989,167   $ 5,762        $ 42,000    $ 47,761.83
Sample 1/1000 of market                         $ 1,325,965,877   $ 29,052       $ 4,200     $ 33,251.85
Sample 1/10,000 of market                       $ 1,325,576,009   $ 418,920      $ 420       $ 419,339.65
Sample 1/100,000 of market                      $ 1,321,523,057   $ 4,471,872    $ 42        $ 4,471,913.92
39. Marketing Data
Sample                         Take         Avg loss    Cost        Loss of profit
All data                       € 109,969    € 0         € 70,000    € 70,000
Sample 1/10 of population      € 108,358    € 1,611     € 7,000     € 8,611
Sample 1/100 of population     € 104,610    € 5,359     € 700       € 6,059
Sample 1/1000 of population    € 92,981     € 16,989    € 70        € 17,059
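In the Marketing Data figures, loss of profit is the average loss plus the system cost, so the best sampling rate is the one that minimizes that sum. A short sketch using the numbers from the slide; the variable names are illustrative.

```python
# (sampling rate, average loss in EUR, system cost in EUR), from the Marketing Data slide.
rows = [
    ("all data", 0, 70_000),
    ("1/10", 1_611, 7_000),
    ("1/100", 5_359, 700),
    ("1/1000", 16_989, 70),
]

# Loss of profit = what the less accurate answer costs + what the system costs.
loss_of_profit = {rate: avg_loss + cost for rate, avg_loss, cost in rows}
best_rate = min(loss_of_profit, key=loss_of_profit.get)

print(loss_of_profit)
print(best_rate)
```

Processing everything is the most accurate option but not the most profitable one here: past the optimum, sampling error grows faster than the cost savings.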
40. Shoes €200 +/- €98, 1 million buyers
[Chart: avg loss, cost, and loss of profit, on a 0 to 500,000 axis; x-axis categories: all; 104,858; 10,486; 1,049; 105]
41. Shoes €200 +/- €20, 1 million buyers
[Chart: avg loss, system cost, and loss of profit, on a 0 to 450,000 axis; x-axis categories: all; 104,858; 10,486; 1,049; 105]
Editor's notes
Foundation for new discoveries and inventions
Source of additional revenue
We don’t love the data, we love what it gives us
Time: less time to run a report, more reports in the same time frame
Money: systems cost less, more profit