Big Data Sampling
How to make all of your data useful again
Mikhail Petrenko, Sr. Data Architect, Adobe
          mikhail193@gmail.com
Agenda


What is sampling?
Why don’t we use Big Data sampling more?
Why sampling is a good idea
When sampling is a bad idea
Accuracy of sampled reports
Variable rate sampling
Analysis
Summaries
Why we don’t sample


Results are not accurate
It takes time and effort to implement
It is hard to maintain
We can perform all the analysis we want – just give us more hardware.
Why do we need Big Data?
The Future!
Your real goals
Biggest benefits of sampling
Legacy Tools
Is sampling always a good idea?
How Accurate are We?


Profits +/- 30%
EPS + 40%
A sales forecast within +/- 15%-20% is considered pretty accurate.
How big of a sample?


1000 EPS Analysts
30% accuracy
How many do we need to pay to get the same accuracy?
Just 18
How big of a sample?


100,000 site visitors
How many do we need to analyze to get a yes/no answer accurate to +/- 1%?
At 99% confidence: just 14,267 (about 1/7)
At 95% confidence: 8,763 (about 1/12)
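
The two figures on this slide can be reproduced with a standard sample-size calculation. The sketch below assumes Cochran's formula with a finite population correction, a worst-case proportion p = 0.5, and z-scores rounded to 2.58 and 1.96. The deck does not say how its numbers were derived, so the formula and the function name `required_sample_size` are assumptions.

```python
import math

def required_sample_size(population, margin, z, p=0.5):
    """Sample size for estimating a yes/no proportion.

    Assumes Cochran's formula plus the finite population correction;
    p = 0.5 is the worst case (largest variance).
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an unlimited population
    n = n0 / (1 + (n0 - 1) / population)        # shrink it for a finite population
    return math.ceil(n)

n_99 = required_sample_size(100_000, 0.01, z=2.58)  # 99% confidence
n_95 = required_sample_size(100_000, 0.01, z=1.96)  # 95% confidence
print(n_99, n_95)
```

The required size grows slowly with the population, so the sampled fraction shrinks as the data set gets bigger.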
Sample of the big picture


10,000,000 buyers; 10% are your visitors
What price to set for SummitSneaker 2013 (€200 +/- €98)?

[Chart: price axis from 0 to 400 in steps of 20; legend: excluded, included]
Cluster

              Price
hardware      €10,000
software      €4,000
Nodes         30
Total         €420,000
Results

[Chart: Avg Loss, Cost, and Loss of Profit (y-axis €0 to €450,000) by sample size: 1,048,576; 104,858; 10,486; 1,049]
Adjust for sampling
   Bessel's correction
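
As a reminder of what Bessel's correction does: when the variance is estimated from a sample, the sum of squared deviations is divided by n - 1 rather than n. A minimal sketch follows; the price values are made up for illustration.

```python
import math
import statistics

def population_variance(values):
    """Divide by n: correct only when `values` is the whole population."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def sample_variance(values):
    """Divide by n - 1 (Bessel's correction): use when `values` is a sample."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

prices = [180.0, 200.0, 220.0, 240.0]     # hypothetical sampled prices
uncorrected = population_variance(prices)  # underestimates the spread
corrected = sample_variance(prices)
print(uncorrected, corrected, math.sqrt(corrected))
```

The standard library's `statistics.variance` applies the same correction, while `statistics.pvariance` does not.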
Online Marketing


100,000 impressions
Buy and sell 3 blocks per day
340 days
PPM € 1.0 (€1.0 profit per 1,000 impressions)
Cluster

              Price
hardware      €10,000
software      €4,000
Nodes         5
Total         €70,000
Results

[Chart: Avg loss, Cost, and Loss of profit (y-axis €0 to €80,000) by sample size: 107,394; 10,739; 1,074; 107]
What makes a good sampling algorithm?


Uniform
Unbiased
Consistent
Can be repeatable or non-repeatable

In Big Data we mostly use Systematic Sampling
How


Unique ID
  Modulo (remainder of a division)
  Hash
Time
  Every N-th minute
  Every X-th visitor
Location
  Use only 1 server out of 6
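
The ID-based and time-based rules above can be sketched as simple predicates. The function names, the choice of SHA-256, and the 1-in-10 rate are illustrative assumptions, not something the deck prescribes.

```python
import hashlib

def in_sample_modulo(user_id: int, one_in: int) -> bool:
    """Keep an ID when the remainder of the division is zero."""
    return user_id % one_in == 0

def in_sample_hash(key: str, one_in: int) -> bool:
    """Hash the key first, then apply the same modulo rule.

    Works for non-numeric keys and for IDs that are not evenly spread.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % one_in == 0

def in_sample_time(minute_of_day: int, every_nth: int) -> bool:
    """Keep every N-th minute of the day."""
    return minute_of_day % every_nth == 0

ids = range(1, 10_001)
modulo_sample = [i for i in ids if in_sample_modulo(i, 10)]
hash_sample = [i for i in ids if in_sample_hash(str(i), 10)]
minute_sample = [m for m in range(24 * 60) if in_sample_time(m, 6)]
print(len(modulo_sample), len(hash_sample), len(minute_sample))
```

The modulo rule returns exactly one tenth of sequential IDs, while the hash rule returns roughly one tenth. Both are repeatable, because the same ID always gives the same answer.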
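Hadoop/Hive buckets

[Diagram: columns Table, Date, Sample bucket. The Visitors table is divided by date (18-Mar-2013, 19-Mar-2013, 20-Mar-2013), and the data is further divided into sample buckets 1, 2, 3]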
Beware of buckets
   CREATE TABLE user_info_bucketed (
     user_id   BIGINT,
     firstname STRING,
     lastname  STRING
   )
   COMMENT 'A bucketed copy of user_info'
   PARTITIONED BY (ds STRING)
   CLUSTERED BY (user_id) INTO 256 BUCKETS;


Clustering depends on data type
Clustering of INT is different from BIGINT
Strings are even more complicated
Preserve ability of all systems to sample
Use INT or make it an INT
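
A rough illustration of why the column type matters. The sketch assumes Java-style hash codes (an INT hashes to itself, a BIGINT folds its upper 32 bits into the lower 32, a string uses the 31-multiplier hash) and a `(hash & MAX_INT) % buckets` rule. That is an assumption about how classic Hive bucketing behaves, not a statement about any particular Hive version, so treat the numbers as illustrative only.

```python
def to_int32(value):
    """Wrap an integer to a signed 32-bit value, as a Java int would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

def int_hash(value):
    """Java-style hash of an INT: the value itself (truncated to 32 bits)."""
    return to_int32(value)

def bigint_hash(value):
    """Java-style hash of a BIGINT: upper 32 bits XOR lower 32 bits."""
    value &= 0xFFFFFFFFFFFFFFFF
    return to_int32(value ^ (value >> 32))

def string_hash(text):
    """Java-style string hash: h = 31 * h + character code."""
    h = 0
    for ch in text:
        h = to_int32(31 * h + ord(ch))
    return h

def bucket(hash_code, num_buckets=256):
    """Assumed bucket rule: clear the sign bit, then take the remainder."""
    return (hash_code & 0x7FFFFFFF) % num_buckets

big_id = 2 ** 32 + 5                          # does not fit in 32 bits
bucket_as_bigint = bucket(bigint_hash(big_id))
bucket_as_int = bucket(int_hash(big_id))      # same ID squeezed into an INT
bucket_42_int = bucket(int_hash(42))
bucket_42_string = bucket(string_hash("42"))  # same ID stored as text
print(bucket_as_bigint, bucket_as_int, bucket_42_int, bucket_42_string)
```

Under these assumptions the same logical ID lands in different buckets depending on how it is stored, which is why one shared integer representation keeps every system sampling the same rows.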
Repeatable: UserID % 3              Non-Repeatable: 1st Visitor of 3

Yesterday                           Yesterday
   Y    Y    Y    Y                    Y    -    N    N
   N    N    N    N                    N    Y    N    N
   N    N    N    N                    N    -    Y    -

Today                               Today
   Y    Y    Y    Y                    Y    N    N    N
   N    N    N    N                    N    -    N    N
   N    N    N    N                    -    Y    Y    -
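
The difference between the two schemes can be sketched as follows. The visitor IDs and arrival orders are made up, and both function names are hypothetical.

```python
def repeatable_sample(visitor_ids, one_in=3):
    """Pick by ID: the same visitor is always in (or always out of) the sample."""
    return {v for v in visitor_ids if v % one_in == 0}

def non_repeatable_sample(visitor_ids, one_in=3):
    """Pick the 1st visitor of every group of `one_in` arrivals.

    Membership depends on arrival order, so it changes from day to day.
    """
    return {v for position, v in enumerate(visitor_ids) if position % one_in == 0}

# Hypothetical arrival order of the same six visitors on two days
yesterday = [1, 2, 3, 4, 5, 6]
today = [2, 3, 1, 5, 6, 4]

print(repeatable_sample(yesterday), repeatable_sample(today))
print(non_repeatable_sample(yesterday), non_repeatable_sample(today))
```

With the repeatable rule the same visitors can be followed across days. With the non-repeatable rule each day is a valid sample on its own, but the days cannot be joined per visitor.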
Don’t forget the weights


We estimate the whole by adding weights to the sample
If you sampled 1/10 of the whole data set, multiply the appropriate (additive) metrics by 10
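
A small sketch of the weighting step, with made-up revenue figures. Counts and sums are scaled by the weight. Averages and ratios are not, because the weight cancels out.

```python
def estimate_total(sample_values, one_in):
    """Scale an additive metric (a sum) from a 1-in-N sample to the whole."""
    return sum(sample_values) * one_in

def estimate_count(sample_values, one_in):
    """Scale a row count from a 1-in-N sample to the whole."""
    return len(sample_values) * one_in

sampled_revenue = [120.0, 80.0, 100.0]   # hypothetical 1/10 sample
total_revenue = estimate_total(sampled_revenue, 10)
total_buyers = estimate_count(sampled_revenue, 10)
average_revenue = total_revenue / total_buyers   # same as the sample average
print(total_revenue, total_buyers, average_revenue)
```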
What can go wrong


Unique ID
  IDs assigned by some rule
Time
  Grab the 1st hour of the day – midnight traffic won’t match day traffic
  Monday won’t match Sunday
  Different servers may have different schedules
Location
  Servers allocated based on region or storefront
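
The first failure mode can be shown with a made-up ID scheme in which the last digit of the ID encodes the storefront. The scheme, the function names, and the use of SHA-256 are assumptions for illustration only.

```python
import hashlib

def make_id(sequence, storefront):
    """Hypothetical ID rule: the last digit encodes the storefront (0-9)."""
    return sequence * 10 + storefront

def modulo_pick(user_id, one_in=10):
    """Plain modulo on the raw ID."""
    return user_id % one_in == 0

def hashed_pick(user_id, one_in=10):
    """Hash the ID first so the assignment rule no longer drives selection."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % one_in == 0

ids = [make_id(seq, store) for seq in range(1000) for store in range(10)]

# Which storefronts show up in each 1/10 sample?
modulo_stores = {uid % 10 for uid in ids if modulo_pick(uid)}
hashed_stores = {uid % 10 for uid in ids if hashed_pick(uid)}
print(sorted(modulo_stores), sorted(hashed_stores))
```

Plain modulo returns a sample of the right size that contains a single storefront. Hashing the ID before taking the remainder spreads the sample across all of them.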
Variable Rate Sampling
   Sometimes we want to be biased
Why Variable Rate?
Flat sample

[Diagram: scattered records; every sampled record carries weight x3]
Guarantee inclusion of VIPs

[Diagram: sampled records carry weight x3; the VIP record is added with weight x1]
Careful – include VIPs only once

[Diagram: sampled records carry weight x3; a VIP record appears a single time, with weight x1]
Watch out for weights
Variable rate introduces additional skew
Weight correction when needed
Stratified weights
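
A minimal sketch of variable rate sampling with per-record weights. It assumes a 1-in-3 flat sample for ordinary visitors and guaranteed inclusion for VIPs. The records, the spend figures, and the function name are hypothetical.

```python
def variable_rate_sample(records, one_in=3):
    """Return (user_id, spend, weight) rows.

    VIPs are always included with weight 1. Everyone else goes through the
    flat 1-in-N sample and gets weight N. The if/elif keeps a VIP who also
    matches the flat rule from being included twice.
    """
    sample = []
    for user_id, is_vip, spend in records:
        if is_vip:
            sample.append((user_id, spend, 1))
        elif user_id % one_in == 0:
            sample.append((user_id, spend, one_in))
    return sample

# Hypothetical data: nine visitors, visitor 6 is a VIP with a large spend
records = [(i, i == 6, 1000.0 if i == 6 else 10.0) for i in range(1, 10)]

sample = variable_rate_sample(records)
weighted_total = sum(spend * weight for _, spend, weight in sample)
naive_total = sum(spend * 3 for _, spend, _ in sample)  # flat weight for all
true_total = sum(spend for _, _, spend in records)
print(sample, weighted_total, naive_total, true_total)
```

Applying the flat weight to the VIP row triples the VIP's spend. Using the stratum's own weight keeps the estimate close to the true total.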
Questions?
Shoes Data

Sample                                       Take               Avg loss       Cost         Loss of profit
All market                                   $ 1,325,994,929
All data (a 1/10 sample of the market)       $ 1,325,993,312    $ 1,616        $ 420,000    $ 421,616.08
Sample 1/100 of market (1/10 of all data)    $ 1,325,989,167    $ 5,762        $ 42,000     $ 47,761.83
Sample 1/1,000 of market                     $ 1,325,965,877    $ 29,052       $ 4,200      $ 33,251.85
Sample 1/10,000 of market                    $ 1,325,576,009    $ 418,920      $ 420        $ 419,339.65
Sample 1/100,000 of market                   $ 1,321,523,057    $ 4,471,872    $ 42         $ 4,471,913.92
Marketing Data

Sample                           Take         Avg loss    Cost        Loss of profit
All data                         € 109,969    € 0         € 70,000    € 70,000
Sample 1/10 of population        € 108,358    € 1,611     € 7,000     € 8,611
Sample 1/100 of population       € 104,610    € 5,359     € 700       € 6,059
Sample 1/1,000 of population     € 92,981     € 16,989    € 70        € 17,059
Shoes €200 +/- €98, 1 million buyers

[Chart: avg loss, cost, and loss of profit (y-axis 0 to 500,000) by sample size: all; 104,858; 10,486; 1,049; 105]
Shoes €200 +/- €20, 1 million buyers

[Chart: Avg loss, System cost, and Loss of profit (y-axis 0 to 450,000) by sample size: all; 104,858; 10,486; 1,049; 105]

Big Data Sampling

  • 1. Big Data Sampling How to make all of your data useful again Mikhail Petrenko, Sr. Data Architect, Adobe mikhail193@gmail.com
  • 2. Agenda What is sampling? Why don’t we use Big Data sampling more? Why sampling is a good idea When sampling is a bad idea Accuracy of sampled reports Variable rate sampling
  • 5. Why we don’t sample Results are not accurate It takes time and effort to implement It is hard to maintain We can perform all the analysis we want – just give us more hardware.
  • 6. Why do we need Big Data?
  • 11. Is sampling always a good idea?
  • 12. How Accurate are We? Profits +/- 30% EPS + 40% Sales forecast +/- 15%-20% considered pretty accurate.
  • 13. How big of a sample? 1000 EPS Analysts 30% accuracy How many do we need to pay to get the same accuracy? Just 18
  • 14. How big of a sample? 100,000 site visitors How many do we need to analyze to get yes/no answer accurate to +/- 1% 99% accuracy Just 14,267 (1/7) 95% accuracy 8,763 (1/12)
  • 15. Sample of the big picture: 10,000,000 buyers, 10% are your visitors. What price to set for SummitSneaker 2013 (€200 +/- €98)? [chart: included vs. excluded buyers, horizontal axis from 0 to 400]
  • 16. Cluster: price per node: hardware €10,000, software €4,000. Nodes: 30. Total: €420,000.
  • 17. Results [chart: avg loss, cost, and loss of profit, from €0 to €450,000, at sample sizes 1,048,576; 104,858; 10,486; 1,049]
  • 18. Adjust for sampling: Bessel's correction
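
Bessel's correction divides the sum of squared deviations by n - 1 instead of n, so that the variance computed from a sample is an unbiased estimate of the population variance. A small sketch follows; the price values are made up for illustration and do not come from the slides.

```python
import statistics


def sample_variance(values):
    """Variance of a sample using Bessel's correction (divide by n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1)


# Hypothetical sampled prices in euros.
prices = [180.0, 200.0, 220.0, 240.0]

corrected = sample_variance(prices)           # divides by n - 1
uncorrected = statistics.pvariance(prices)    # divides by n
print(corrected, uncorrected)
```

The uncorrected value is always smaller, so skipping the correction understates the spread of the full data set, most noticeably for small samples.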
  • 19. Online Marketing: 100,000 impressions. Buy and sell 3 blocks per day, 340 days. PPM € 1.0 (€1.0 profit per 1,000 impressions).
  • 20. Cluster: price per node: hardware €10,000, software €4,000. Nodes: 5. Total: €70,000.
  • 21. Results [chart: avg loss, cost, and loss of profit, from €0 to €80,000, at sample sizes 107,394; 10,739; 1,074; 107]
  • 22. What makes a good sampling algorithm? Uniform. Unbiased. Consistent. Can be repeatable or non-repeatable. In Big Data we mostly use systematic sampling.
  • 23. How. Unique ID: modulo (remainder of a division), hash. Time: every N-th minute, every X-th visitor. Location: use only 1 server out of 6.
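
The unique-ID approach on slide 23 can be sketched as follows. This is an illustration under assumptions, not code from the deck: the function names, the choice of SHA-256 as the hash, and the sample IDs are all hypothetical.

```python
import hashlib


def sample_by_modulo(user_id, one_in_n, bucket=0):
    """Keep a numeric ID when its remainder matches the chosen bucket."""
    return user_id % one_in_n == bucket


def sample_by_hash(user_id, one_in_n, bucket=0):
    """Keep a string ID based on a hash, for IDs that are not plain integers.

    Hashing first also spreads out IDs that were assigned by some rule.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % one_in_n == bucket


# Numeric IDs: keep 1 visitor out of 3.
kept = [uid for uid in range(1, 31) if sample_by_modulo(uid, 3)]
print(len(kept))  # 10

# String IDs: the same ID always gets the same answer, so it is repeatable.
print(sample_by_hash("visitor-42", 6))
```

Both functions are deterministic, so the same visitors are selected on every run, which makes the sample repeatable across reports and systems.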
  • 24. Hadoop/Hive buckets [diagram: Visitors table by date (18-Mar-2013, 19-Mar-2013, 20-Mar-2013) with buckets 1, 2, 3; sample table]
  • 25. Beware of buckets
      CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
      COMMENT 'A bucketed copy of user_info'
      PARTITIONED BY(ds STRING)
      CLUSTERED BY(user_id) INTO 256 BUCKETS;
      Clustering depends on the data type. Clustering of INT is different from BIGINT. Strings are even more complicated. Preserve the ability of all systems to sample: use INT or make it an INT.
  • 26. Repeatable (UserID % 3) vs. non-repeatable (1st visitor of 3) [diagram: Y/N/- grids for Yesterday and Today under each scheme]
  • 27. Don’t forget the weights: We estimate the whole by adding weights to the sample. If you sampled 1/10 of the whole data set, multiply the appropriate metrics by 10.
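
A short sketch of the weighting step on slide 27, assuming a flat 1-in-N sample. The function names and the page-view numbers are hypothetical. Note which metrics are "appropriate": additive metrics such as counts and sums are scaled up, while averages and ratios are not.

```python
def estimate_total(sampled_values, one_in_n):
    """Scale an additive metric from a 1-in-N sample back to the whole."""
    weight = one_in_n  # each sampled item stands for N items
    return sum(sampled_values) * weight


def estimate_mean(sampled_values):
    """Averages are not multiplied: the weights cancel out."""
    return sum(sampled_values) / len(sampled_values)


# Hypothetical page views from three sampled visitors, sampled 1 in 10.
page_views = [12, 7, 30]

print(estimate_total(page_views, 10))  # 490 estimated page views overall
print(estimate_mean(page_views))       # estimated page views per visitor
```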
  • 28. What can go wrong. Unique ID: IDs assigned by some rule. Time: grab the 1st hour of the day and midnight traffic won’t match day traffic; Monday won’t match Sunday; different servers may have different schedules. Location: servers allocated based on region or storefront.
  • 29. Variable Rate Sampling: sometimes we want to be biased
  • 31. Flat sample [diagram: every sampled item carries weight x3]
  • 32. Guarantee inclusion of VIPs [diagram: sampled items carry weight x3, the VIP carries weight x1]
  • 33. Careful – include VIPs only once [diagram: sampled items carry weight x3, the VIP appears once with weight x1]
  • 34. Watch out for weights: variable rate introduces additional skew
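
The variable rate scheme on slides 31 to 34 can be sketched as below. This is one possible reading of the diagrams, not the author's implementation: non-VIP visitors are sampled 1 in 3 with weight 3, VIPs are always kept with weight 1, and the function name, visitor IDs, and VIP set are hypothetical.

```python
def variable_rate_sample(visitor_ids, vip_ids, one_in_n=3):
    """Return a mapping of sampled visitor ID to its weight.

    VIPs are always included and represent only themselves (weight 1).
    Everyone else is sampled 1 in N and stands for N visitors.
    Using a dict keyed by ID means a VIP can never be included twice.
    """
    weights = {}
    for visitor_id in visitor_ids:
        if visitor_id in vip_ids:
            weights[visitor_id] = 1
        elif visitor_id % one_in_n == 0:
            weights[visitor_id] = one_in_n
    return weights


visitors = range(1, 31)   # 30 hypothetical visitors
vips = {6, 7}             # visitor 6 would also have matched the 1-in-3 rule

weights = variable_rate_sample(visitors, vips)
print(weights[6], weights[7], weights[3])  # 1 1 3
print(sum(weights.values()))               # 29, estimating 30 visitors
```

Visitor 6 shows the pitfall from slide 33: it qualifies both as a VIP and under the flat rule, and giving it weight 3 (or counting it twice) would overstate the total.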
  • 38. Shoes
      Data | Take | Avg loss | Cost | Loss of profit
      All market | $ 1,325,994,929 | | |
      All data | - | | |
      Sample - 1/10 of market | $ 1,325,993,312 | $ 1,616 | $ 420,000 | 421,616.08
      Sample 1/100 of market or 1/10 of all data | $ 1,325,989,167 | $ 5,762 | $ 42,000 | 47,761.83
      Sample 1/1000 of market | $ 1,325,965,877 | $ 29,052 | $ 4,200 | 33,251.85
      Sample 1/10,000 of market | $ 1,325,576,009 | $ 418,920 | $ 420 | 419,339.65
      Sample 1/100,000 of market | $ 1,321,523,057 | $ 4,471,872 | $ 42 | 4,471,913.92
  • 39. Marketing
      Data | Take | Avg loss | Cost | Loss of profit
      All data | € 109,969 | € 0 | € 70,000 | € 70,000
      Sample - 1/10 of population | € 108,358 | € 1,611 | € 7,000 | € 8,611
      Sample 1/100 of population | € 104,610 | € 5,359 | € 700 | € 6,059
      Sample 1/1000 of population | € 92,981 | € 16,989 | € 70 | € 17,059
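
In both tables the "loss of profit" column equals the average loss plus the system cost, so the best sampling rate is the one that minimizes that sum. The sketch below reproduces this arithmetic for the marketing figures on slide 39; the variable and function names are illustrative, and the numbers are copied from the slide.

```python
# (label, average loss in EUR, system cost in EUR), as given on slide 39.
rows = [
    ("All data", 0, 70_000),
    ("Sample 1/10 of population", 1_611, 7_000),
    ("Sample 1/100 of population", 5_359, 700),
    ("Sample 1/1000 of population", 16_989, 70),
]


def loss_of_profit(avg_loss, cost):
    """Total loss: accuracy lost to sampling plus what the system costs."""
    return avg_loss + cost


totals = {label: loss_of_profit(avg_loss, cost) for label, avg_loss, cost in rows}

# The cheapest option overall is the one with the smallest total loss.
best = min(totals, key=totals.get)
print(best, totals[best])  # Sample 1/100 of population 6059
```

Sampling too little is also costly: at 1/1000 the hardware is almost free, but the loss from inaccurate answers outweighs the savings.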
  • 40. Shoes, €200 +/- €98, 1 million buyers [chart: avg loss, cost, and loss of profit, from 0 to 500,000, at sample sizes all; 104,858; 10,486; 1,049; 105]
  • 41. Shoes, €200 +/- €20, 1 million buyers [chart: avg loss, system cost, and loss of profit, from 0 to 450,000, at sample sizes all; 104,858; 10,486; 1,049; 105]

Editor's notes

  1. Foundation for new discoveries and inventions. Source of additional revenue. We don’t love the data, we love what it gives us.
  2. Time: less time to run a report, more reports in the same time frame. Money: systems cost less, more profit.
  3. Statistical packages. Reporting. Data mining. Business Intelligence.
  4. Data collection. Finance. Very small populations.