30 Billion Events a Day with Hadoop




Michael Brown, CTO, comScore, Inc.
May 10th, 2012
comScore is a Global Leader in Measuring the Digital World

  NASDAQ            SCOR
  Clients           1860+ worldwide
  Employees         1000+
  Headquarters      Reston, VA
  Global Coverage   170+ countries under measurement; 43 markets reported
  Local Presence    32 locations in 23 countries




                © comScore, Inc.   Proprietary.             2                                      V1011
Some of our Clients
 Media   Agencies   Telecom/Mobile            Financial   Retail   Travel   CPG   Pharma   Technology




The Trusted Source for Digital Intelligence Across Vertical Markets


       9 out of the top 10      INVESTMENT BANKS
       4 out of the top 4       WIRELESS CARRIERS
       47 out of the top 50     ONLINE PROPERTIES
       45 out of the top 50     ADVERTISING AGENCIES
       9 out of the top 10      MAJOR MEDIA COMPANIES
       9 out of the top 10      AUTO INSURERS
       11 out of the top 12     INTERNET SERVICE PROVIDERS
       14 out of the top 15     PHARMACEUTICAL COMPANIES
       11 out of the top 12     CONSUMER FINANCE COMPANIES
       8 out of the top 10      CPG COMPANIES


Unified Digital Measurement™ (UDM) Establishes Platform For
Panel + Census Data Integration


     PANEL:  Global PERSON Measurement
     CENSUS: Global DEVICE Measurement

     Unified Digital Measurement (UDM)
       –  Patent-Pending Methodology
       –  Adopted by 90% of Top 100 U.S. Media Properties


Beacon Heat Map




Worldwide Tags per Month

[Chart: Monthly Records Collection. Y axis: # of records, 0 to 1,000,000,000,000. X axis: months, Jul 2009 to May 2012. Series: Panel Records, Beacon Records.]
Our Event Volume in Perspective

       Property            Page Views (MM)

       FACEBOOK.COM                472,814
       Google Sites                302,802
       Yahoo! Sites                 90,448
       Total                       866,064




Source: comScore MediaMetrix Worldwide April 2012




Growth Slides
[Chart: growth trend with fitted line, R² = 0.9335. Y axis: 0 to 1,600,000,000,000.]
The Project:
Census Web Agg




The Problem Statement

§  Calculate the number of events and unique cookies for each key
§  Key takeaways
  –  Data on input will be sessionized daily
  –  Need to process all data for a month
  –  Need to calculate values for Total Internet and for each site under measurement




Counting Uniques from a Time Ordered Log File



  Input (time ordered): A, D, B, C, B, A, A

  Major Downsides:
  –  Need to keep all key elements in memory.
  –  Constrained to one machine for final aggregation.
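
A minimal Python sketch of this approach, assuming each log record reduces to a single ID (the letters on this slide). The function name is hypothetical. It illustrates the memory downside: every distinct ID seen so far has to stay in the set until the whole log has been read.

```python
def count_uniques_time_ordered(ids):
    """Count events and distinct IDs in a stream that is in arrival order."""
    seen = set()      # grows with the number of distinct IDs
    events = 0
    for item in ids:
        events += 1
        seen.add(item)
    return events, len(seen)


# The sequence shown on the slide.
events, uniques = count_uniques_time_ordered(["A", "D", "B", "C", "B", "A", "A"])
print(events, uniques)
```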


Counting Uniques from a Key Ordered Log File



  Input (key ordered): A, A, A, B, B, C, D

  Major Downsides:
  –  Need to sort data in advance.
  –  The sort time increases as volume grows.
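
A minimal Python sketch of the key-ordered variant, again assuming one ID per record and a hypothetical function name. Because equal IDs are adjacent, only the previous ID has to be remembered, so memory use stays constant. The cost has moved into the sort that must run beforehand.

```python
def count_uniques_key_ordered(sorted_ids):
    """Count events and distinct IDs in a stream already sorted by ID."""
    events = 0
    uniques = 0
    previous = object()   # sentinel that never equals a real ID
    for item in sorted_ids:
        events += 1
        if item != previous:
            uniques += 1
            previous = item
    return events, uniques


# The sequence shown on the slide.
events, uniques = count_uniques_key_ordered(["A", "A", "A", "B", "B", "C", "D"])
print(events, uniques)
```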


Scaling Issue

§  As our volume has grown we have the following stats:
  –  Over 900 billion events per month
  –  Over 150 billion sessions per month
  –  Over 5,000 reportable sites
  –  Over 50 countries
  –  We see 15 billion distinct cookies in a month
  –  5 sites have over 1 billion cookies in a month
  –  The sum of all distinct cookies is 377 billion
  –  We only need to output 15 million rows
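
A short Python sketch of the arithmetic behind these figures. The 30-day month is an assumption, and so is the reading of "the sum of all distinct cookies" as the per-key distinct counts added across all keys.

```python
# Figures from the slide; the 30-day month is an assumption.
EVENTS_PER_MONTH = 900_000_000_000
DAYS_PER_MONTH = 30
DISTINCT_COOKIES = 15_000_000_000
SUM_OF_PER_KEY_DISTINCT_COOKIES = 377_000_000_000
OUTPUT_ROWS = 15_000_000

# Daily event volume, which is where the talk title comes from.
events_per_day = EVENTS_PER_MONTH // DAYS_PER_MONTH

# Average number of keys in which a single cookie is counted.
keys_per_cookie = SUM_OF_PER_KEY_DISTINCT_COOKIES / DISTINCT_COOKIES

# Average number of distinct cookies collapsed into each output row.
cookies_per_output_row = SUM_OF_PER_KEY_DISTINCT_COOKIES / OUTPUT_ROWS

print(f"{events_per_day:,} events per day")
print(f"{keys_per_cookie:.1f} keys per cookie on average")
print(f"{cookies_per_output_row:,.0f} distinct cookies per output row on average")
```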




Counting Uniques from a Key Ordered Log File




Windows v1 (Single Server)

§  Time to process data for first few months
       Month         Wall Time (hours)

       Jul 2009       8
       Aug 2009      10
       Sep 2009      11
       Oct 2009      16
       Nov 2009      37




§  V1 processed sessions at roughly 250K rows/sec


§  Problems with this version:
  –  Slow
  –  Not Scalable
  –  Dedicated Server
  –  Bottleneck for delivering production


Counting Uniques from Sharded Key Ordered Log Files




Windows v2

§  Features of this version
  –  Distributed (32 servers)
  –  Multithreaded
  –  Data Localization
  –  Very low network data transfer
  –  Handling the data growth

§  The V2 code processed data at over 8 million rows/sec
  –  1 hour for Dec 2009; 5 hours for April 2012

§  Issues
  –  Data is distributed by ID into 64 parts
  –  Possible skew in the distribution key, which hurts performance and causes high disk usage on a node
  –  All data replication is manual, along with recovery
  –  Results cannot be calculated if any node is down
  –  Adding new servers or changing the number of parts is a ton of effort
  –  Overhead to maintain framework to run distributed jobs
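
A minimal Python sketch of distributing by ID into 64 parts, as described above. The CRC32 hash and the function names are assumptions for illustration; the slides do not say how IDs were mapped to parts. Because a given ID always lands in the same part, each part can count its uniques independently and the totals can simply be added. Skew shows up when some parts hold far more rows than others.

```python
import zlib

NUM_PARTS = 64  # from the slide


def part_for(cookie_id: str) -> int:
    """Map an ID to a part number; CRC32 is an assumed hash function."""
    return zlib.crc32(cookie_id.encode("utf-8")) % NUM_PARTS


def count_uniques_sharded(cookie_ids):
    """Count distinct IDs by counting within each part and summing."""
    parts = [set() for _ in range(NUM_PARTS)]
    for cookie_id in cookie_ids:
        parts[part_for(cookie_id)].add(cookie_id)
    # No ID can appear in two parts, so per-part counts add up exactly.
    return sum(len(part) for part in parts)


print(count_uniques_sharded(["c1", "c2", "c1", "c3", "c2"]))
```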




Enter the Elephant

§  Why Hadoop?
 –  Scalable
 –  Low risk of data loss, thanks to replication
 –  Run on a shared production cluster
 –  No overhead to maintain framework
 –  Easy job submission and management




Basic Approach

§  Leverage Pig for POC
  –  Pig Latin is easy for developers and data analysts to learn
  –  Rapid application development vs. M/R applications (e.g. 1 line of Pig Latin = 20 lines of Java Map/Reduce)
  –  Extendable via UDFs




Performance of Basic Approach on Various Samples

[Chart: Aggregation Performance. Y axis: Time (minutes), 0 to 80. X axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%).]

Note: Target data size is over 10 TB
M/R Data Flow


  Map input:        B C     |   A B     |   C A
  Map:              Mapper  |   Mapper  |   Mapper
  After shuffle:    A A     |   B B     |   C C
  Reduce:           Reduce  |   Reduce  |   Reduce
  Reduce output:    A       |   B       |   C




Basic Approach Retrospective

§  Processing speed is not scaling to our needs on a sample of the input data
§  Diagnosis
  –  Most aggregations could not take significant advantage of combiners. Not a Pig issue.
  –  Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster than on the current architecture


§  Conclusion
  –  A new approach is required to reduce the shuffle
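
A small Python sketch of why combiners help event counts but not exact unique counts. The two mappers and the cookie IDs are hypothetical. Partial sums can be added, but partial distinct counts cannot, so the cookie IDs themselves have to travel through the shuffle.

```python
# Two hypothetical mappers, each holding the cookies it saw for one site.
mapper_1 = ["c1", "c2", "c2"]
mapper_2 = ["c2", "c3"]

# Event counts: partial sums can be added, so a combiner shrinks the
# shuffle to one number per mapper.
event_count = len(mapper_1) + len(mapper_2)

# Unique cookies: adding per-mapper distinct counts double counts "c2".
naive_uniques = len(set(mapper_1)) + len(set(mapper_2))

# An exact answer needs the cookie IDs themselves to reach the reducer.
exact_uniques = len(set(mapper_1) | set(mapper_2))

print(event_count, naive_uniques, exact_uniques)
```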




Solution to reduce the shuffle

§  The Problem:
  –  Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues

§  The Idea:
  –  Partition and sort data on a daily basis
  –  Create a custom input format to merge daily partitions for monthly aggregations
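
A minimal Python sketch of the idea, not the Hadoop input format itself. It assumes each daily partition is a list of (key, cookie) pairs already sorted by key and then cookie, and that cookie IDs are never None. The function name is hypothetical. Merging the sorted daily partitions keeps equal keys and equal cookies adjacent, so events and uniques can be counted in one streaming pass with no shuffle of raw rows.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def monthly_aggregate(daily_partitions):
    """Merge sorted daily (key, cookie) partitions and count per key."""
    merged = heapq.merge(*daily_partitions)  # lazy k-way merge, stays sorted
    results = []
    for key, rows in groupby(merged, key=itemgetter(0)):
        events = 0
        uniques = 0
        last_cookie = None  # assumes real cookie IDs are never None
        for _, cookie in rows:
            events += 1
            if cookie != last_cookie:  # first time this cookie appears for key
                uniques += 1
                last_cookie = cookie
        results.append((key, events, uniques))
    return results


day_1 = [("A", "c1"), ("A", "c2"), ("B", "c1")]
day_2 = [("A", "c1"), ("B", "c3"), ("C", "c9")]
print(monthly_aggregate([day_1, day_2]))
```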




Custom Input Format with Map Side Aggregation


  Input:            B C  |  A B C A
  Map:              A Mapper   |   B Mapper   |   C Mapper
  Combiner:         Combiner   |   Combiner   |   Combiner
  Combiner output:  A          |   B          |   C
  Reduce:           Reduce     |   Reduce     |   Reduce
  Reduce output:    A          |   B          |   C

Performance of v2 on Various Samples

[Chart: Aggregation Performance. Y axis: Time (minutes), 0 to 120. X axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%), 10304 GB (100%). Series: Pig, Custom Input Format.]
Partitioning Summary

§  Benefits:
  –  A large portion of the aggregation can be completed in the map phase
  –  Applications can now take advantage of combiners
  –  Shuffle sizes are minimal

§  Risks:
  –  Data locality loss
  –  Map failures might result in long run times, depending on the size of the partitions




Full Sample Performance

§  Analysis of the full data set
  –  10 TB of input data
  –  150 billion session rows


§  Total Time
  –  1 hour, 45 minutes
  –  Over 23,000,000 rows/sec
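
A quick Python check of the throughput figure, using the row count and run time above. The V1 and V2 rates are the approximate figures quoted on the earlier Windows slides, so the speedups are rough comparisons only; those runs used different hardware and data volumes.

```python
ROWS = 150_000_000_000                 # session rows in the full month
ELAPSED_SECONDS = 1 * 3600 + 45 * 60   # 1 hour, 45 minutes

rows_per_sec = ROWS / ELAPSED_SECONDS

# Approximate rates quoted earlier in the deck.
V1_ROWS_PER_SEC = 250_000      # "roughly 250K rows/sec"
V2_ROWS_PER_SEC = 8_000_000    # "over 8 million rows/sec"

speedup_vs_v1 = rows_per_sec / V1_ROWS_PER_SEC
speedup_vs_v2 = rows_per_sec / V2_ROWS_PER_SEC

print(f"{rows_per_sec:,.0f} rows/sec")
print(f"about {speedup_vs_v1:.0f}x V1, about {speedup_vs_v2:.1f}x V2")
```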




Future Ideas

§  HBase
  –  Unique cookie calculations are free as data is more organized
  –  How will data loading fare?


§  Data Locality
  –  Ideally we could provide additional hints about how the data is stored
  –  Not sure if it will be included in Hadoop


§  Connection to an MPP DB
  –  We also leverage Greenplum DB; we could connect to each sharded instance




Hadoop Cluster

§  Production Hadoop Cluster
  –  80 nodes: Mix of Dell R710 and R510
  –  Each R510 has 12x2TB drives, 64GB RAM, and 24 cores
  –  1768 total CPUs
  –  4.7TB total memory
  –  1200TB total disk space
  –  Our distro is MapR M5 1.2.7




Useful Factoids
  Colorful, bite-sized graphical representations of the best discoveries we unearth.




    Visit www.comscoredatamine.com or follow @datagems for the latest gems.


Thank You!


 Michael Brown
 CTO
 comScore, Inc.


 mbrown@comscore.com




             © comScore, Inc.   Proprietary.   32

Contenu connexe

Tendances (7)

The Rise and Rise of Mobile: a Guardian Case Study
The Rise and Rise of Mobile: a Guardian Case StudyThe Rise and Rise of Mobile: a Guardian Case Study
The Rise and Rise of Mobile: a Guardian Case Study
 
Mba applications report
Mba applications reportMba applications report
Mba applications report
 
Pultry industry in north america
Pultry industry in north americaPultry industry in north america
Pultry industry in north america
 
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...
 
Cyrela - Company Presentation - 3th Quarter 2008
Cyrela - Company Presentation - 3th Quarter 2008Cyrela - Company Presentation - 3th Quarter 2008
Cyrela - Company Presentation - 3th Quarter 2008
 
10 years of open access at BioMed Central
10 years of open access at BioMed Central10 years of open access at BioMed Central
10 years of open access at BioMed Central
 
Commentary: Hunger Reduction with Agricultural R&D and Policy Change
Commentary: Hunger Reduction with  Agricultural R&D and Policy  ChangeCommentary: Hunger Reduction with  Agricultural R&D and Policy  Change
Commentary: Hunger Reduction with Agricultural R&D and Policy Change
 

Similaire à 30B events a day with hadoop (7)

NWA Collection
NWA CollectionNWA Collection
NWA Collection
 
Consumer Snapshot January 2013
Consumer Snapshot January 2013Consumer Snapshot January 2013
Consumer Snapshot January 2013
 
Amárach Economic Recovery Index February 2013
Amárach Economic Recovery Index February 2013Amárach Economic Recovery Index February 2013
Amárach Economic Recovery Index February 2013
 
Amárach Economic Recovery Index March 2013
Amárach Economic Recovery Index March 2013Amárach Economic Recovery Index March 2013
Amárach Economic Recovery Index March 2013
 
Pp slides
Pp slidesPp slides
Pp slides
 
Office property market overivew 3Q 2011-India
Office property market overivew  3Q 2011-IndiaOffice property market overivew  3Q 2011-India
Office property market overivew 3Q 2011-India
 
Pink pantehrs
Pink pantehrsPink pantehrs
Pink pantehrs
 

Plus de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Dernier (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison

30B events a day with Hadoop

  • 1. 30 Billion Events a Day with Hadoop Michael Brown, CTO, comScore, Inc. May 10th, 2012
  • 2. comScore is a Global Leader in Measuring the Digital World. NASDAQ: SCOR. Clients: 1860+ worldwide. Employees: 1000+. Headquarters: Reston, VA. Global Coverage: 170+ countries under measurement; 43 markets reported. Local Presence: 32 locations in 23 countries. © comScore, Inc. Proprietary.
  • 3. Some of our Clients: Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Pharma, Technology.
  • 4. The Trusted Source for Digital Intelligence Across Vertical Markets: 9 out of the top 10 investment banks; 9 out of the top 10 auto insurers; 4 out of the top 4 wireless carriers; 11 out of the top 12 internet service providers; 47 out of the top 50 online properties; 14 out of the top 15 pharmaceutical companies; 45 out of the top 50 advertising agencies; 11 out of the top 12 consumer finance companies; 9 out of the top 10 major media companies; 8 out of the top 10 CPG companies.
  • 5. Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration. Panel: global person measurement. Census: global device measurement. Unified Digital Measurement (UDM): patent-pending methodology, adopted by 90% of the top 100 U.S. media properties.
  • 6. Beacon Heat Map
  • 7. Worldwide Tags per Month [chart: Monthly Records Collection; y-axis: # of records, 0 to 1,000,000,000,000; x-axis: months, 2009 to 2012; series: Panel Records, Beacon Records]
  • 8. Our Event Volume in Perspective. Page views (MM) by property: FACEBOOK.COM 472,814; Google Sites 302,802; Yahoo! Sites 90,448; Total 866,064. Source: comScore MediaMetrix Worldwide, April 2012.
  • 9. Growth Slides [chart: trend line, R² = 0.9335; y-axis: 0 to 1,600,000,000,000]
  • 10. The Project: Census Web Agg
  • 11. The Problem Statement: calculate the number of events and unique cookies for each key. Key takeaways: data on input will be sessionized daily; all data for a month must be processed; values must be calculated for Total Internet and for each site under measurement.
  • 12. Counting Uniques from a Time Ordered Log File [diagram: keys arrive in the order A, D, B, C, B, A, A]. Major downsides: need to keep all key elements in memory; constrained to one machine for final aggregation.
  • 13. Counting Uniques from a Key Ordered Log File [diagram: keys arrive in the order A, A, A, B, B, C, D]. Major downsides: need to sort data in advance; the sort time increases as volume grows.
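The trade-off between slides 12 and 13 can be sketched in a few lines of Python. This is an illustrative sketch only, not comScore's code: the function names are hypothetical, and the sample keys are the ones shown in the two diagrams.

```python
from itertools import groupby

def uniques_time_ordered(keys):
    """Count distinct keys arriving in arbitrary (e.g. time) order.

    Every key seen so far must stay in memory, on a single machine.
    """
    seen = set()
    for key in keys:
        seen.add(key)
    return len(seen)

def uniques_key_ordered(sorted_keys):
    """Count distinct keys in a key-ordered stream.

    Only the current run of equal keys is tracked, so memory use is
    constant, but the input must have been sorted beforehand.
    """
    return sum(1 for _key, _run in groupby(sorted_keys))

time_ordered = ["A", "D", "B", "C", "B", "A", "A"]  # order on slide 12
key_ordered = sorted(time_ordered)                   # order on slide 13

print(uniques_time_ordered(time_ordered))
print(uniques_key_ordered(key_ordered))
```

Both calls report the same number of distinct keys; the difference is where the cost lands: memory for the first, an up-front sort for the second.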
  • 14. Scaling Issue. As our volume has grown, we have the following stats: over 900 billion events per month; over 150 billion sessions per month; over 5,000 reportable sites; over 50 countries; 15 billion distinct cookies seen in a month; 5 sites with over 1 billion cookies in a month; the sum of all distinct cookies is 377 billion; we only need to output 15 million rows.
  • 15. Counting Uniques from a Key Ordered Log File
  • 16. Windows v1 (Single Server). Time to process data for the first few months, in wall time (hours): Jul 2009: 8; Aug 2009: 10; Sep 2009: 11; Oct 2009: 16; Nov 2009: 37. V1 processed sessions at roughly 250K rows/sec. Problems with this version: slow; not scalable; dedicated server; bottleneck for delivering production.
  • 17. Counting Uniques from Sharded Key Ordered Log Files
  • 18. Windows v2. Features of this version: distributed (32 servers); multithreaded; data localization; very low network data transfer; handles the data growth. The V2 code processed data at over 8 million rows/sec (1 hour for Dec 2009; 5 hours for April 2012). Issues: data is distributed by ID into 64 parts; possible skew in the distribution key, which hurts performance and causes high disk usage on a node; all data replication is manual, along with recovery; results cannot be calculated if any node is down; adding new servers or changing the number of parts takes a ton of effort; overhead to maintain a framework to run distributed jobs.
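A rough sketch of why distributing by ID lets each part count uniques independently. It assumes the distribution ID is the cookie ID and uses CRC32 as the hash; both are assumptions for illustration, since the slide does not say how IDs were assigned to the 64 parts. The names `part_for` and `count_uniques_sharded` are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTS = 64  # from the slide: data is distributed by ID into 64 parts

def part_for(cookie_id: str) -> int:
    """Map an ID to a part with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(cookie_id.encode("utf-8")) % NUM_PARTS

def count_uniques_sharded(rows):
    """rows: iterable of (site, cookie_id). Returns {site: distinct cookies}."""
    # Step 1: route each row to a part by ID.
    parts = defaultdict(list)
    for site, cookie_id in rows:
        parts[part_for(cookie_id)].append((site, cookie_id))

    # Step 2: each part counts its own distinct cookies per site
    # (on a real cluster this would run on the server holding the part).
    totals = defaultdict(int)
    for part_rows in parts.values():
        per_site = defaultdict(set)
        for site, cookie_id in part_rows:
            per_site[site].add(cookie_id)
        # Step 3: partial counts simply add up, because a given ID
        # never appears in more than one part.
        for site, cookies in per_site.items():
            totals[site] += len(cookies)
    return dict(totals)

rows = [("a.com", "c1"), ("a.com", "c2"), ("a.com", "c1"),
        ("b.com", "c1"), ("b.com", "c3")]
result = count_uniques_sharded(rows)
print(result)
```

The same property explains the skew issue listed on the slide: if the hash sends a disproportionate share of rows to one part, that part's server becomes the slowest and fullest.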
  • 19. Enter the Elephant. Why Hadoop? Scalable; low risk of losing data, thanks to replication; runs on a shared production cluster; no overhead to maintain a framework; easy job submission and management.
  • 20. Basic Approach. Leverage Pig for the POC: Pig Latin is easy for developers and data analysts to learn; rapid application development vs. M/R applications (i.e. 1 line of Pig Latin = 20 lines in Java Map/Reduce); extendable via UDFs.
  • 21. Performance of Basic Approach on Various Samples [chart: Aggregation Performance; y-axis: Time (minutes), 0 to 80; x-axis: input data size, 372 GB (3%), 744 GB (6%), 1116 GB (9%)]. Note: target data size is over 10 TB.
  • 22. M/R Data Flow [diagram: unordered input keys (B, C, A, B, C, A) feed three mappers; the shuffle groups records by key (A A, B B, C C); three reducers each emit one key (A, B, C)]
  • 23. Basic Approach Retrospective. Processing speed is not scaling to our needs on a sample of the input data. Diagnosis: most aggregations could not take significant advantage of combiners (not a Pig issue); large shuffles caused poor job performance, and in some cases large aggregations ran slower on the Hadoop cluster than on the current architecture. Conclusion: a new approach is required to reduce the shuffle.
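One plausible reading of this diagnosis, offered as an interpretation rather than something the slide states: a combiner can collapse event counts to one partial sum per key per mapper, but a distinct-cookie count still has to ship every distinct (site, cookie) pair to the reducers. The toy simulation below counts shuffled records in both cases; the function name and the sample data are hypothetical.

```python
from collections import Counter

def shuffle_records(mapper_inputs, use_combiner):
    """Return (records shuffled for event counts, records shuffled for uniques).

    mapper_inputs: one list of (site, cookie_id) rows per mapper.
    """
    count_records = 0
    unique_records = 0
    for rows in mapper_inputs:
        if use_combiner:
            # Event counts collapse to one partial sum per site.
            count_records += len(Counter(site for site, _ in rows))
            # Uniques can only drop duplicates; every distinct
            # (site, cookie) pair must still reach a reducer.
            unique_records += len(set(rows))
        else:
            count_records += len(rows)
            unique_records += len(rows)
    return count_records, unique_records

# Two hypothetical mappers, each seeing 1,000 distinct cookies on one site.
mapper_inputs = [
    [("a.com", f"m{m}-c{i}") for i in range(1000)]
    for m in range(2)
]
without = shuffle_records(mapper_inputs, use_combiner=False)
with_combiner = shuffle_records(mapper_inputs, use_combiner=True)
print(without, with_combiner)
```

In this toy case the combiner shrinks the event-count shuffle from 2,000 records to 2, while the uniques shuffle stays at 2,000.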
  • 24. Solution to reduce the shuffle. The Problem: most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues. The Idea: partition and sort data on a daily basis; create a custom input format to merge daily partitions for monthly aggregations.
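The merge idea can be sketched with the Python standard library. This is a minimal single-process sketch, not the custom Hadoop input format itself: it assumes each day's data is already sorted by (site, cookie_id), leaves out the partitioning step, and uses a hypothetical function name.

```python
import heapq
from itertools import groupby

def monthly_aggregate(daily_partitions):
    """Merge per-day sorted (site, cookie_id) lists and aggregate.

    Yields (site, event_count, unique_cookies) in site order.
    """
    merged = heapq.merge(*daily_partitions)  # lazy k-way merge of sorted inputs
    for site, rows in groupby(merged, key=lambda row: row[0]):
        events = 0
        uniques = 0
        previous_cookie = None
        for _site, cookie_id in rows:
            events += 1
            # Rows arrive sorted by cookie within a site, so a change
            # of cookie_id marks a new distinct cookie.
            if cookie_id != previous_cookie:
                uniques += 1
                previous_cookie = cookie_id
        yield site, events, uniques

day1 = sorted([("a.com", "c1"), ("a.com", "c2"), ("b.com", "c1")])
day2 = sorted([("a.com", "c1"), ("b.com", "c3"), ("b.com", "c3")])
result = list(monthly_aggregate([day1, day2]))
print(result)
```

Because the merged stream is already key ordered, both the event count and the unique count for a site are finished as soon as its last row has passed, which is what allows most of the work to happen before any shuffle.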
  • 25. Custom Input Format with Map Side Aggregation [diagram: input keys (B, C, A, B, C, A) are delivered so that each of three mappers receives a single key (A, B, C); each mapper feeds a combiner; three reducers emit A, B, C]
  • 26. Performance of v2 on Various Samples [chart: Aggregation Performance; y-axis: Time (minutes), 0 to 120; x-axis: input data size, 372 GB (3%), 744 GB (6%), 1116 GB (9%), 10304 GB (100%); series: Pig, Custom Input Format]
  • 27. Partitioning Summary. Benefits: a large portion of the aggregation can be completed in the map phase; applications can now take advantage of combiners; shuffle sizes are minimal. Risks: data locality loss; map failures might result in long run times, depending on the size of the partitions.
  • 28. Full Sample Performance. Full set of data analysis: 10 TB of input data; 150 billion session rows. Total time: 1 hour, 45 minutes; over 23,000,000 rows/sec.
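The rows/sec figure follows directly from the two numbers on this slide. The short check below also compares it with the rates quoted for Windows v1 and v2 on slides 16 and 18; those comparisons are only rough, since the earlier rates were measured on different hardware and data volumes.

```python
ROWS = 150_000_000_000             # session rows, from the slide
WALL_SECONDS = (1 * 60 + 45) * 60  # 1 hour 45 minutes = 6,300 seconds

hadoop_rate = ROWS / WALL_SECONDS  # about 23.8 million rows/sec

V1_RATE = 250_000    # rows/sec, Windows v1 (slide 16)
V2_RATE = 8_000_000  # rows/sec, Windows v2 (slide 18, quoted as "over")

print(f"Hadoop: {hadoop_rate:,.0f} rows/sec")
print(f"vs v1: {hadoop_rate / V1_RATE:.0f}x, vs v2: {hadoop_rate / V2_RATE:.1f}x")
```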
  • 29. Future Ideas. HBase: unique cookie calculations are free as data is more organized; how will data loading fare? Data Locality: ideally we could provide additional clues to the storage of the data; not sure if it will be included in Hadoop. Connection to an MPP DB: we also leverage Greenplum DB, and we could connect to each sharded instance.
  • 30. Hadoop Cluster. Production Hadoop cluster: 80 nodes, a mix of Dell R710 and R510; each R510 has 12x2TB drives, 64GB RAM, 24 cores; 1768 total CPUs; 4.7TB total memory; 1200TB total disk space; our distro is MapR M5 1.2.7.
  • 31. Useful Factoids: colorful, bite-sized graphical representations of the best discoveries we unearth. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
  • 32. Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com