30B events a day with hadoop
- 1. 30 Billion Events a Day with Hadoop
Michael Brown, CTO, comScore, Inc.
May 10th, 2012
- 2. comScore is a Global Leader in Measuring the Digital World
– NASDAQ: SCOR
– Clients: 1,860+ worldwide
– Employees: 1,000+
– Headquarters: Reston, VA
– Global coverage: 170+ countries under measurement; 43 markets reported
– Local presence: 32 locations in 23 countries
© comScore, Inc. Proprietary. 2 V1011
- 3. Some of our Clients
Media Agencies Telecom/Mobile Financial Retail Travel CPG Pharma Technology
- 4. The Trusted Source for Digital Intelligence Across Vertical Markets
– 9 out of the top 10 INVESTMENT BANKS
– 9 out of the top 10 AUTO INSURERS
– 4 out of the top 4 WIRELESS CARRIERS
– 11 out of the top 12 INTERNET SERVICE PROVIDERS
– 47 out of the top 50 ONLINE PROPERTIES
– 14 out of the top 15 PHARMACEUTICAL COMPANIES
– 45 out of the top 50 ADVERTISING AGENCIES
– 11 out of the top 12 CONSUMER FINANCE COMPANIES
– 9 out of the top 10 MAJOR MEDIA COMPANIES
– 8 out of the top 10 CPG COMPANIES
- 5. Unified Digital Measurement™ (UDM) Establishes Platform For
Panel + Census Data Integration
Global PERSON Measurement (PANEL) + Global DEVICE Measurement (CENSUS)
Unified Digital Measurement (UDM)
Patent-Pending Methodology
Adopted by 90% of Top 100 U.S. Media Properties
- 7. Worldwide Tags per Month
Monthly Records Collection
[Chart: # of records collected per month, from 0 up to 1,000,000,000,000, Jul 2009 through Apr 2012; series: Panel Records, Beacon Records]
- 8. Our Event Volume in Perspective
Property          Page Views (MM)
FACEBOOK.COM      472,814
Google Sites      302,802
Yahoo! Sites       90,448
Total             866,064
Source: comScore Media Metrix, Worldwide, April 2012
- 9. Growth Slides
[Chart: monthly record volume growth with fitted trend line, R² = 0.9335; scale 0 to 1,600,000,000,000]
- 11. The Problem Statement
§ Calculate the number of events and unique cookies for each key
§ Key takeaways
– Input data is sessionized daily
– A full month of data must be processed
– Values must be calculated for the Total Internet and for each site under measurement
- 12. Counting Uniques from a Time Ordered Log File
[Diagram: a log in time order — A, D, B, C, B, A, A]
§ Major downsides:
– All key elements must be kept in memory
– Final aggregation is constrained to one machine
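The memory problem can be sketched in a few lines of Python (a toy illustration, not comScore's actual code; the small event list stands in for a sessionized log):

```python
# Counting uniques from a time-ordered log: every distinct key seen so far
# must stay in memory, and the final tally runs on a single machine.
def count_uniques_time_ordered(events):
    seen = set()  # grows with the number of distinct keys (billions at scale)
    total = 0
    for key in events:
        total += 1
        seen.add(key)
    return total, len(seen)

# Toy log in arrival order, as on the slide:
print(count_uniques_time_ordered(["A", "D", "B", "C", "B", "A", "A"]))  # (7, 4)
```

With 15 billion distinct cookies a month, the `seen` set alone makes this approach untenable on one box.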
- 13. Counting Uniques from a Key Ordered Log File
[Diagram: the same log sorted by key — A, A, A, B, B, C, D]
§ Major downsides:
– Data must be sorted in advance
– Sort time increases as volume grows
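Once the log is key-ordered, uniques fall out in constant memory: only the previous key needs to be remembered. A minimal Python sketch (again a toy, not the production code):

```python
def count_uniques_key_ordered(events):
    """Events arrive sorted by key, so a new unique is counted
    whenever the key changes; only one key is held in memory."""
    total, uniques, prev = 0, 0, None
    for key in events:
        total += 1
        if key != prev:
            uniques += 1
            prev = key
    return total, uniques

# Toy log in key order, as on the slide:
print(count_uniques_key_ordered(["A", "A", "A", "B", "B", "C", "D"]))  # (7, 4)
```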
- 14. Scaling Issue
§ As our volume has grown, the numbers now look like this:
– Over 900 billion events per month
– Over 150 billion sessions per month
– Over 5,000 reportable sites
– Over 50 countries
– We see 15 billion distinct cookies in a month
– 5 sites have over 1 billion cookies in a month
– The sum of all distinct cookies is 377 billion
– We only need to output 15 million rows
- 16. Windows v1 (Single Server)
§ Time to process the data for the first few months:
Month      Wall time (hours)
Jul 2009    8
Aug 2009   10
Sep 2009   11
Oct 2009   16
Nov 2009   37
§ V1 Processed sessions at roughly 250K rows/sec
§ Problems with this version:
– Slow
– Not scalable
– Required a dedicated server
– A bottleneck for production deliveries
- 18. Windows v2
§ Features of this version
– Distributed (32 servers)
– Multithreaded
– Data localization
– Very low network data transfer
– Kept up with the data growth
§ The V2 code processed data at over 8 million rows/sec
– 1 hour for Dec 2009; 5 hours for April 2012
§ Issues
– Data is distributed by ID into 64 parts
– Skew in the distribution key hurts performance and causes high disk usage on individual nodes
– All data replication and recovery is manual
– Results cannot be calculated if any node is down
– Adding servers or changing the number of parts takes enormous effort
– Overhead of maintaining a framework to run distributed jobs
- 19. Enter the Elephant
§ Why Hadoop?
– Scalable
– Low risk to lose data due to replication
– Run on a shared production cluster
– No overhead to maintain framework
– Easy job submission and management
- 20. Basic Approach
§ Leverage Pig for a proof of concept
– Pig Latin is easy for developers and data analysts to learn
– Rapid application development vs. raw M/R applications (roughly 1 line of Pig Latin per 20 lines of Java Map/Reduce)
– Extensible via UDFs
- 21. Performance of Basic Approach on Various Samples
[Chart: aggregation time in minutes (0–80) for input samples of 372 GB (3%), 744 GB (6%), and 1116 GB (9%)]
Note: target data size is over 10 TB
- 22. M/R Data Flow
[Diagram: input splits (B C A | B C A) feed the mappers; the shuffle routes all records for a key to a single reducer, and the reducers (A, B, C) emit the final aggregates]
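This flow can be simulated in-process with Python (hypothetical helper names; a real job distributes these phases across machines):

```python
from collections import defaultdict

def map_phase(split):
    # Each mapper emits a (key, 1) pair for every event in its input split.
    return [(key, 1) for key in split]

def shuffle(mapped):
    # The shuffle groups all values for a key onto one reducer; with no
    # combiner, every emitted pair crosses the network.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

splits = [["B", "C", "A"], ["B", "C", "A"]]
mapped = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle(mapped)))  # {'B': 2, 'C': 2, 'A': 2}
```

Note that `shuffle` receives one pair per input event; this is exactly the volume a combiner would normally cut down, and why the shuffle dominates when combiners cannot help.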
- 23. Basic Approach Retrospective
§ Processing speed was not scaling to our needs, even on a sample of the input data
§ Diagnosis
– Most aggregations could not take significant advantage of combiners. Not a Pig issue.
– Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster than on the existing architecture.
§ Conclusion
– A new approach is required to reduce the shuffle
- 24. Solution to reduce the shuffle
§ The Problem:
– Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and poor job performance
§ The Idea:
– Partition and sort the data on a daily basis
– Create a custom input format that merges the daily partitions for monthly aggregations
- 25. Custom Input Format with Map Side Aggregation
[Diagram: daily partitions pre-sorted by key let each mapper read a single key range (A, B, or C), run a combiner locally, and send only one partial aggregate per key through the shuffle to its reducer]
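A minimal Python sketch of the idea, under the assumption that each day's data is hash-partitioned by key so partition i of every day covers the same key range (toy data; the real implementation is a Hadoop custom InputFormat):

```python
from collections import Counter

def map_with_combiner(daily_partitions_for_range):
    """A mapper merges the daily partitions for one key range and
    aggregates locally (combiner-style) before the shuffle."""
    combined = Counter()
    for day in daily_partitions_for_range:
        combined.update(day)  # map-side aggregation over each day's records
    return dict(combined)     # only partial sums per key reach the shuffle

# Two days of data, pre-partitioned by key: partition 0 holds the "A" range.
day1 = {0: ["A", "A"], 1: ["B"], 2: ["C"]}
day2 = {0: ["A"], 1: ["B", "B"], 2: ["C"]}
print(map_with_combiner([day1[0], day2[0]]))  # {'A': 3}
```

Because each mapper sees every record for its key range, the shuffle carries one partial aggregate per key instead of one pair per event.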
- 26. Performance of v2 on Various Samples
[Chart: aggregation time in minutes (0–120) for input samples of 372 GB (3%), 744 GB (6%), 1116 GB (9%), and 10304 GB (100%); series: Pig vs. custom input format]
- 27. Partitioning Summary
§ Benefits:
– A large portion of the aggregation can be completed in the map phase
– Applications can now take advantage of combiners
– Shuffle sizes are minimal
§ Risks:
– Loss of data locality
– Map failures might result in long run times, depending on the size of the partitions
- 28. Full Sample Performance
§ Analysis of the full data set
– 10 TB of input data
– 150 billion session rows
§ Total time
– 1 hour, 45 minutes
– Over 23,000,000 rows/sec
- 29. Future Ideas
§ HBase
– Unique cookie calculations come nearly free because the data stays organized by key
– How will data loading fare?
§ Data locality
– Ideally we could provide additional placement hints for how the data is stored
– Not sure whether this will make it into Hadoop
§ Connection to an MPP DB
– We also use Greenplum DB; we could connect to each sharded instance
- 30. Hadoop Cluster
§ Production Hadoop Cluster
– 80 nodes: Mix of Dell R710 and R510
– Each R510 has 12 x 2 TB drives, 64 GB RAM, and 24 cores
– 1768 total CPUs
– 4.7TB total memory
– 1200TB total disk space
– Our distro is MapR M5 1.2.7
- 31. Useful Factoids
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
- 32. Thank You!
Michael Brown
CTO
comScore, Inc.
mbrown@comscore.com