1. Hadoop and Protocol Buffers at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
2. Outline
‣ Problem Statement
‣ CSV? XML? JSON? Regex?
‣ Protocol Buffers
‣ Codegen, Hadoop and You
‣ Applications
‣ Conclusions and Next Steps
3. My Background
‣ Studied Mathematics and Physics at Harvard, Physics at
Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms,
GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and
visualization, social graph analysis, machine learning, lots more
data
10. The Challenge
‣ Store 100 billion tweets in a way that is
‣ Robust to changes
‣ Efficient in size and speed
‣ Amenable to large-scale analysis
‣ Reusable (especially for other classes of data, like logs, where the size gets
really large)
12. The Data
‣ kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml
<?xml version="1.0" encoding="UTF-8"?>
<status>
  <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
  <id>9225259353</id>
  <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
  <source><a href="http://www.tweetdeck.com/" rel="nofollow">TweetDeck</a></source>
  <truncated>false</truncated>
  <in_reply_to_status_id></in_reply_to_status_id>
  <in_reply_to_user_id></in_reply_to_user_id>
  <favorited>false</favorited>
  <in_reply_to_screen_name></in_reply_to_screen_name>
  <user>
    <id>3452911</id>
    <name>Kevin Weil</name>
    <screen_name>kevinweil</screen_name>
    <location>Portola Valley, CA</location>
    <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
    <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
    <url></url>
    <protected>false</protected>
    <followers_count>3122</followers_count>
    <profile_background_color>B2DFDA</profile_background_color>
    <profile_text_color>333333</profile_text_color>
    <profile_link_color>93A644</profile_link_color>
    <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
    <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
    <friends_count>436</friends_count>
    <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
    <favourites_count>721</favourites_count>
    <utc_offset>-28800</utc_offset>
    <time_zone>Pacific Time (US &amp; Canada)</time_zone>
    <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
    <profile_background_tile>false</profile_background_tile>
    <notifications>false</notifications>
    <geo_enabled>true</geo_enabled>
    <verified>false</verified>
    <following>false</following>
    <statuses_count>2556</statuses_count>
    <lang>en</lang>
    <contributors_enabled>false</contributors_enabled>
  </user>
  <geo/>
  <contributors/>
</status>
‣ Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields
‣ It can change as we add new features
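The field count quoted on this slide can be checked mechanically. The sketch below parses an abbreviated copy of the status document with Python's standard library and lists the top-level fields and the ones that carry subfields. The shortened text and the trimmed `<user>` element are simplifications made here for brevity; in this particular sample `geo` and `contributors` are empty, so only `user` shows up as nested.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the status document: all 12 top-level fields are
# present, but the text and most <user> subfields are shortened/omitted.
STATUS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<status>
  <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
  <id>9225259353</id>
  <text>Preparing slides for tomorrow's talk</text>
  <source>TweetDeck</source>
  <truncated>false</truncated>
  <in_reply_to_status_id></in_reply_to_status_id>
  <in_reply_to_user_id></in_reply_to_user_id>
  <favorited>false</favorited>
  <in_reply_to_screen_name></in_reply_to_screen_name>
  <user>
    <id>3452911</id>
    <name>Kevin Weil</name>
    <screen_name>kevinweil</screen_name>
  </user>
  <geo/>
  <contributors/>
</status>
"""


def summarize(xml_bytes):
    """Return (top-level field names, names of fields that have children)."""
    root = ET.fromstring(xml_bytes)
    fields = [child.tag for child in root]
    nested = [child.tag for child in root if len(child) > 0]
    return fields, nested


fields, nested = summarize(STATUS_XML.encode("utf-8"))
print(len(fields), "top-level fields")
print("fields with subfields in this sample:", nested)
```

Every value comes back as a string, and every field name is repeated twice per record as an open and close tag, which is part of why the later slides look for a more compact, typed format.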
13. The Requirements
‣ Splittability
‣ Parsing efficiency
‣ Reusability
‣ Ability to add new fields
‣ Ability to ignore unused fields
‣ Small data size
‣ Hierarchical
21. Common Formats
‣ Formats compared: XML, JSON, CSV, custom regex (Apache)
‣ Criteria: splittable, parsing efficiency, reusability, add new fields, ignore
unused fields, small data size, hierarchical
26. Enter Protocol Buffers
‣ “Protocol Buffers are a way of encoding structured data in an
efficient yet extensible format. Google uses Protocol Buffers for
almost all of its internal RPC protocols and file formats.”
‣ http://code.google.com/p/protobuf
‣ You write IDL describing your data structure
‣ It generates code in your languages of choice to construct, serialize,
deserialize, reflect across, etc, your data structure
‣ Like Thrift, but richer and more efficient (except no RPC)
‣ Avro is an exciting up-and-coming alternative
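To make "efficient yet extensible" concrete, here is a minimal hand-rolled sketch of the Protocol Buffers wire format for two field kinds: varint integers (wire type 0) and length-delimited strings (wire type 2). It does not use the protobuf library or generated code, and the field numbers (1 = id, 2 = text, 3 = truncated) are hypothetical choices made for this example, not the schema used at Twitter.

```python
import json


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_uint(field_number, value):
    """Key (field number + wire type 0) followed by the varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string(field_number, text):
    """Key (field number + wire type 2), byte length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    key = encode_varint((field_number << 3) | 2)
    return key + encode_varint(len(data)) + data


# Hypothetical field numbers: 1 = id, 2 = text, 3 = truncated.
tweet = {"id": 9225259353, "text": "See you there?", "truncated": False}

encoded = (
    encode_uint(1, tweet["id"])
    + encode_string(2, tweet["text"])
    + encode_uint(3, int(tweet["truncated"]))
)
as_json = json.dumps(tweet).encode("utf-8")

print("protobuf-style bytes:", len(encoded))
print("JSON bytes:", len(as_json))
```

Field names never appear in the encoded bytes. Each value is tagged only with its field number, which is where most of the size saving over JSON or XML comes from.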
28. Protobuf Generated Code
‣ The generated code is:
‣ Efficient (Google quotes 80x vs. |-delimited format) [1, 2]
‣ Extensible
‣ Backwards compatible
‣ Polymorphic (in Java, C++, Python)
‣ Metadata-rich
[1] http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
[2] http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
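The "extensible" and "backwards compatible" properties follow from the wire format: every field carries its number and wire type, so a reader can step over fields it does not know. The sketch below hand-rolls a decoder for wire types 0 (varint) and 2 (length-delimited) only; the real format also has fixed 32-bit and 64-bit types, which are left out here. The schemas and field numbers are hypothetical and chosen for illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def field(field_number, wire_type, payload):
    """Prefix a payload with its key (field number and wire type)."""
    return encode_varint((field_number << 3) | wire_type) + payload


def decode_known(buf, known):
    """Decode the fields listed in `known`; silently skip all others."""
    pos = 0
    message = {}
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if field_number in known:
            message[known[field_number]] = value
    return message


# A record written by a newer producer that added field 4 (hypothetical).
new_record = (
    field(1, 0, encode_varint(9225259353))
    + field(2, 2, encode_varint(5) + b"hello")
    + field(4, 2, encode_varint(2) + b"CA")
)

OLD_SCHEMA = {1: "id", 2: "text"}
NEW_SCHEMA = {1: "id", 2: "text", 4: "location"}

old_view = decode_known(new_record, OLD_SCHEMA)
new_view = decode_known(new_record, NEW_SCHEMA)
print(old_view)
print(new_view)
```

The old reader still gets `id` and `text` from data written with the newer schema, which is the behavior the "add new fields" and "ignore unused fields" requirements ask for.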
29. Common Formats
‣ Formats compared: XML, JSON, CSV, custom regex (Apache), Protocol Buffers
‣ Criteria: splittable, parsing efficiency, reusability, add new fields, ignore
unused fields, small data size, hierarchical
37. But Wait, There’s More
‣ Codegen for data structures is nice...
‣ Next step: codegen for all Hadoop-related code
‣ Protocol Buffer InputFormats
‣ OutputFormats
‣ Writables
‣ Pig LoadFuncs and StoreFuncs
‣ Cascading, Streaming, Dumbo, etc
‣ Per Protocol Buffer
38. Protocol Buffer InputFormats
‣ All objects (hierarchical data, inheritance, etc)
‣ All automatically generated
‣ Efficient, extensible storage and serialization
39. Pig LoadFuncs
‣ All objects (hierarchical data, inheritance, etc)
‣ All automatically generated
‣ Even the load statement itself is codegen
40. Where do these work?
‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables)
‣ Deprecated Java MapReduce APIs (same)
‣ Enables Streaming, Dumbo, Cascading
‣ Pig
‣ HBase
42. Counting Big Data
‣ standard counts, min, max, std dev
‣ How many requests do we serve in a day?
‣ What is the average latency? 95% latency?
‣ Group by response code. What is the hourly distribution?
‣ How many searches happen each day on Twitter?
‣ How many unique queries, how many unique users?
‣ What is their geographic distribution?
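As a small in-memory illustration of these counting questions, the sketch below computes a request total, average latency, 95th-percentile latency, and per-code and per-hour counts. The request tuples are made-up sample data, and at the scale discussed in this talk the same aggregations would run as Pig or MapReduce jobs rather than in a single Python process. The percentile uses the nearest-rank definition, which is one of several common conventions.

```python
import math
from collections import Counter
from statistics import mean

# Made-up request log: (hour_of_day, response_code, latency_ms).
REQUESTS = [
    (0, 200, 12), (0, 200, 15), (0, 404, 8), (1, 200, 11),
    (1, 500, 240), (1, 200, 14), (2, 200, 13), (2, 200, 90),
    (2, 302, 9), (2, 200, 16),
]


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [latency for _, _, latency in REQUESTS]

total = len(REQUESTS)                                # requests served
avg = mean(latencies)                                # average latency
p95 = percentile(latencies, 95)                      # 95% latency
by_code = Counter(code for _, code, _ in REQUESTS)   # group by response code
by_hour = Counter(hour for hour, _, _ in REQUESTS)   # hourly distribution

print(total, avg, p95)
print(dict(by_code))
print(dict(by_hour))
```

The gap between the average and the 95th percentile in this sample shows why both numbers are asked for: a couple of slow requests dominate the tail without moving most users' experience.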
43. Correlating Big Data
‣ probabilities, covariance, influence
‣ How does usage differ for mobile users?
‣ How about for users with 3rd party desktop clients?
‣ Cohort analyses
‣ Site problems: what goes wrong at the same time?
‣ Which features get users hooked?
‣ Which features do successful users use often?
‣ Search corrections, search suggestions
‣ A/B testing
44. Research on Big Data
‣ prediction, graph analysis, natural language
‣ What can we tell about a user from their tweets?
‣ From the tweets of those they follow?
‣ From the tweets of their followers?
‣ From the ratio of followers/following?
‣ What graph structures lead to successful networks?
‣ User reputation
45. Research on Big Data
‣ prediction, graph analysis, natural language
‣ Sentiment analysis
‣ What features get a tweet retweeted?
‣ How deep is the corresponding retweet tree?
‣ Long-term duplicate detection
‣ Machine learning
‣ Language detection
‣ ... the list goes on.
47. Resolution
‣ All we do now is write IDL for the data schema
‣ Get efficient, forward/backwards compatible, splittable data structures
automatically generated for us
‣ Get loaders, input formats, output formats, writables, and schemas
automatically generated for us
‣ Helps the Twitter analytics team stay agile
‣ Can handle new, complex data without the need for new code, new tests, new bugs
‣ Focus on the analysis, not data formats
48. Twitter Open Source
‣ Coming soon! (1-2 weeks) http://github.com/kevinweil
‣ All base classes for InputFormats, OutputFormats, Writables, Pig
Loaders, etc
‣ For new and deprecated MapReduce APIs
‣ With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo)
‣ Protobuf reflection helpers
‣ Serialized block storage format for HDFS
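The slides do not describe the block storage format itself, so the following is only a generic sketch of how serialized records can be made splittable: records are grouped into blocks, and each block starts with a sync marker that a reader can scan for from an arbitrary byte offset. The marker value, the 4-byte length prefixes, and the function names are all assumptions made for this example; they are not taken from the Twitter code.

```python
import struct

# Hypothetical fixed 16-byte marker. A real format would pick one that is
# very unlikely to occur inside record data (for example, random bytes).
SYNC = bytes(range(16))


def write_blocks(records, records_per_block):
    """Group length-prefixed records into blocks, each led by SYNC + length."""
    out = bytearray()
    for i in range(0, len(records), records_per_block):
        body = bytearray()
        for rec in records[i:i + records_per_block]:
            body += struct.pack(">I", len(rec)) + rec
        out += SYNC + struct.pack(">I", len(body)) + body
    return bytes(out)


def read_split(data, start, end):
    """Return records from every block whose marker begins in [start, end)."""
    records = []
    pos = data.find(SYNC, start)  # skip ahead to the first block boundary
    while pos != -1 and pos < end:
        (body_len,) = struct.unpack_from(">I", data, pos + len(SYNC))
        cursor = pos + len(SYNC) + 4
        body_end = cursor + body_len
        while cursor < body_end:
            (rec_len,) = struct.unpack_from(">I", data, cursor)
            cursor += 4
            records.append(data[cursor:cursor + rec_len])
            cursor += rec_len
        pos = data.find(SYNC, body_end)
    return records


records = [b"tweet-%d" % i for i in range(10)]
data = write_blocks(records, records_per_block=3)

# Cut the file at an arbitrary byte offset, as a job's input splits would.
mid = len(data) // 2
first = read_split(data, 0, mid)
second = read_split(data, mid, len(data))
print(len(first), len(second))
```

Each block is read by exactly one split, the one containing the start of its marker, so two mappers working on adjacent byte ranges see every record once even though the cut falls in the middle of a block.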
49. Questions? Follow me at
twitter.com/kevinweil
‣ If this sounded interesting to you -- that’s because it is. And we’re hiring.