Protocol Buffers and Hadoop at Twitter

Hadoop and Protocol Buffers at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter


Outline
‣ Problem Statement
‣ CSV? XML? JSON? Regex?
‣ Protocol Buffers
‣ Codegen, Hadoop and You
‣ Applications
‣ Conclusions and Next Steps

My Background
‣ Studied Mathematics and Physics at Harvard, Physics at
Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms,
GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, large-scale data analysis and
visualization, social graph analysis, machine learning, lots more
data

The Challenge
‣ Store 100 billion tweets in a way that is
‣ Robust to changes
‣ Efficient in size and speed
‣ Amenable to large-scale analysis
‣ Reusable (especially for other classes of data, like logs, where the size gets really large)

The System
‣ Your (friend’s) Hadoop cluster

The Data
‣ kevin@tw-mbp-kweil ~ $ curl http://api.twitter.com/1/statuses/show/9225259353.xml

<?xml version="1.0" encoding="UTF-8"?>
<status>
  <created_at>Wed Feb 17 08:01:13 +0000 2010</created_at>
  <id>9225259353</id>
  <text>Preparing slides for tomorrow's talk at Y! at the Hadoop User Group: Protobufs and Hadoop at Twitter. See you there? http://bit.ly/9DJcd9</text>
  <source><a href="http://www.tweetdeck.com/" rel="nofollow">TweetDeck</a></source>
  <truncated>false</truncated>
  <in_reply_to_status_id></in_reply_to_status_id>
  <in_reply_to_user_id></in_reply_to_user_id>
  <favorited>false</favorited>
  <in_reply_to_screen_name></in_reply_to_screen_name>
  <user>
    <id>3452911</id>
    <name>Kevin Weil</name>
    <screen_name>kevinweil</screen_name>
    <location>Portola Valley, CA</location>
    <description>Analytics Lead at Twitter. Ultra-marathons, cycling, hadoop, lolcats.</description>
    <profile_image_url>http://a3.twimg.com/profile_images/220257539/n206489_34325699_8572_normal.jpg</profile_image_url>
    <url></url>
    <protected>false</protected>
    <followers_count>3122</followers_count>
    <profile_background_color>B2DFDA</profile_background_color>
    <profile_text_color>333333</profile_text_color>
    <profile_link_color>93A644</profile_link_color>
    <profile_sidebar_fill_color>ffffff</profile_sidebar_fill_color>
    <profile_sidebar_border_color>eeeeee</profile_sidebar_border_color>
    <friends_count>436</friends_count>
    <created_at>Wed Apr 04 19:29:46 +0000 2007</created_at>
    <favourites_count>721</favourites_count>
    <utc_offset>-28800</utc_offset>
    <time_zone>Pacific Time (US & Canada)</time_zone>
    <profile_background_image_url>http://s.twimg.com/a/1266345225/images/themes/theme13/bg.gif</profile_background_image_url>
    <profile_background_tile>false</profile_background_tile>
    <notifications>false</notifications>
    <geo_enabled>true</geo_enabled>
    <verified>false</verified>
    <following>false</following>
    <statuses_count>2556</statuses_count>
    <lang>en</lang>
    <contributors_enabled>false</contributors_enabled>
  </user>
  <geo/>
  <contributors/>
</status>

‣ Each tweet has 12 fields, 3 of which (user, geo, contributors) have subfields
‣ It can change as we add new features

The Requirements
‣ Splittability
‣ Parsing efficiency
‣ Reusability
‣ Ability to add new fields
‣ Ability to ignore unused fields
‣ Small data size
‣ Hierarchical

Common Formats
[Comparison table; the per-cell marks are not recoverable]
Columns: Splittable | Parsing efficiency | Reusability | Add new fields | Ignore unused fields | Small data size | Hierarchical
Rows: XML | JSON | CSV | Custom regex (Apache)

Enter Protocol Buffers
‣ “Protocol Buffers are a way of encoding structured data in an
efficient yet extensible format. Google uses Protocol Buffers for
almost all of its internal RPC protocols and file formats.”
  ‣ http://code.google.com/p/protobuf
‣ You write IDL describing your data structure
‣ It generates code in your languages of choice to construct, serialize,
deserialize, reflect across, etc., your data structure
‣ Like Thrift, but richer and more efficient (except no RPC)
‣ Avro is an exciting up-and-coming alternative

Protobuf IDL Example
message Status {
  optional string created_at = 1;
  optional int64 id = 2;
  optional string text = 3;
  optional string source = 4;
  optional bool truncated = 5;
  optional int64 in_reply_to_status_id = 6;
  optional int64 in_reply_to_user_id = 7;
  optional bool favorited = 8;
  optional string in_reply_to_screen_name = 9;
  // Nested message fields are declared as: optional <Type> <field_name> = <tag>;
  optional User user = 10;
  optional Geo geo = 11;
  optional Contributors contributors = 12;

  message User {
    optional int64 id = 1;
    optional string name = 2;
    ...
  }
  message Geo { ... }
  message Contributors { ... }
}
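To make the size argument concrete, here is a minimal Python sketch of how two of the fields above (id = 2, text = 3) would be laid out on the wire. It hand-rolls the base-128 varint and field-key encoding described in the Protocol Buffers encoding documentation. The helper names are made up for this illustration, and real code would use the classes generated by protoc instead.

```python
# Hand-rolled sketch of protobuf wire encoding (illustration only;
# helper names are hypothetical, not part of any protobuf library).

WIRE_VARINT = 0            # wire type for int32/int64/bool
WIRE_LENGTH_DELIMITED = 2  # wire type for string/bytes/nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int64_field(field_number: int, value: int) -> bytes:
    return encode_key(field_number, WIRE_VARINT) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return (encode_key(field_number, WIRE_LENGTH_DELIMITED)
            + encode_varint(len(data)) + data)


# Field numbers 2 and 3 come from the Status message above.
id_bytes = encode_int64_field(2, 9225259353)
text_bytes = encode_string_field(3, "hi")
xml_id = "<id>9225259353</id>"

print(len(id_bytes), "bytes as protobuf vs", len(xml_id), "bytes as XML")
```

The tweet id costs one key byte plus a five-byte varint, while the XML element spends most of its bytes on the tag names.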

Protobuf Generated Code
‣ The generated code is:
  ‣ Efficient (Google quotes 80x vs. |-delimited format) [1, 2]
  ‣ Extensible
  ‣ Backwards compatible
  ‣ Polymorphic (in Java, C++, Python)
  ‣ Metadata-rich

1. http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
2. http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
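"Extensible" and "backwards compatible" follow from the wire format: every field carries its number and wire type, so a reader built against an older schema can skip fields it does not know. The Python sketch below illustrates that idea with a hand-written parser. The function names and field number 13 are assumptions for this example, and generated protobuf code does this for you.

```python
# Sketch: an "old" reader skipping a field added by a "new" writer.
# Hand-written for illustration; not the protobuf library's API.

def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def parse_known_fields(buf: bytes, known: dict) -> dict:
    """Return {name: value} for field numbers in `known`; skip the rest."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (string, bytes, message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known:
            fields[known[field_number]] = value
        # Unknown field numbers were consumed above and are simply dropped.
    return fields


# id = 150 (field 2), text = "hi" (field 3), plus a hypothetical new
# field 13 that the old schema has never heard of.
new_message = b"\x10\x96\x01" + b"\x1a\x02hi" + b"\x68\x01"
old_schema = {2: "id", 3: "text"}

parsed = parse_known_fields(new_message, old_schema)
print(parsed)
```

The old reader still recovers id and text, which is what lets the schema grow without rewriting existing jobs.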

Common Formats
[Comparison table; the per-cell marks are not recoverable]
Columns: Splittable | Parsing efficiency | Reusability | Add new fields | Ignore unused fields | Small data size | Hierarchical
Rows: XML | JSON | CSV | Custom regex (Apache) | Protocol Buffers

But Wait, There’s More
‣ Codegen for data structures is nice...
‣ Next step: codegen for all Hadoop-related code
  ‣ Protocol Buffer InputFormats
  ‣ OutputFormats
  ‣ Writables
  ‣ Pig LoadFuncs and StoreFuncs
  ‣ Cascading, Streaming, Dumbo, etc.
  ‣ Per Protocol Buffer

‣ All objects (hierarchical data, inheritance, etc.)
‣ All automatically generated
‣ Efficient, extensible storage and serialization

Pig LoadFuncs
‣ All objects (hierarchical data, inheritance, etc.)
‣ All automatically generated
‣ Even the load statement itself is codegen
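As a rough idea of what "the load statement itself is codegen" could look like, the Python sketch below builds a Pig LOAD statement from a list of (field name, protobuf type) pairs. Everything specific here is an assumption for illustration: the loader class name com.example.pig.StatusLoader, the input path, and the protobuf-to-Pig type mapping are not taken from the talk or from Twitter's code.

```python
# Sketch: deriving a Pig LOAD statement from a message's field list.
# The type mapping and the loader class name are hypothetical.

PROTO_TO_PIG = {
    "string": "chararray",
    "int64": "long",
    "int32": "int",
}


def pig_load_statement(alias, path, loader_class, fields):
    """Build `alias = LOAD 'path' USING loader() AS (schema);`."""
    columns = ", ".join(
        f"{name}: {PROTO_TO_PIG[proto_type]}" for name, proto_type in fields
    )
    return f"{alias} = LOAD '{path}' USING {loader_class}() AS ({columns});"


# A few fields from the Status message shown earlier.
status_fields = [("created_at", "string"), ("id", "int64"), ("text", "string")]

statement = pig_load_statement(
    "statuses", "/data/statuses", "com.example.pig.StatusLoader", status_fields
)
print(statement)
```

Because the schema lives in one place (the IDL), the field list and the load statement cannot drift apart.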

Where do these work?
‣ Java MapReduce APIs (InputFormats, OutputFormats, Writables)
‣ Deprecated Java MapReduce APIs (same)
  ‣ Enables Streaming, Dumbo, Cascading
‣ Pig
‣ HBase

Counting Big Data
‣ standard counts, min, max, std dev
‣ How many requests do we serve in a day?
‣ What is the average latency? 95% latency?
‣ Group by response code. What is the hourly distribution?
‣ How many searches happen each day on Twitter?
‣ How many unique queries, how many unique users?
‣ What is their geographic distribution?
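On a single machine, the counting questions above reduce to a few lines. The Python sketch below uses a made-up list of (response code, latency in ms) records and a nearest-rank 95th percentile. The data and function names are assumptions for illustration, and at Twitter's scale the same aggregations would run as Pig or MapReduce jobs.

```python
# Sketch: request counts, average latency, 95th-percentile latency and
# a group-by on response code over a tiny, invented request log.
import math
from collections import Counter


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# (response_code, latency_ms) pairs; purely illustrative data.
requests = [
    (200, 12), (200, 15), (500, 250), (200, 9), (404, 30),
    (200, 14), (200, 11), (200, 13), (200, 10), (200, 16),
]

latencies = [latency for _, latency in requests]
summary = {
    "requests": len(requests),
    "avg_latency_ms": sum(latencies) / len(latencies),
    "p95_latency_ms": percentile_nearest_rank(latencies, 95),
    "by_response_code": Counter(code for code, _ in requests),
}
print(summary)
```

The single slow request pulls the average far above the typical latency, which is why the slide asks about the 95% latency as well as the mean.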

Correlating Big Data
‣ probabilities, covariance, influence
‣ How does usage differ for mobile users?
‣ How about for users with 3rd party desktop clients?
‣ Cohort analyses
‣ Site problems: what goes wrong at the same time?
‣ Which features get users hooked?
‣ Which features do successful users use often?
‣ Search corrections, search suggestions
‣ A/B testing

Research on Big Data
‣ prediction, graph analysis, natural language
‣ What can we tell about a user from their tweets?
‣ From the tweets of those they follow?
‣ From the tweets of their followers?
‣ From the ratio of followers/following?
‣ What graph structures lead to successful networks?
‣ User reputation

Research on Big Data
‣ prediction, graph analysis, natural language
‣ Sentiment analysis
‣ What features get a tweet retweeted?
‣ How deep is the corresponding retweet tree?
‣ Long-term duplicate detection
‣ Machine learning
‣ Language detection
‣ ... the list goes on.

Resolution
‣ All we do now is write IDL for the data schema
‣ Get efficient, forward/backwards compatible, splittable data structures automatically generated for us
‣ Get loaders, input formats, output formats, writables, and schemas automatically generated for us
‣ Helps the Twitter analytics team stay agile
  ‣ Can handle new, complex data without the need for new code, new tests, new bugs
  ‣ Focus on the analysis, not data formats

Twitter Open Source
‣ Coming soon! (1-2 weeks) http://github.com/kevinweil
‣ All base classes for InputFormats, OutputFormats, Writables, Pig
Loaders, etc
‣ For new and deprecated MapReduce API
‣ With and without LZO compression (see http://github.com/kevinweil/hadoop-lzo)
‣ Protobuf reflection helpers
‣ Serialized block storage format for HDFS

Questions? Follow me at
twitter.com/kevinweil

‣ If this sounded interesting to you -- that’s because it is. And we’re hiring.
