20081022cca

Data Management at Facebook
(Back in the Day)

Jeff Hammerbacher
VP Product and Chief Scientist, Cloudera
October 22, 2008

My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Managed the Facebook Data Team through September 2008
▪ Over 25 amazing engineers and data scientists
▪ Now a cofounder of Cloudera
▪ Hadoop support and optimization

Common Themes
1. Simplicity
▪ Do one thing well ...
2. Scalability
▪ ... a lot
3. Manageability
▪ Remove the humans
4. Open Source
▪ Build a community

Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
▪ Three main server profiles:
▪ Web Tier (more than 10,000 servers)
▪ Memcached Tier (around 1,000 servers)
▪ MySQL Tier (around 2,000 servers)
▪ Simplified away:
▪ AJAX
▪ Photo and Video
▪ Services
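
As a rough illustration of how a page request might touch the three tiers, the sketch below has web-tier code check a cache before falling back to the database. The look-aside pattern, the `FakeCache` and `FakeDatabase` classes, and the key format are all assumptions made for illustration; the slides do not describe Facebook's actual read path.

```python
class FakeDatabase:
    """Stand-in for the MySQL tier (hypothetical; not a real client)."""

    def __init__(self, rows):
        self.rows = rows
        self.queries = 0  # counts how often the database is hit

    def get(self, key):
        self.queries += 1
        return self.rows.get(key)


class FakeCache:
    """Stand-in for the Memcached tier: an in-memory key/value store."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


def fetch_profile(user_id, cache, db):
    """Web-tier logic: try the cache first, then the database."""
    key = f"profile:{user_id}"  # made-up key format
    value = cache.get(key)
    if value is None:
        value = db.get(key)
        if value is not None:
            cache.set(key, value)  # populate the cache for later requests
    return value


db = FakeDatabase({"profile:42": {"name": "Example User"}})
cache = FakeCache()
first = fetch_profile(42, cache, db)   # cache miss, reads the database
second = fetch_profile(42, cache, db)  # cache hit, database untouched
```

In this toy run the second request is answered from the cache, so the database sees only one query.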

Services Infrastructure
What’s an SOA?
▪ Almost all services written in Thrift
▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
▪ Network transport libraries
▪ Serialization libraries
▪ Code generation
▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper

Services Infrastructure
Thrift, Mainly
▪ Developing a Thrift service:
▪ Define your data structures
▪ JSON-like data model
▪ Define your service endpoints
▪ Select your languages
▪ Generate stub code
▪ Write service logic
▪ Write client
▪ Configure and deploy
▪ Monitor, provision, and upgrade
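
The development steps above can be mimicked in plain Python to show their shape. This is only an analogy: it uses the standard library's `json` and `dataclasses` instead of Thrift, there is no code generation step, and the `Note` type, `NotesHandler` class, and `add_note` endpoint are hypothetical names that do not come from the slides.

```python
import json
from dataclasses import dataclass, asdict


# Define your data structures (hypothetical example type).
@dataclass
class Note:
    author_id: int
    body: str


# Define the service endpoint and write the service logic behind it.
class NotesHandler:
    def __init__(self):
        self.notes = []

    def add_note(self, note: Note) -> int:
        self.notes.append(note)
        return len(self.notes)  # number of notes stored so far


def serve(handler, request_bytes: bytes) -> bytes:
    """Decode a request, dispatch to the handler, encode the reply."""
    request = json.loads(request_bytes)
    method = getattr(handler, request["method"])
    result = method(Note(**request["args"]))
    return json.dumps({"result": result}).encode()


# Write a client that serializes the call and reads the reply.
def call_add_note(handler, note: Note) -> int:
    request = json.dumps({"method": "add_note", "args": asdict(note)}).encode()
    return json.loads(serve(handler, request))["result"]


handler = NotesHandler()
count = call_add_note(handler, Note(author_id=1, body="hello"))
```

In a real Thrift service the serialization, dispatch, and client stub would be generated from the interface definition rather than written by hand as they are here.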

Data Infrastructure
Offline Batch Processing
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth

Diagram labels: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server

Data Infrastructure
Distributed Processing with Cheetah
▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
▪ Distributor.pl: distribute binaries to processing nodes
▪ C++ Binaries: parse, agg, load
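
The parse, agg, and load stages can be sketched in a few lines of Python. This is an illustrative guess at the shape of such a pipeline, not Cheetah itself (which used C++ binaries): the tab-separated log format and event names are invented, and `load` here simply merges results in memory where a real system would write them to the warehouse.

```python
from collections import Counter


def parse(line):
    """Parse one log line in a made-up 'timestamp<TAB>event' format."""
    timestamp, event = line.rstrip("\n").split("\t")
    return event


def aggregate(partition):
    """Summarize one log partition as event counts."""
    return Counter(parse(line) for line in partition)


def load(partials):
    """Merge per-partition summaries into one final table."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Two partitions of a log file; in a distributed system each would be
# handled by a different processing node.
partitions = [
    ["1\tpage_view", "2\tclick"],
    ["3\tpage_view", "4\tpage_view"],
]
summary = load(aggregate(p) for p in partitions)
```

Because each partition is summarized independently, only the small per-partition counts need to be combined at the end.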

Diagram labels: Partitioned Log File, Cheetah Master, Filer, Processing Tier

Data Infrastructure
Moving from Cheetah to Hadoop
▪ Cheetah limitations
▪ Limited filer bandwidth
▪ No centralized logfile metadata
▪ Writing a new Cheetah job requires writing C++ binaries
▪ Jobs are difficult to monitor and debug
▪ No support for ad hoc querying
▪ Not open source

Data Infrastructure
Hadoop as Enterprise Data Warehouse
Diagram labels: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers

Initial Hadoop Applications
Unstructured text analysis
▪ Intern asked to understand brand sentiment and influence
▪ Many tools for supporting his project had to be built
▪ Understanding serialization format of wall post logs
▪ Common data operations: project, filter, join, group by
▪ Developed using Hadoop streaming for rapid prototyping in Python
▪ Scheduling regular processing and recovering from failures
▪ Making it easy to regularly load new data
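
A minimal sketch of the project, filter, and group-by operations in the Hadoop streaming style is shown below. Hadoop streaming passes tab-separated key/value lines between a mapper and a reducer and sorts them by key in between; the three-column log layout and the brand names are invented for illustration, and the real wall post log format is not given in the slides.

```python
from itertools import groupby


def mapper(lines):
    """Project and filter: keep wall posts, emit 'brand<TAB>1'."""
    for line in lines:
        kind, user, brand = line.rstrip("\n").split("\t")  # made-up layout
        if kind == "wall_post":      # filter
            yield f"{brand}\t1"      # project to the fields we need


def reducer(sorted_lines):
    """Group by key: input arrives sorted by key, as in streaming."""
    pairs = (line.split("\t") for line in sorted_lines)
    for brand, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{brand}\t{sum(int(value) for _, value in group)}"


log = [
    "wall_post\tu1\tacme",
    "status\tu2\tacme",
    "wall_post\tu3\tglobex",
    "wall_post\tu4\tacme",
]
# Locally, sorted() plays the role of Hadoop's shuffle-and-sort phase.
counts = list(reducer(sorted(mapper(log))))
```

In an actual streaming job the mapper and reducer would be separate scripts that read from standard input and write to standard output.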

Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
▪ Can be used for regression or classification
▪ See “Random Forests” by Leo Breiman
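
The "build many trees and average them" idea can be sketched with the standard library alone. This is a deliberately simplified toy, not Breiman's full algorithm and not the Hadoop implementation: each "tree" is a single-split stump on a randomly chosen feature, trained on a bootstrap sample, and the data set is made up.

```python
import random


def fit_stump(rows, targets, rng):
    """Fit a one-split regression tree on a randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    threshold = rng.choice(rows)[feature]
    left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
    overall = sum(targets) / len(targets)
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return feature, threshold, left_mean, right_mean


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw rows with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


rows = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0), (3.0, 0.0)]
targets = [0.0, 0.0, 1.0, 1.0]  # target depends only on the first feature
forest = fit_forest(rows, targets, n_trees=50)
low = predict_forest(forest, (0.0, 0.5))
high = predict_forest(forest, (3.0, 0.5))
```

Each tree is trained independently of the others, which is what makes this kind of ensemble a natural fit for a batch system that runs many tasks in parallel.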

More Hadoop Applications
Insights
▪ Monitor performance of your Facebook Ad, Page, Application
▪ Regular aggregation of high volumes of log file data
▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop

More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
▪ Notifications
▪ News Feed story insertion
▪ Invitations
▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application

Hive
Structured Data Management with Hadoop
▪ Hadoop:
▪ HDFS
▪ MapReduce
▪ Resource Manager
▪ Job Scheduler
▪ Hive:
▪ Logical data partitioning
▪ Metadata store (command line and web interfaces)
▪ Query Operators
▪ Query Language
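
To make logical data partitioning concrete, the sketch below builds `key=value` directory names of the kind Hive uses for partitions and then selects only the partition a query needs. The table location, the `ds` partition key, and the helper functions are assumptions for illustration and are not taken from the slides.

```python
from posixpath import join


def partition_path(table_root, **partition):
    """Build a 'key=value' directory path for one partition."""
    parts = [f"{key}={value}" for key, value in partition.items()]
    return join(table_root, *parts)


def prune(paths, key, value):
    """Keep only the partitions a query on key=value needs to read."""
    wanted = f"{key}={value}"
    return [p for p in paths if wanted in p.split("/")]


root = "/warehouse/page_views"  # hypothetical table location
paths = [partition_path(root, ds=day)
         for day in ("2008-10-20", "2008-10-21", "2008-10-22")]
to_scan = prune(paths, "ds", "2008-10-22")
```

A query restricted to one day then reads one directory instead of the whole table.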

Hive
The Team
▪ Joydeep Sen Sarma
▪ Ashish Thusoo
▪ Pete Wyckoff
▪ Suresh Anthony
▪ Zheng Shao
▪ Venky Iyer
▪ Dhruba Borthakur
▪ Namit Jain
▪ Raghu Murthy
▪ Prasad Chakka

Hive
Some Stats
▪ Cluster size - 320 nodes, 2560 cores, 1.3 PB capacity
▪ Total data (compressed, deduplicated) - 180 TB
▪ Net data per day
▪ 10 TB uncompressed - 4 TB from databases, 6 TB from logs
▪ Over 2 TB compressed
▪ Data Processing Statistics
▪ 3,200 Jobs and 800,000 Tasks per day
▪ 55 TB of compressed data processed per day
▪ 15 TB of compressed data produced per day
▪ 80 M minutes of compute time per day
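
A few derived figures follow directly from the numbers above. The sketch below only does the arithmetic; it takes "over 2 TB" as exactly 2 TB and 1.3 PB as 1,300 TB, both of which are simplifying assumptions.

```python
nodes, cores = 320, 2560
jobs_per_day, tasks_per_day = 3_200, 800_000
raw_tb_per_day, compressed_tb_per_day = 10, 2  # "over 2 TB" taken as 2
capacity_tb, stored_tb = 1_300, 180            # 1.3 PB taken as 1,300 TB

cores_per_node = cores // nodes                # 8 cores per node
tasks_per_job = tasks_per_day // jobs_per_day  # 250 tasks per job on average

# Upper bound on the daily compression ratio, since the compressed
# volume is stated as "over" 2 TB.
max_compression_ratio = raw_tb_per_day / compressed_tb_per_day

# Share of raw capacity taken by the stored data, ignoring any
# replication overhead.
fill_fraction = stored_tb / capacity_tb
```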

Cassandra
Structured Storage over a P2P Network
▪ Conceptually: BigTable data model on Dynamo infrastructure
▪ Design Goals:
▪ High availability
▪ Incremental scalability
▪ Eventual consistency (trade consistency for availability)
▪ Optimistic replication
▪ Low total cost of ownership
▪ Minimal administrative overhead
▪ Tunable tradeoffs between consistency, durability, and latency
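
One common way to expose a consistency versus latency tradeoff in Dynamo-style systems is through read and write quorums: with N replicas, a write waits for W acknowledgements and a read consults R replicas, and reads are guaranteed to see the latest write only when R + W > N. The sketch below checks that rule by brute force. It illustrates the general quorum idea and is not a description of Cassandra's actual configuration options at the time.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """True if every write set of size w meets every read set of size r."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))


n = 3
strong = quorums_overlap(n, w=2, r=2)  # R + W > N: sets always intersect
weak = quorums_overlap(n, w=1, r=1)    # R + W <= N: a read can miss a write
```

Smaller W and R mean fewer replicas to wait for, which lowers latency at the cost of possibly reading stale data.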

Cassandra
Initial Application
▪ Inbox search

Cassandra
The Team
▪ Avinash Lakshman
▪ Prashant Malik
▪ Karthik Ranganathan
▪ Kannan Muthukkaruppan

Cassandra
Some Stats
▪ Cluster size - 120 nodes
▪ Single instance across two data centers
▪ Total data stored - 36 TB
▪ Writes - 300 million per day
▪ Reads - 1 million per day
▪ Read Latencies
▪ Min - 6.03 ms
▪ Mean - 90.6 ms
▪ Median - 18.24 ms
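
Converting the daily figures above into per-second and per-node terms makes the workload easier to picture. The sketch below is plain arithmetic on the stated numbers and assumes load and data are spread evenly over the day and across nodes, which the slides do not claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60

writes_per_day = 300_000_000
reads_per_day = 1_000_000
nodes, total_tb = 120, 36

writes_per_second = writes_per_day / SECONDS_PER_DAY  # about 3,472
reads_per_second = reads_per_day / SECONDS_PER_DAY    # about 11.6
write_read_ratio = writes_per_day / reads_per_day     # 300 writes per read
gb_per_node = total_tb * 1000 / nodes                 # 300 GB per node

mean_ms, median_ms = 90.6, 18.24
# A mean well above the median suggests a long tail of slow reads.
mean_to_median = mean_ms / median_ms                  # about 5
```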
