The document describes the evolution of Facebook's big data architectures from 2007 to 2011. The stack started as a traditional data warehouse built around MySQL and grew significantly over time. Facebook moved to Hadoop and Hive in 2008 to enable data science at scale and to keep all instrumented data online. In 2009, the team democratized data access with tools that made it broadly usable. Later improvements focused on isolation, efficiency, utilization, and monitoring to control the growing chaos. By 2011, they had developed Puma for real-time analytics and Peregrine for fast queries, moving beyond batch Hadoop.
Evolution of Big Data Architectures at Facebook
1. Evolution of Big Data Architectures @ Facebook
Architecture Summit, Shenzhen, August 2012
Ashish Thusoo
2. About Me
• Currently Co-founder/CEO of Qubole
• Ran the Data Infrastructure Team at Facebook until 2011
• Co-founded Apache Hive @ Facebook
3. Outline
• Big Data @ Facebook - Scope & Scale
• Evolution of Big Data Architectures @ FB
• Qubole
4. Big Data @ FB (2011): Scale
• 25 PB of compressed data ~ 150 PB of uncompressed data
• 400 TB/day (uncompressed) of new data
• 1 new job every second
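The scale numbers above imply a compression ratio worth spelling out. A quick back-of-the-envelope check, using only the figures from this slide:

```python
# Arithmetic behind the slide's numbers (all values taken from the slide).
compressed_pb = 25
uncompressed_pb = 150
ratio = uncompressed_pb / compressed_pb  # ~6:1 compression ratio

new_data_tb_per_day = 400  # uncompressed
# At that ratio, daily ingest lands as roughly this much compressed data:
compressed_tb_per_day = new_data_tb_per_day / ratio  # ~67 TB/day
```

So the warehouse was compressing roughly 6:1, and the 400 TB/day of raw logs settled into about 67 TB/day of compressed storage.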
5. Big Data @ FB: Scope
• Simple reporting
• Model generation
• Ad hoc analysis + data science
• Index generation
• Many, many others...
10. 2007: Traditional EDW
[Architecture diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Summarization Cluster, RDBMS Data Warehouse]
11. 2007: Pain Points
• Daily ETL took more than 24 hours
• Lots of tuning/indexes etc.
• Lots of hardware planning
[Architecture diagram as before; the Summarization Cluster is annotated "compute close to storage (early map/reduce)"]
12. 2007: Limitations
• Most use cases were in business metrics - data science, model building etc. not possible
• Only summary data was stored online - details archived away
13. 2008: Move to Hadoop
[Architecture diagram: the 2007 stack - Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Summarization Cluster, RDBMS Data Warehouse]
14. 2008: Move to Hadoop
[Architecture diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, and MySQL Clusters feed batch copier/loaders into a Hadoop/Hive Data Warehouse; the RDBMS becomes a Data Mart]
15. 2008: Immediate Pros
• Data science at scale became possible
• For the first time all of the instrumented data could be held online
• Use cases expanded
16. 2009: Democratizing Data
[Architecture diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Hadoop/Hive Data Warehouse, RDBMS Data Mart]
17. 2009: Democratizing Data
• Nectar: data instrumentation & schema-aware data collection
• Databee & Chronos: pipeline framework
• HiPal: ad hoc queries + data discovery
• Scrapes: configuration driven
[In the diagram, all of the above surround the Hadoop/Hive Data Warehouse]
18. 2009: Democratizing Data (Nectar)
• Typical Nectar pipeline
• Simple schema evolution built in
• JSON-encoded short-term data
• Decomposing JSON for long-term storage
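The pairing of JSON-encoded short-term data with "decomposed" long-term storage can be sketched in a few lines: recent records stay as self-describing JSON, while older data is broken into per-column arrays, which avoid repeating keys and compress far better. The function and field names below are illustrative, not Nectar's actual API:

```python
import json

def decompose(json_lines):
    """Turn JSON-encoded records into a column -> list-of-values mapping.

    Hypothetical sketch of the idea on the slide; assumes every record
    carries the same keys (real schema evolution is more involved).
    """
    columns = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

# Short-term data: one JSON object per line, keys repeated in every record.
short_term = [
    '{"user_id": 1, "action": "click"}',
    '{"user_id": 2, "action": "view"}',
]
# Long-term form: each column stored once, as a contiguous array of values.
long_term = decompose(short_term)
# long_term == {"user_id": [1, 2], "action": ["click", "view"]}
```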
19. 2009: Democratizing Data (Tools)
• HiPal - data discovery and query authoring
• Charting and dashboard generation tools
20. 2009: Democratizing Data (Tools)
• Databee: workflow language
• Chronos: scheduling tool
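The division of labor between a workflow language and a scheduler can be illustrated generically: pipelines declare task dependencies, and the scheduler runs tasks in dependency order. This is not Databee or Chronos code; the task names and dict-based spec are invented for illustration:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A workflow declared as task -> list of upstream tasks it depends on.
# Names are made up; a real pipeline framework adds retries, SLAs, etc.
workflow = {
    "load_logs": [],
    "sessionize": ["load_logs"],
    "daily_report": ["sessionize"],
    "dashboard": ["daily_report"],
}

# The scheduler's core job: produce an order where every task runs only
# after all of its dependencies have finished.
order = list(TopologicalSorter(workflow).static_order())
# order == ["load_logs", "sessionize", "daily_report", "dashboard"]
```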
21. 2009: Cons of Democratization
• Needed isolation to protect against bad jobs
• Needed fair sharing of the cluster - what is a high-priority job, and how to enforce it?
23. 2010: Isolation
[Architecture diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Hadoop/Hive Data Warehouse]
24. 2010: Isolation
[Architecture diagram: the warehouse is split into a Platinum Warehouse and a Silver Warehouse, kept in sync via Hive Replication; Web Clusters, Scribe Mid-Tier, NAS Filers, and MySQL Clusters feed in]
25. 2010: Ops Efficiency
• ptail: parallel tail on HDFS
• Near-real-time data consumers
[Architecture diagram: Web Clusters log via Scribe into HDFS; ptail feeds near-real-time data consumers; Platinum Warehouse replicates to Silver Warehouse via Hive Replication; MySQL Clusters feed in]
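The core idea behind a "parallel tail on HDFS" is tail-follow semantics: start at the current end of a growing log and stream only newly appended lines. A minimal single-file sketch of that behavior (ptail itself ran this in parallel across HDFS files, with checkpointing; none of that is shown here):

```python
import os
import time

def read_new_lines(f):
    """Return all complete lines appended to open file `f` since the last read."""
    return [line.rstrip("\n") for line in f.readlines()]

def follow(path, poll_seconds=1.0):
    """Generator: yield lines appended to `path`, starting from its current end.

    Illustrative sketch only - a local-file stand-in for what a tool like
    ptail does against files in HDFS.
    """
    with open(path) as f:
        f.seek(0, os.SEEK_END)  # skip existing data; only stream new lines
        while True:
            for line in read_new_lines(f):
                yield line
            time.sleep(poll_seconds)  # nothing new yet; wait and poll again
```

A consumer would iterate `follow("/var/log/app.log")` and process each line as it arrives, which is what makes near-real-time downstream consumers possible.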
26. 2010: Resource Utilization (Disk)
• HDFS-RAID: from 3 replicas to 2.2 replicas
• RCFile: row-columnar format for compressing Hive tables
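The replica numbers on this slide translate directly into raw-disk savings. A quick calculation using only the figures given here and the 25 PB from the scale slide:

```python
# Effective storage overhead: plain 3x replication vs. HDFS-RAID's 2.2x.
plain_replication = 3.0
raided_replication = 2.2

# Fractional raw-disk saving for the same logical data:
savings = 1 - raided_replication / plain_replication  # ~0.267, i.e. ~27% less disk

# Applied to the 25 PB of compressed data mentioned earlier:
raw_before = 25 * plain_replication   # ~75 PB of raw disk
raw_after = 25 * raided_replication   # ~55 PB of raw disk
```

Shaving roughly 27% of raw disk at 25 PB scale - about 20 PB - is why erasure-coding-style schemes like HDFS-RAID were worth the operational complexity.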
27. 2010: Resource Utilization (CPU)
• Continuous copier/loaders
• Incremental scrapes
• Hive optimizations to save CPU
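An "incremental scrape" saves CPU (and I/O) by fetching only rows changed since the last run, tracked with a watermark, instead of re-copying a whole MySQL table. The function, table, and column names below are invented for illustration:

```python
def incremental_scrape(rows, last_watermark):
    """Return rows modified after `last_watermark`, plus the new watermark.

    Hypothetical sketch: `rows` stands in for a source table, and each row
    carries a monotonically increasing `modified_at` value (a timestamp or
    change-sequence number).
    """
    new_rows = [r for r in rows if r["modified_at"] > last_watermark]
    new_watermark = max(
        (r["modified_at"] for r in new_rows), default=last_watermark
    )
    return new_rows, new_watermark

table = [
    {"id": 1, "modified_at": 100},
    {"id": 2, "modified_at": 250},
    {"id": 3, "modified_at": 300},
]
# Only rows 2 and 3 changed since the saved watermark; the watermark
# advances so the next run skips them too.
changed, watermark = incremental_scrape(table, last_watermark=200)
```

Persisting `watermark` between runs is what turns a full nightly copy into a cheap incremental one.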
28. 2010: Monitoring (SLAs)
• Per-job statistics rolled up to owner/group/team
• Expected time of arrival vs. actual time of arrival of data
• Simple data quality metrics
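The expected-vs-actual arrival check is simple enough to sketch: for each dataset, compare when it was supposed to land with when it actually did, and flag anything past a grace window. Dataset names and the 30-minute grace period below are illustrative assumptions:

```python
from datetime import datetime, timedelta

def sla_report(expected, actual, grace=timedelta(minutes=30)):
    """Return datasets whose arrival missed expected + grace (or never arrived).

    `expected` and `actual` map dataset name -> datetime. Illustrative
    sketch of the SLA check described on the slide, not Facebook's tooling.
    """
    late = {}
    for name, eta in expected.items():
        arrived = actual.get(name)
        if arrived is None or arrived > eta + grace:
            late[name] = arrived
    return late

expected = {"daily_clicks": datetime(2011, 6, 1, 6, 0)}
actual = {"daily_clicks": datetime(2011, 6, 1, 7, 15)}
# daily_clicks landed 75 minutes after its ETA, past the 30-minute grace,
# so it shows up in the report.
late = sla_report(expected, actual)
```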
29. 2011: New Requirements
• More real-time requirements for aggregations
• Optimizing resource utilization
30. 2011: Beyond Hadoop
• Puma for real-time analytics
• Peregrine for simple and fast queries
31. 2011: Puma
[Architecture diagram, as in slide 25: Web Clusters log via Scribe into HDFS; ptail feeds near-real-time data consumers; Platinum Warehouse replicates to Silver Warehouse via Hive Replication; MySQL Clusters feed in]
33. Some takeaways
• Operating and optimizing data infrastructure is a hard problem
• Lots of components: log collection, storage, compute, query processing, tools and interfaces
• Lots of choices within each part of the stack
34. Qubole
• Mission: Data Infrastructure in the Cloud made Easy, Fast and Reliable
• We take care of operating and optimizing this infrastructure so that you can focus on your data, analysis, algorithms and building your data apps
35. Qubole - Information
• Early trial (by invitation): www.qubole.com
• Come talk to us to join a small and passionate team: jobs@qubole.com
• Follow us on twitter/facebook/linkedin