SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
Data Management at Facebook
(Back in the Day)

Jeff Hammerbacher
VP Product and Chief Scientist, Cloudera
October 22, 2008
My Background
Thanks for Asking
▪   hammer@cloudera.com
▪   Studied Mathematics at Harvard
▪   Worked as a Quant on Wall Street
▪   Came to Facebook in early 2006 as a Research Scientist
▪   Managed the Facebook Data Team through September 2008
    ▪   Over 25 amazing engineers and data scientists
▪   Now a cofounder of Cloudera
    ▪   Hadoop support and optimization
Common Themes
1. Simplicity
 ▪   Do one thing well ...
2. Scalability
 ▪   ... a lot
3. Manageability
 ▪   Remove the humans
4. Open Source
 ▪   Build a community
Serving Facebook.com
Data Retrieval and Hardware
                                                   GET /index.php HTTP/1.1
                                                   Host: www.facebook.com
▪   Three main server profiles:
    ▪   Web
    ▪   Memcached
                                                                           Web Tier
    ▪   MySQL                                                      (more than 10,000 Servers)




▪   Simplified away:
                                    Memcached Tier
    ▪   AJAX                     (around 1,000 servers)
                                                                  MySQL Tier
                                                             (around 2,000 servers)
    ▪   Photo and Video
    ▪   Services
Services Infrastructure
What’s an SOA?
▪   Almost all services written in Thrift
    ▪   Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes
        Import, Scribe
▪   Batteries included
    ▪   Network transport libraries
    ▪   Serialization libraries
    ▪   Code generation
    ▪   Robust server implementations (multithreaded, nonblocking, etc.)
▪   Now an Apache Incubator project
▪   For more information, read the whitepaper
Services Infrastructure
Thrift, Mainly
▪   Developing a Thrift service:
    ▪   Define your data structures
        ▪   JSON-like data model
    ▪   Define your service endpoints
    ▪   Select your languages
    ▪   Generate stub code
    ▪   Write service logic
    ▪   Write client
    ▪   Configure and deploy
    ▪   Monitor, provision, and upgrade
Data Infrastructure
Offline Batch Processing
                                              Scribe Tier                     MySQL Tier

▪   “Data Warehousing”
▪   Began with Oracle database
▪   Schedule data collection via cron
▪   Collect data every 24 hours
▪   “ETL” scripts: hand-coded Python                        Data Collection
                                                                Server
▪   Data volumes quickly grew
    ▪   Started at tens of GB in early 2006                 Oracle Database
                                                                 Server
    ▪   Up to about 1 TB per day by mid-2007
    ▪   Log files largest source of data growth
Data Infrastructure
Distributed Processing with Cheetah
▪   Goal: summarize log files outside of the database
▪   Solution: Cheetah, a distributed log file processing system
    ▪   Distributor.pl: distribute binaries to processing nodes
    ▪   C++ Binaries: parse, agg, load




                       Partitioned Log File
                                                                Cheetah Master




                                Filer         Processing Tier
Data Infrastructure
Moving from Cheetah to Hadoop
▪   Cheetah limitations
    ▪   Limited filer bandwidth
    ▪   No centralized logfile metadata
    ▪   Writing a new Cheetah job requires writing C++ binaries
    ▪   Jobs are difficult to monitor and debug
    ▪   No support for ad hoc querying
    ▪   Not open source
Data Infrastructure
Hadoop as Enterprise Data Warehouse
               Scribe Tier     MySQL Tier




       Hadoop Tier




          Oracle RAC Servers
Initial Hadoop Applications
Unstructured text analysis
▪   Intern asked to understand brand sentiment and influence


▪   Many tools for supporting his project had to be built
    ▪   Understanding serialization format of wall post logs
    ▪   Common data operations: project, filter, join, group by
    ▪   Developed using Hadoop streaming for rapid prototyping in
        Python
    ▪   Scheduling regular processing and recovering from failures
    ▪   Making it easy to regularly load new data
Lexicon
Initial Hadoop Applications
Ensemble Learning
▪   Build a lot of Decision Trees and average them
    ▪   Random Forests are a combination of tree predictors such that
        each tree depends on the values of a random vector sampled
        independently and with the same distribution for all trees in the
        forest
    ▪   Can be used for regression or classification
    ▪   See “Random Forests” by Leo Breiman
More Hadoop Applications
Insights
▪   Monitor performance of your Facebook Ad, Page, Application
▪   Regular aggregation of high volumes of log file data
▪   First hourly pipelines
▪   Publish data back to a MySQL tier
▪   System currently only running partially on Hadoop
Insights
More Hadoop Applications
Platform Application Reputation Scoring
▪   Users complaining about being spammed by Platform
    applications
▪   Now, every Platform Application has a set of quotas
    ▪   Notifications
    ▪   News Feed story insertion
    ▪   Invitations
    ▪   Emails
▪   Quotas determined by calculating a “reputation score” for the
    application
Hive
Structured Data Management with Hadoop
▪   Hadoop:
    ▪   HDFS
    ▪   MapReduce
    ▪   Resource Manager
    ▪   Job Scheduler
▪   Hive:
    ▪   Logical data partitioning
    ▪   Metadata store (command line and web interfaces)
    ▪   Query Operators
    ▪   Query Language
Hive
Hive
The Team
▪   Joydeep Sen Sarma
▪   Ashish Thusoo
▪   Pete Wyckoff
▪   Suresh Anthony
▪   Zheng Shao
▪   Venky Iyer
▪   Dhruba Borthakur
▪   Namit Jain
▪   Raghu Murthy
▪   Prasad Chakka
Hive
Some Stats
▪   Cluster size - 320 nodes, 2560 cores, 1.3 PB capacity
▪   Total data (compressed, deduplicated) - 180 TB
▪   Net data per day
    ▪   10 TB uncompressed - 4 TB from databases, 6 TB from logs
    ▪   Over 2 TB compressed
▪   Data Processing Statistics
    ▪   3,200 Jobs and 800,000 Tasks per day
    ▪   55 TB of compressed data processed per day
    ▪   15 TB of compressed data produced per day
    ▪   80 M minutes of compute time per day
Cassandra
Structured Storage over a P2P Network
▪   Conceptually: BigTable data model on Dynamo infrastructure
▪   Design Goals:
    ▪   High availability
    ▪   Incremental scalability
    ▪   Eventual consistency (trade consistency for availability)
    ▪   Optimistic replication
    ▪   Low total cost of ownership
    ▪   Minimal administrative overhead
    ▪   Tunable tradeoffs between consistency, durability, and latency
Cassandra
Architecture
Cassandra
Initial Application
▪   Inbox search
Cassandra
The Team
▪   Avinash Lakshman
▪   Prashant Malik
▪   Karthik Ranganathan
▪   Kannan Muthukkaruppan
Cassandra
Some Stats
▪   Cluster size - 120 nodes
    ▪   Single instance across two data centers
▪   Total data stored - 36 TB
▪   Writes - 300 million writes per day.
▪   Reads - 1 million reads per day.
▪   Read Latencies
    ▪   Min - 6.03 ms
    ▪   Mean - 90.6 ms
    ▪   Median - 18.24 ms
(c) 2008 Facebook, Inc. or its licensors.  quot;Facebookquot; is a registered trademark of Facebook, Inc.. All rights reserved. 1.0

Contenu connexe

Tendances

Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyershuguk
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)Farzin Bagheri
 
GNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen ShapiraGNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen Shapiragluent.
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.Dan Harvey
 
Building a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodBuilding a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodDatabricks
 
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]Rainforest QA
 
Technologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise BusinessTechnologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise BusinessSATOSHI TAGOMORI
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...Michael Stack
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Databricks
 
How Cloudflare analyzes -1m dns queries per second @ Percona E17
How Cloudflare analyzes -1m dns queries per second @ Percona E17How Cloudflare analyzes -1m dns queries per second @ Percona E17
How Cloudflare analyzes -1m dns queries per second @ Percona E17Tom Arnfeld
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Zhenxiao Luo
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceChin Huang
 
HBaseConAsia2018 Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
HBaseConAsia2018  Track2-2: Apache Kylin on HBase: Extreme OLAP for big dataHBaseConAsia2018  Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
HBaseConAsia2018 Track2-2: Apache Kylin on HBase: Extreme OLAP for big dataMichael Stack
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 

Tendances (20)

Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)
 
GNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen ShapiraGNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen Shapira
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.
 
Building a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodBuilding a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFood
 
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
 
Technologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise BusinessTechnologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise Business
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
 
How Cloudflare analyzes -1m dns queries per second @ Percona E17
How Cloudflare analyzes -1m dns queries per second @ Percona E17How Cloudflare analyzes -1m dns queries per second @ Percona E17
How Cloudflare analyzes -1m dns queries per second @ Percona E17
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
HBaseConAsia2018 Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
HBaseConAsia2018  Track2-2: Apache Kylin on HBase: Extreme OLAP for big dataHBaseConAsia2018  Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
HBaseConAsia2018 Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 

Similaire à 20081022cca

AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Facebook Hadoop Data & Applications
Facebook Hadoop Data & ApplicationsFacebook Hadoop Data & Applications
Facebook Hadoop Data & Applicationsdzhou
 
My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At CraigslistMySQLConference
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics PlatformSantanu Dey
 
Storage Systems for High Scalable Systems Presentation
Storage Systems for High Scalable Systems PresentationStorage Systems for High Scalable Systems Presentation
Storage Systems for High Scalable Systems Presentationandyman3000
 
Cassandra at mahalo_com_scale_la_meetup_de
Cassandra at mahalo_com_scale_la_meetup_deCassandra at mahalo_com_scale_la_meetup_de
Cassandra at mahalo_com_scale_la_meetup_demahalomeetup
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
LuSql: (Quickly and easily) Getting your data from your DBMS into LuceneLuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
LuSql: (Quickly and easily) Getting your data from your DBMS into Luceneeby
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData
 
Collaborating with the Community
Collaborating with the CommunityCollaborating with the Community
Collaborating with the Communitytinacallahan
 
D Maeda Bi Portfolio
D Maeda Bi PortfolioD Maeda Bi Portfolio
D Maeda Bi PortfolioDMaeda
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupSri Ambati
 
Speeding Up The Snail
Speeding Up The SnailSpeeding Up The Snail
Speeding Up The SnailMarcus Deglos
 

Similaire à 20081022cca (20)

20080611accel
20080611accel20080611accel
20080611accel
 
20080528dublinpt1
20080528dublinpt120080528dublinpt1
20080528dublinpt1
 
Qcon
QconQcon
Qcon
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Facebook Hadoop Data & Applications
Facebook Hadoop Data & ApplicationsFacebook Hadoop Data & Applications
Facebook Hadoop Data & Applications
 
My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Storage Systems for High Scalable Systems Presentation
Storage Systems for High Scalable Systems PresentationStorage Systems for High Scalable Systems Presentation
Storage Systems for High Scalable Systems Presentation
 
20080528dublinpt2
20080528dublinpt220080528dublinpt2
20080528dublinpt2
 
The Web Scale
The Web ScaleThe Web Scale
The Web Scale
 
Cassandra at mahalo_com_scale_la_meetup_de
Cassandra at mahalo_com_scale_la_meetup_deCassandra at mahalo_com_scale_la_meetup_de
Cassandra at mahalo_com_scale_la_meetup_de
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
LuSql: (Quickly and easily) Getting your data from your DBMS into LuceneLuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene
 
20090309berkeley
20090309berkeley20090309berkeley
20090309berkeley
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Collaborating with the Community
Collaborating with the CommunityCollaborating with the Community
Collaborating with the Community
 
D Maeda Bi Portfolio
D Maeda Bi PortfolioD Maeda Bi Portfolio
D Maeda Bi Portfolio
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville Meetup
 
Speeding Up The Snail
Speeding Up The SnailSpeeding Up The Snail
Speeding Up The Snail
 

Plus de Jeff Hammerbacher (20)

20120223keystone
20120223keystone20120223keystone
20120223keystone
 
20100714accel
20100714accel20100714accel
20100714accel
 
20100608sigmod
20100608sigmod20100608sigmod
20100608sigmod
 
20100513brown
20100513brown20100513brown
20100513brown
 
20100423sage
20100423sage20100423sage
20100423sage
 
20100418sos
20100418sos20100418sos
20100418sos
 
20100301icde
20100301icde20100301icde
20100301icde
 
20100201hplabs
20100201hplabs20100201hplabs
20100201hplabs
 
20100128ebay
20100128ebay20100128ebay
20100128ebay
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20091110startup2startup
20091110startup2startup20091110startup2startup
20091110startup2startup
 
20091030nasajpl
20091030nasajpl20091030nasajpl
20091030nasajpl
 
20091027genentech
20091027genentech20091027genentech
20091027genentech
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
 
20090622 Velocity
20090622 Velocity20090622 Velocity
20090622 Velocity
 
20090422 Www
20090422 Www20090422 Www
20090422 Www
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
20081009nychive
20081009nychive20081009nychive
20081009nychive
 
2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao
 

Dernier

Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Dernier (20)

Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 

20081022cca

  • 1.
  • 2. Data Management at Facebook (Back in the Day) Jeff Hammerbacher VP Product and Chief Scientist, Cloudera October 22, 2008
  • 3. My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Came to Facebook in early 2006 as a Research Scientist ▪ Managed the Facebook Data Team through September 2008 ▪ Over 25 amazing engineers and data scientists ▪ Now a cofounder of Cloudera ▪ Hadoop support and optimization
  • 4. Common Themes 1. Simplicity ▪ Do one thing well ... 2. Scalability ▪ ... a lot 3. Manageability ▪ Remove the humans 4. Open Source ▪ Build a community
  • 5. Serving Facebook.com Data Retrieval and Hardware GET /index.php HTTP/1.1 Host: www.facebook.com ▪ Three main server profiles: ▪ Web ▪ Memcached Web Tier ▪ MySQL (more than 10,000 Servers) ▪ Simplified away: Memcached Tier ▪ AJAX (around 1,000 servers) MySQL Tier (around 2,000 servers) ▪ Photo and Video ▪ Services
  • 6. Services Infrastructure What’s an SOA? ▪ Almost all services written in Thrift ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe ▪ Batteries included ▪ Network transport libraries ▪ Serialization libraries ▪ Code generation ▪ Robust server implementations (multithreaded, nonblocking, etc.) ▪ Now an Apache Incubator project ▪ For more information, read the whitepaper
  • 7. Services Infrastructure Thrift, Mainly ▪ Developing a Thrift service: ▪ Define your data structures ▪ JSON-like data model ▪ Define your service endpoints ▪ Select your languages ▪ Generate stub code ▪ Write service logic ▪ Write client ▪ Configure and deploy ▪ Monitor, provision, and upgrade
  • 8. Data Infrastructure Offline Batch Processing Scribe Tier MySQL Tier ▪ “Data Warehousing” ▪ Began with Oracle database ▪ Schedule data collection via cron ▪ Collect data every 24 hours ▪ “ETL” scripts: hand-coded Python Data Collection Server ▪ Data volumes quickly grew ▪ Started at tens of GB in early 2006 Oracle Database Server ▪ Up to about 1 TB per day by mid-2007 ▪ Log files largest source of data growth
  • 9. Data Infrastructure Distributed Processing with Cheetah ▪ Goal: summarize log files outside of the database ▪ Solution: Cheetah, a distributed log file processing system ▪ Distributor.pl: distribute binaries to processing nodes ▪ C++ Binaries: parse, agg, load Partitioned Log File Cheetah Master Filer Processing Tier
  • 10. Data Infrastructure Moving from Cheetah to Hadoop ▪ Cheetah limitations ▪ Limited filer bandwidth ▪ No centralized logfile metadata ▪ Writing a new Cheetah job requires writing C++ binaries ▪ Jobs are difficult to monitor and debug ▪ No support for ad hoc querying ▪ Not open source
  • 11. Data Infrastructure Hadoop as Enterprise Data Warehouse Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers
  • 12. Initial Hadoop Applications Unstructured text analysis ▪ Intern asked to understand brand sentiment and influence ▪ Many tools for supporting his project had to be built ▪ Understanding serialization format of wall post logs ▪ Common data operations: project, filter, join, group by ▪ Developed using Hadoop streaming for rapid prototyping in Python ▪ Scheduling regular processing and recovering from failures ▪ Making it easy to regularly load new data
  • 14. Initial Hadoop Applications Ensemble Learning ▪ Build a lot of Decision Trees and average them ▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest ▪ Can be used for regression or classification ▪ See “Random Forests” by Leo Breiman
  • 15. More Hadoop Applications Insights ▪ Monitor performance of your Facebook Ad, Page, Application ▪ Regular aggregation of high volumes of log file data ▪ First hourly pipelines ▪ Publish data back to a MySQL tier ▪ System currently only running partially on Hadoop
  • 17. More Hadoop Applications Platform Application Reputation Scoring ▪ Users complaining about being spammed by Platform applications ▪ Now, every Platform Application has a set of quotas ▪ Notifications ▪ News Feed story insertion ▪ Invitations ▪ Emails ▪ Quotas determined by calculating a “reputation score” for the application
  • 18. Hive Structured Data Management with Hadoop ▪ Hadoop: ▪ HDFS ▪ MapReduce ▪ Resource Manager ▪ Job Scheduler ▪ Hive: ▪ Logical data partitioning ▪ Metadata store (command line and web interfaces) ▪ Query Operators ▪ Query Language
  • 19. Hive
  • 20. Hive The Team ▪ Joydeep Sen Sarma ▪ Ashish Thusoo ▪ Pete Wyckoff ▪ Suresh Anthony ▪ Zheng Shao ▪ Venky Iyer ▪ Dhruba Borthakur ▪ Namit Jain ▪ Raghu Murthy ▪ Prasad Chakka
  • 21. Hive Some Stats ▪ Cluster size - 320 nodes, 2560 cores, 1.3 PB capacity ▪ Total data (compressed, deduplicated) - 180 TB ▪ Net data per day ▪ 10 TB uncompressed - 4 TB from databases, 6 TB from logs ▪ Over 2 TB compressed ▪ Data Processing Statistics ▪ 3,200 Jobs and 800,000 Tasks per day ▪ 55 TB of compressed data processed per day ▪ 15 TB of compressed data produced per day ▪ 80 M minutes of compute time per day
  • 22. Cassandra Structured Storage over a P2P Network ▪ Conceptually: BigTable data model on Dynamo infrastructure ▪ Design Goals: ▪ High availability ▪ Incremental scalability ▪ Eventual consistency (trade consistency for availability) ▪ Optimistic replication ▪ Low total cost of ownership ▪ Minimal administrative overhead ▪ Tunable tradeoffs between consistency, durability, and latency
  • 25. Cassandra The Team ▪ Avinash Lakshman ▪ Prashant Malik ▪ Karthik Ranganathan ▪ Kannan Muthukkaruppan
  • 26. Cassandra Some Stats ▪ Cluster size - 120 nodes ▪ Single instance across two data centers ▪ Total data stored - 36 TB ▪ Writes - 300 million writes per day. ▪ Reads - 1 million reads per day. ▪ Read Latencies ▪ Min - 6.03 ms ▪ Mean - 90.6 ms ▪ Median - 18.24 ms
  • 27. (c) 2008 Facebook, Inc. or its licensors.  quot;Facebookquot; is a registered trademark of Facebook, Inc.. All rights reserved. 1.0