Evolution of Big Data Architectures @ Facebook
Architecture Summit, Shenzhen, August 2012
Ashish Thusoo
About Me

• Currently Co-founder/CEO of Qubole
• Ran the Data Infrastructure Team at Facebook till 2011
• Co-founded Apache Hive @ Facebook
Outline

• Big Data @ Facebook - Scope & Scale
• Evolution of Big Data Architectures @ FB
• Qubole
Big Data @ FB (2011): Scale

• 25 PB of compressed data ~ 150 PB of uncompressed data
• 400 TB/day (uncompressed) of new data
• 1 new job every second
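
The scale figures above imply a few derived numbers. The following is a rough sketch of that arithmetic. It assumes that newly arriving data compresses at the same ratio as the warehouse as a whole, which the slide does not state.

```python
# Figures taken from the slide (2011).
COMPRESSED_PB = 25
UNCOMPRESSED_PB = 150
NEW_DATA_TB_PER_DAY = 400  # uncompressed
JOBS_PER_SECOND = 1

# Overall compression ratio implied by the two warehouse sizes.
compression_ratio = UNCOMPRESSED_PB / COMPRESSED_PB

# Assumption: new data compresses at the same ratio as the warehouse overall.
compressed_tb_per_day = NEW_DATA_TB_PER_DAY / compression_ratio

# One new job per second, sustained over a full day.
jobs_per_day = JOBS_PER_SECOND * 24 * 60 * 60

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"compressed ingest: {compressed_tb_per_day:.1f} TB/day")
print(f"jobs per day: {jobs_per_day}")
```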
Big Data @ FB: Scope

• Simple reporting
• Model generation
• Ad hoc analysis + data science
• Index generation
• Many many others...
A/B Testing Email #1
A/B Testing Email #2
A/B Testing Email #2 is 3x Better
Evolution: 2007-2011
[Chart: DW Size in TB, by year]
• 2007: 15
• 2008: 250
• 2009: 800
• 2010: 8000
• 2011: 25000
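
The growth shown in the chart above can be expressed as year-over-year factors. This is a small sketch using only the five values read off the chart. The compound figure is a derived summary, not a number given in the talk.

```python
# DW size in TB per year, as read off the chart.
dw_size_tb = {2007: 15, 2008: 250, 2009: 800, 2010: 8000, 2011: 25000}

years = sorted(dw_size_tb)

# Growth factor from each year to the next.
yearly_growth = {
    year: dw_size_tb[year] / dw_size_tb[prev]
    for prev, year in zip(years, years[1:])
}

# Compound annual growth factor over the whole period.
span = years[-1] - years[0]
compound_growth = (dw_size_tb[years[-1]] / dw_size_tb[years[0]]) ** (1 / span)

for year, factor in yearly_growth.items():
    print(f"{year}: {factor:.1f}x over the previous year")
print(f"compound growth: about {compound_growth:.1f}x per year")
```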
2007: Traditional EDW
[Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Summarization Cluster, RDBMS Data Warehouse]
2007: Pain Points
[Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Summarization Cluster, RDBMS Data Warehouse]
Annotations on the diagram:
• compute close to storage (early map/reduce)
• daily ETL > 24 hours
• Lots of tuning/indexes etc.
• Lots of hardware planning
2007: Limitations
• Most use cases were in business metrics - data science, model building etc. were not possible
• Only summary data was stored online - details were archived away
2008: Move to Hadoop
[Diagram, before: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Summarization Cluster, RDBMS Data Warehouse]
[Diagram, after: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Batch copier/loaders, Hadoop/Hive Data Warehouse, RDBMS Data Mart]
2008: Immediate Pros
• Data science at scale became possible
• For the first time all of the instrumented data could be held online
• Use cases expanded
2009: Democratizing Data
[Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Hadoop/Hive Data Warehouse, RDBMS Data Mart]
2009: Democratizing Data
[Diagram: tools around the Hadoop/Hive Data Warehouse]
• Databee & Chronos: Data Pipeline Framework
• Nectar: instrumentation & schema aware data collection
• HiPal: Ad hoc Queries + Data Discovery
• Scrapes: Configuration Driven
2009: Democratizing Data (Nectar)
• Typical Nectar Pipeline
 • Simple schema evolution built in
 • json encoded short term data
 • decomposing json for long term storage
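
The idea of decomposing JSON-encoded events into columns, with schema evolution handled by appending new fields, can be sketched as follows. This is only an illustration: Nectar's actual formats and interfaces are not described in the slides, and the `decompose` function, the field names, and the "append-only schema" rule are all assumptions.

```python
import json


def decompose(json_lines, schema=None):
    """Turn JSON-encoded events into rows under an evolving column schema.

    `schema` is an ordered list of column names. Fields not seen before are
    appended to it, so older rows simply lack the newer columns.
    """
    schema = list(schema or [])
    events = [json.loads(line) for line in json_lines]

    # Schema evolution (assumed rule): append new fields, never remove or reorder.
    for event in events:
        for key in event:
            if key not in schema:
                schema.append(key)

    # Decompose each event into a fixed-width row; missing fields become None.
    rows = [tuple(event.get(col) for col in schema) for event in events]
    return schema, rows


# Hypothetical events: the second one introduces a new "locale" field.
short_term = [
    '{"user_id": 1, "action": "click"}',
    '{"user_id": 2, "action": "view", "locale": "en_US"}',
]
schema, rows = decompose(short_term, schema=["user_id", "action"])
print(schema)
for row in rows:
    print(row)
```

Keeping the schema append-only means rows written before a field existed stay readable without being rewritten.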
2009: Democratizing Data (Tools)
• HiPal - data discovery and query authoring
• Charting and dashboard generation tools
2009: Democratizing Data (Tools)

• Databee: Workflow language
• Chronos: Scheduling tool
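
A workflow language and a scheduler together have to answer one basic question: in what order can dependent tasks run? The sketch below shows that idea with Python's standard `graphlib` module. It is not Databee or Chronos syntax, which the slides do not show, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "load_logs": set(),
    "scrape_users": set(),
    "join_sessions": {"load_logs", "scrape_users"},
    "daily_report": {"join_sessions"},
}


def run_order(tasks):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(tasks).static_order())


order = run_order(pipeline)
print(order)
```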
2009: Cons of Democratization
• Isolation to protect against Bad Jobs
• Fair sharing of the cluster - what is a high priority job and how to enforce it
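
One common way to make "fair sharing" concrete is weighted max-min fairness: each pool of jobs gets slots in proportion to a weight, and slots a pool cannot use are handed to the others. The sketch below illustrates that general technique only. The slides do not say which policy Facebook used, and the pool names, weights, and demands are hypothetical.

```python
def fair_share(total_slots, pools):
    """Weighted max-min fair allocation of slots across pools.

    `pools` maps a pool name to (weight, demand), with weight > 0. No pool
    receives more than its demand; slots a pool cannot use are shared among
    the others in proportion to their weights.
    """
    allocation = {name: 0.0 for name in pools}
    active = {name for name, (_, demand) in pools.items() if demand > 0}
    remaining = float(total_slots)

    while active and remaining > 1e-9:
        total_weight = sum(pools[name][0] for name in active)
        satisfied = set()
        handed_out = 0.0
        for name in active:
            weight, demand = pools[name]
            offer = remaining * weight / total_weight
            grant = min(offer, demand - allocation[name])
            allocation[name] += grant
            handed_out += grant
            if demand - allocation[name] <= 1e-9:
                satisfied.add(name)
        remaining -= handed_out
        if not satisfied:
            break  # every active pool took its full offer
        active -= satisfied

    return allocation


# Hypothetical pools: (weight, demand in slots).
shares = fair_share(100, {
    "production": (3, 90),
    "adhoc": (1, 10),
    "research": (1, 50),
})
print(shares)
```

In this example the `adhoc` pool needs fewer slots than its weighted share, so the surplus is split between the other two pools by weight.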
2010: Controlling Chaos
• Isolation
• Reducing operational overhead
• Better resource utilization
• Measurement, ownership, accountability
2010: Isolation
[Diagram, before: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Hadoop/Hive Data Warehouse]
[Diagram, after: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]
2010: Ops Efficiency
[Diagram components: Web Clusters, Scribe HDFS, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]
• ptail: parallel tail on hdfs
• near real time data consumers
2010: Resource Utilization (Disk)

• HDFS-RAID: from 3 replicas to 2.2 replicas
• RCFile: Row columnar format for compressing Hive tables
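
The move from 3 replicas to 2.2 replicas can be checked with a little arithmetic. The sketch below assumes one particular parity layout that yields 2.2 (stripes of 10 data blocks at 2 replicas each, plus 1 parity block also at 2 replicas). The slide gives only the 3 and 2.2 figures, so the stripe parameters are an assumption for illustration.

```python
def effective_replication(stripe_blocks, data_replicas, parity_blocks, parity_replicas):
    """Physical blocks stored per logical data block for one parity stripe."""
    stored = stripe_blocks * data_replicas + parity_blocks * parity_replicas
    return stored / stripe_blocks


# Assumed layout: 10 data blocks per stripe at 2 replicas each,
# plus 1 parity block that is also kept at 2 replicas.
raided = effective_replication(10, 2, 1, 2)
plain = 3.0  # three full replicas, no parity

savings = 1 - raided / plain
print(f"effective replication: {raided:.1f}")
print(f"disk saved vs 3 replicas: {savings:.1%}")
```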
2010: Resource Utilization (CPU)
• Continuous copier/loaders
• Incremental scrapes
• Hive optimizations to save CPU
2010: Monitoring (SLAs)

• Per job statistics rolled up to owner/group/team
• Expected time of arrival vs Actual time of arrival of data
• Simple data quality metrics
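
Comparing expected and actual arrival times, then rolling the result up to an owner, might look like the sketch below. The record layout, owner names, and the choice of metrics (late-job count and worst delay) are hypothetical. The slides name the idea but not the implementation.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-job records: (owner, job, expected arrival, actual arrival).
jobs = [
    ("ads", "daily_clicks", datetime(2010, 6, 1, 6, 0), datetime(2010, 6, 1, 5, 45)),
    ("ads", "daily_spend", datetime(2010, 6, 1, 6, 0), datetime(2010, 6, 1, 7, 30)),
    ("growth", "signups", datetime(2010, 6, 1, 8, 0), datetime(2010, 6, 1, 8, 10)),
]


def rollup_by_owner(records):
    """Summarize lateness per owner: job count, late jobs, worst delay in minutes."""
    summary = defaultdict(lambda: {"jobs": 0, "late": 0, "worst_delay_min": 0.0})
    for owner, _job, expected, actual in records:
        # Early arrivals count as zero delay.
        delay_min = max((actual - expected).total_seconds() / 60, 0.0)
        stats = summary[owner]
        stats["jobs"] += 1
        stats["late"] += int(delay_min > 0)
        stats["worst_delay_min"] = max(stats["worst_delay_min"], delay_min)
    return dict(summary)


report = rollup_by_owner(jobs)
for owner, stats in report.items():
    print(owner, stats)
```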
2011: New Requirements

• More real time requirements for aggregations
• Optimizing resource utilization
2011: Beyond Hadoop


• Puma for real time analytics
• Peregrine for simple and fast queries
2011: Puma
[Diagram, context: Web Clusters, Scribe HDFS, ptail: parallel tail on hdfs, near real time data consumers, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]
[Diagram, Puma: Scribe HDFS, ptail: parallel tail on hdfs, Puma Clusters, HBase Cluster]
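
The diagram suggests a flow in which log lines are tailed as they arrive, aggregated, and kept in a key-value store. A minimal sketch of that kind of windowed streaming aggregation follows. It does not reflect Puma's actual design, which the slides do not detail: the `aggregate` function, the 60-second window, and the in-memory dict standing in for HBase are all assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed aggregation window


def aggregate(events, store=None):
    """Count events per (window start, key) as they stream in.

    `events` is an iterable of (unix_timestamp, key) pairs, such as lines read
    from a log tail. `store` stands in for a key-value store.
    """
    store = defaultdict(int) if store is None else store
    for timestamp, key in events:
        window_start = timestamp - timestamp % WINDOW_SECONDS
        store[(window_start, key)] += 1
    return store


# Hypothetical click events arriving in two batches.
counts = aggregate([(120, "page_a"), (125, "page_b"), (170, "page_a")])
counts = aggregate([(185, "page_a")], store=counts)
print(dict(counts))
```

Because counts are updated per event rather than recomputed by a batch job, the aggregates are available shortly after the data arrives.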
Some takeaways
• Operating and optimizing Data Infrastructure is a hard problem
 • Lots of components: log collection, storage, compute, query processing, tools and interfaces
 • Lots of choices within each part of the stack
Qubole
• Mission:
 • Data Infrastructure in the Cloud made Easy, Fast and Reliable
 • We take care of operating and optimizing this infrastructure so that you can focus on your data, analysis, algorithms and building your data apps
Qubole - Information
• Early Trial (by invitation):
 • www.qubole.com
• Come talk to us to join a small and passionate team
 • jobs@qubole.com
• Follow us on twitter/facebook/linkedin
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Evolution of Big Data Architectures at Facebook

  • 1. Evolution of Big Data Architectures @ Facebook • Architecture Summit, Shenzhen, August 2012 • Ashish Thusoo
  • 2. About Me • Currently Co-founder/CEO of Qubole • Ran the Data Infrastructure Team at Facebook till 2011 • Co-founded Apache Hive @ Facebook
  • 3. Outline • Big Data @ Facebook - Scope & Scale • Evolution of Big Data Architectures @ FB • Qubole
  • 4. Big Data @ FB (2011): Scale • 25 PB of compressed data ~ 150 PB of uncompressed data • 400 TB/day (uncompressed) of new data • 1 new job every second
  • 5. Big Data @ FB: Scope • Simple reporting • Model generation • Ad hoc analysis + data science • Index generation • Many, many others...
  • 8. A/B Testing Email #2 is 3x Better
  • 9. Evolution: 2007-2011 • DW size in TB (bar chart): 2007: 15 • 2008: 250 • 2009: 800 • 2010: 8000 • 2011: 25000
  • 10. 2007: Traditional EDW • Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, MySQL Clusters, RDBMS Data Warehouse
  • 11. 2007: Pain Points • Compute close to storage (early map/reduce) • Daily ETL > 24 hours • Lots of tuning/indexes etc. • Lots of hardware planning • Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, MySQL Clusters, RDBMS Data Warehouse
  • 12. 2007: Limitations • Most use cases were in business metrics - data science, model building etc. not possible • Only summary data was stored online - details archived away
  • 13. 2008: Move to Hadoop • Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, MySQL Clusters, RDBMS Data Warehouse
  • 14. 2008: Move to Hadoop • Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, Batch copier/loaders, Hadoop/Hive Data Warehouse, MySQL Clusters, RDBMS Data Mart
  • 15. 2008: Immediate Pros • Data science at scale became possible • For the first time all of the instrumented data could be held online • Use cases expanded
  • 16. 2009: Democratizing Data • Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, Hadoop/Hive Data Warehouse, MySQL Clusters, RDBMS Data Mart
  • 17. 2009: Democratizing Data • Nectar: instrumentation & schema aware data collection • Databee & Chronos: Data Pipeline Framework • HiPal: Adhoc Queries + Data Discovery • Scrapes: Configuration Driven • Hadoop/Hive Data Warehouse
  • 18. 2009: Democratizing Data (Nectar) • Typical Nectar Pipeline • Simple schema evolution built in • JSON-encoded short-term data • Decomposing JSON for long-term storage
  • 19. 2009: Democratizing Data (Tools) • HiPal - data discovery and query authoring • Charting and dashboard generation tools
  • 20. 2009: Democratizing Data (Tools) • Databee: Workflow language • Chronos: Scheduling tool
  • 21. 2009: Cons of Democratization • Isolation to protect against Bad Jobs • Fair sharing of the cluster - what is a high priority job and how to enforce it
  • 22. 2010: Controlling Chaos • Isolation • Reducing operational overhead • Better resource utilization • Measurement, ownership, accountability
  • 23. 2010: Isolation • Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, Hadoop/Hive Data Warehouse, MySQL Clusters
  • 24. 2010: Isolation • Diagram components: Web Clusters, Scribe Mid-Tier, NAS Filers, Platinum Warehouse, Hive Replication, Silver Warehouse, MySQL Clusters
  • 25. 2010: Ops Efficiency • Diagram components: Web Clusters, Scribe HDFS, ptail (parallel tail on hdfs), near real time data consumers, Platinum Warehouse, Hive Replication, Silver Warehouse, MySQL Clusters
  • 26. 2010: Resource Utilization (Disk) • HDFS-RAID: from 3 replicas to 2.2 replicas • RCFile: Row columnar format for compressing Hive tables
  • 27. 2010: Resource Utilization (CPU) • Continuous copier/ loaders • Incremental scrapes • Hive optimizations to save CPU
  • 28. 2010: Monitoring(SLAs) • Per job statistics rolled up to owner/group/team • Expected time of arrival vs Actual time of arrival of data • Simple data quality metrics
  • 29. 2011: New Requirements • More real time requirements for aggregations • Optimizing resource utilization
  • 30. 2011: Beyond Hadoop • Puma for real time analytics • Peregrine for simple and fast queries
  • 31. 2011: Puma • Diagram components: Web Clusters, Scribe HDFS, ptail (parallel tail on hdfs), near real time data consumers, Platinum Warehouse, Hive Replication, Silver Warehouse, MySQL Clusters
  • 32. 2011: Puma • Diagram components: Scribe HDFS, ptail (parallel tail on hdfs), Puma Clusters, HBase Cluster
  • 33. Some takeaways • Operating and optimizing Data Infrastructure is a hard problem • Lots of components from log collection, storage, compute, query processing, tools and interfaces • Lots of choices within each part of the stack
  • 34. Qubole • Mission: Data Infrastructure in the Cloud made Easy, Fast and Reliable • We take care of operating and optimizing this infrastructure so that you can focus on your data, analysis, algorithms and building your data apps
  • 35. Qubole - Information • Early Trial (by invitation): www.qubole.com • Come talk to us to join a small and passionate team: jobs@qubole.com • Follow us on twitter/facebook/linkedin
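
The scale figures on slides 4, 9 and 26 can be checked with a little arithmetic. The sketch below is illustrative only and is not from the talk: the function names are hypothetical, the mapping of warehouse sizes to years is read off the slide 9 chart, and the daily compressed estimate assumes new data compresses at the same ratio as the warehouse as a whole, which the slides do not state.

```python
# Figures quoted on the slides; everything derived from them is an estimate.
COMPRESSED_PB = 25               # slide 4
UNCOMPRESSED_PB = 150            # slide 4
DAILY_NEW_TB_UNCOMPRESSED = 400  # slide 4
DW_SIZE_TB = {2007: 15, 2008: 250, 2009: 800, 2010: 8000, 2011: 25000}  # slide 9


def compression_ratio(uncompressed, compressed):
    """Uncompressed size divided by compressed size."""
    return uncompressed / compressed


def replication_savings(old_factor, new_factor):
    """Fraction of raw disk saved when the effective replication factor drops."""
    return (old_factor - new_factor) / old_factor


def yearly_growth(sizes):
    """Map each year to its size divided by the previous year's size."""
    years = sorted(sizes)
    return {year: sizes[year] / sizes[prev] for prev, year in zip(years, years[1:])}


ratio = compression_ratio(UNCOMPRESSED_PB, COMPRESSED_PB)
# Assumption: new data compresses at the same ratio as the warehouse overall.
daily_compressed_tb = DAILY_NEW_TB_UNCOMPRESSED / ratio
# Slide 26: HDFS-RAID moves from 3 replicas to an effective 2.2 replicas.
raid_savings = replication_savings(3.0, 2.2)
growth = yearly_growth(DW_SIZE_TB)

print(f"compression ratio: {ratio:.1f}x")
print(f"estimated new compressed data: {daily_compressed_tb:.1f} TB/day")
print(f"raw disk saved by HDFS-RAID: {raid_savings:.1%}")
for year, factor in growth.items():
    print(f"{year}: {factor:.1f}x the previous year")
```

The replication saving is expressed as a fraction rather than in PB because the slides do not say whether the 25 PB figure is measured before or after replication.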