Wednesday, April 22, 2009
Socializing Big Data
         Lessons from the Hadoop Community



         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         April 22, 2009



My Background
         Thanks for Asking

         ▪ hammer@cloudera.com
         ▪ Studied Mathematics at Harvard
         ▪ Worked as a Quant on Wall Street
         ▪ Conceived, built, and led the Data team at Facebook
             ▪ Nearly 30 amazing engineers and data scientists
             ▪ Released Hive and Cassandra as open source projects
             ▪ Published research at conferences: SIGMOD, CHI, ICWSM
         ▪ Founder of Cloudera
             ▪ Building tools to make learning go faster, starting with Hadoop



Presentation Outline
         ▪ What is Hadoop?
         ▪ Hadoop at Facebook
             ▪ Brief history of the Facebook Data team
             ▪ Summary of how we used Hadoop
             ▪ Reasons for choosing Hadoop
         ▪ How is software built and adopted?
             ▪ “Laboratory Life”
             ▪ Social Learning Theory
             ▪ Organizations and tools in open source development
         ▪ Moving from the “Age of Data” to the “Age of Learning”




The Hadoop community is producing innovative, world-class software for web-scale data management and analysis.

By studying how software is built and adopted, we can enhance the rate at which data processing technologies evolve.

The Hadoop community is open to everyone and will play a central role in this evolution. You should join us!


What is Hadoop?
         Not Just a Stuffed Elephant

         ▪ Open source project, written mostly in Java
         ▪ Most active Apache Software Foundation project
         ▪ Inspired by Google infrastructure
         ▪ Over one hundred production deployments
         ▪ Project structure
             ▪ Hadoop Distributed File System (HDFS)
             ▪ Hadoop MapReduce
             ▪ Hadoop Core: client libraries and management tools
             ▪ Other subprojects: Avro, HBase, Hive, Pig, Zookeeper



Anatomy of a Hadoop Cluster
         ▪ Commodity servers
             ▪ 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
         ▪ Typically arranged in 2 level architecture
             ▪ 40 nodes per rack
         ▪ Inexpensive to acquire and maintain

         [Figure: Commodity Hardware Cluster. Typically in 2 level architecture; nodes are commodity Linux PCs; 40 nodes/rack]

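
As a back-of-the-envelope sketch of what the node specification above implies, the following computes raw and usable disk per rack. The 3x replication factor is an assumption (it is the usual HDFS default, not something this slide states), and the function name is made up for illustration.

```python
# Rough capacity estimate for the cluster layout described on this slide.
DISKS_PER_NODE = 4        # 4 x 1 TB SATA (from the slide)
TB_PER_DISK = 1
NODES_PER_RACK = 40       # from the slide
REPLICATION = 3           # assumption: HDFS default, not stated on this slide


def rack_capacity_tb(nodes=NODES_PER_RACK, replication=REPLICATION):
    """Return (raw_tb, usable_tb) for one rack, ignoring OS and temp space."""
    raw = nodes * DISKS_PER_NODE * TB_PER_DISK
    usable = raw / replication
    return raw, usable


raw_tb, usable_tb = rack_capacity_tb()
print(f"raw: {raw_tb} TB per rack, usable at {REPLICATION}x: {usable_tb:.1f} TB")
# prints: raw: 160 TB per rack, usable at 3x: 53.3 TB
```

The estimate ignores space reserved for the operating system and for intermediate job output, so real usable capacity would be somewhat lower.
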
HDFS
         ▪ Pool commodity servers into a single hierarchical namespace
         ▪ Break files into 128 MB blocks and replicate blocks
         ▪ Designed for large files written once but read many times
         ▪ Two main daemons: NameNode and DataNode
             ▪ NameNode manages filesystem metadata
             ▪ DataNode manages data using local filesystem
         ▪ HDFS manages checksumming, replication, and compression
         ▪ Throughput scales nearly linearly with cluster size
         ▪ Access from Java, C, command line, FUSE, or Thrift
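
To make the checksumming bullet concrete, here is a minimal sketch of per-chunk checksums: data is cut into fixed-size chunks, a CRC32 is stored for each chunk, and a reader recomputes them to detect corruption. The 512-byte chunk size and the helper names are assumptions chosen for illustration; this is not HDFS code.

```python
import zlib

CHUNK_SIZE = 512  # assumption: illustrative chunk size, in bytes


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Compute one CRC32 per fixed-size chunk of data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data: bytes, expected: list[int],
                        chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return indexes of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(range(256)) * 8          # 2048 bytes -> 4 chunks
stored = chunk_checksums(original)        # kept alongside the data

damaged = bytearray(original)
damaged[700] ^= 0xFF                      # flip bits inside the second chunk
bad_chunks = find_corrupt_chunks(bytes(damaged), stored)
print(bad_chunks)                         # prints: [1]
```

In a replicated store, a reader that finds a bad chunk can fall back to another replica of the same block, which is why checksumming and replication are usually discussed together.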




HDFS
         HDFS distributes file blocks among servers

         HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.

         [Figure 2: HDFS distributes file blocks among servers]

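
The block-and-replica idea can be sketched in a few lines. The 128 MB block size comes from the HDFS slide; the replication factor of three, the round-robin placement policy, and the server names are assumptions for illustration only. Real HDFS placement also takes rack topology and free space into account.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB blocks (from the HDFS slide)
REPLICATION = 3                  # assumption: three copies of each block


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block for a file of file_size bytes."""
    n_blocks = math.ceil(file_size / block_size)
    return [
        min(block_size, file_size - i * block_size)
        for i in range(n_blocks)
    ]


def place_replicas(n_blocks: int, servers: list[str],
                   replication: int = REPLICATION) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct servers, round-robin.

    Hypothetical placement policy, used here only to show that every
    block ends up on several different machines.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(n_blocks)
    }


blocks = split_into_blocks(300 * 1024 * 1024)          # a 300 MB file
placement = place_replicas(len(blocks), ["s1", "s2", "s3", "s4", "s5"])
for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

A 300 MB file yields two full 128 MB blocks and one 44 MB block, and losing any single server still leaves two copies of every block.
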
Hadoop MapReduce
         ▪ Fault-tolerant execution layer and API for parallel data processing
         ▪ Can target multiple storage systems
         ▪ Key/value data model
         ▪ Two main daemons: JobTracker and TaskTracker
         ▪ Three main phases: Map, Shuffle, and Reduce
         ▪ Growing sophistication for job and task scheduling
         ▪ Many client interfaces
             ▪ Java, C++, Streaming
             ▪ Pig, SQL (Hive QL)




MapReduce
         MapReduce pushes work out to the data

         Hadoop takes advantage of HDFS' data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.

         [Figure 3: Hadoop pushes work out to the data]

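
The Map, Shuffle, and Reduce phases and the key/value data model can be illustrated with a single-process word count. This is only a sketch: the function names are invented for this example, nothing here is the Hadoop API, and there is no distribution or fault tolerance.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three phases in order over an iterable of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = run_job(["big data", "Big elephant", "data data"])
print(counts)   # prints: {'big': 2, 'data': 3, 'elephant': 1}
```

In a real cluster each call to `map_phase` would run on a node that already holds the input block, which is the sense in which work is pushed out to the data.
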
Hadoop Subprojects
         ▪ Avro
             ▪ Cross-language framework for RPC
         ▪ HBase
             ▪ Table storage above HDFS, modeled after Google’s BigTable
         ▪ Hive
             ▪ SQL interface to structured data stored in HDFS
         ▪ Pig
             ▪ Language for data flow programming
         ▪ Zookeeper
             ▪ Coordination service for distributed systems



Hadoop at Yahoo!
         ▪ Jan 2006: Hired Doug Cutting
         ▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
         ▪ March 2008: Hadoop Summit attracted several hundred attendees
         ▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
         ▪ Aug 2008: Deployed 4,000 node Hadoop cluster
         ▪ Data Points
             ▪ Over 20,000 nodes running Hadoop
             ▪ Hundreds of thousands of jobs per day
             ▪ Typical HDFS cluster: 1,400 nodes, 2 PB capacity
             ▪ Largest shuffle is 450 TB
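
The two sort results above are easier to compare as throughput. The sketch below converts each into MB/s for the whole cluster and per node, assuming decimal units (1 TB = 1,000,000 MB). The helper name is made up, and because the two runs used different hardware and software versions, the ratio is only a rough indication of progress.

```python
def sort_throughput(terabytes: float, seconds: float, nodes: int):
    """Return (cluster MB/s, per-node MB/s), using 1 TB = 1,000,000 MB."""
    cluster = terabytes * 1_000_000 / seconds
    return cluster, cluster / nodes


apr_2006 = sort_throughput(1.9, 47 * 3600, 188)   # 1.9 TB, 47 hours, 188 nodes
apr_2008 = sort_throughput(1.0, 209, 910)         # 1 TB, 209 seconds, 910 nodes

speedup_per_node = apr_2008[1] / apr_2006[1]
print(f"2006: {apr_2006[0]:.1f} MB/s cluster, {apr_2006[1]:.3f} MB/s per node")
print(f"2008: {apr_2008[0]:.1f} MB/s cluster, {apr_2008[1]:.3f} MB/s per node")
print(f"per-node improvement: about {speedup_per_node:.0f}x")
```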




Facebook Before Hadoop
         Early 2006: The First Research Scientist

         ▪ Source data living on horizontally partitioned MySQL tier
         ▪ Intensive historical analysis difficult
         ▪ No way to assess impact of changes to the site

         ▪ First try: Python scripts pull data into MySQL
         ▪ Second try: Python scripts pull data into Oracle

         ▪ ...and then we turned on impression logging




Facebook Data Infrastructure
         2007

         [Diagram, top to bottom: Scribe Tier and MySQL Tier; Data Collection Server; Oracle Database Server]




Facebook Data Infrastructure
         2008

         [Diagram, top to bottom: Scribe Tier and MySQL Tier; Hadoop Tier; Oracle RAC Servers]




Facebook Workloads
         ▪ Data collection
             ▪ server logs
             ▪ application databases
             ▪ web crawls
         ▪ Thousands of multi-stage processing pipelines
             ▪ Summaries consumed by external users
             ▪ Summaries for internal reporting
             ▪ Ad optimization pipeline
             ▪ Experimentation platform pipeline
         ▪ Ad hoc analyses




Facebook Hadoop Statistics
         ▪ Over 700 servers running Hadoop in one data center
         ▪ 2.5 PB in largest Hadoop cluster
         ▪ 15 TB loaded into Hadoop cluster each day
         ▪ 4,000 MapReduce jobs with 800,000 tasks run per day
         ▪ 55 TB of data processed per day
         ▪ 15 TB of additional data produced from cluster activity per day

         ▪ Hadoop cluster not retiring data!
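
A few derived figures help put these statistics in proportion. The sketch below uses only the numbers on this slide and decimal units (1 TB = 1,000 GB). The variable names are illustrative, and the slide does not say whether the volumes are measured before or after replication, so the growth figure is a rough estimate.

```python
# Inputs taken directly from the slide.
JOBS_PER_DAY = 4_000
TASKS_PER_DAY = 800_000
TB_PROCESSED_PER_DAY = 55
TB_LOADED_PER_DAY = 15
TB_PRODUCED_PER_DAY = 15

# Average job shape.
tasks_per_job = TASKS_PER_DAY / JOBS_PER_DAY
gb_per_job = TB_PROCESSED_PER_DAY * 1_000 / JOBS_PER_DAY
mb_per_task = TB_PROCESSED_PER_DAY * 1_000_000 / TASKS_PER_DAY

# With no data being retired, stored volume grows by loaded + produced data.
tb_growth_per_day = TB_LOADED_PER_DAY + TB_PRODUCED_PER_DAY
pb_growth_per_year = tb_growth_per_day * 365 / 1_000

print(f"{tasks_per_job:.0f} tasks per job on average")
print(f"{gb_per_job:.2f} GB processed per job, {mb_per_task:.2f} MB per task")
print(f"{tb_growth_per_day} TB/day growth, about {pb_growth_per_year:.2f} PB/year")
```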




Why Did Facebook Choose Hadoop?
         1. Demonstrated effectiveness for primary workload
         2. Proven ability to scale past any commercial vendor
         3. Easy provisioning and capacity planning with commodity nodes
         4. Data access for engineers and business analysts
         5. Single system to manage XML, JSON, text, and relational data
         6. No schemas enabled data collection without involving Data team
         7. Cost of software: zero dollars
         8. Deep commitment to continued development from Yahoo!
         9. Active user and developer community
         10. Apache-licensed open source code

Hadoop Community Support
         People Build Technology

         ▪ Most active Apache mailing lists
         ▪ Detailed official documentation per release
         ▪ Three books this year: O’Reilly, Apress, Manning
         ▪ Free training videos online
         ▪ Regular user group meetings in many cities
         ▪ University courses across the world
         ▪ Growing consultant and sys integrator expertise
         ▪ Commercial training and support from Cloudera




How Software is Built
         Methodological Reflexivity

         ▪ Latour and Woolgar’s “Laboratory Life”
             ▪ Study scientists doing science
             ▪ Use “thick descriptions” and focus on “microconcerns”
         ▪ Some studies of closed and open source development exist
             ▪ “Mythical Man Month”, “Cathedral and the Bazaar”
             ▪ Hertel et al. surveyed 141 Linux kernel developers
         ▪ Focus on the people creating code
         ▪ Less religion, more empirical analyses
         ▪ Build tools to facilitate interaction and output



Building Open Source Software
         Structural Conditions for Success

         ▪ Moon and Sproull proposed some rules for successful projects
             ▪ Authority comes from competence
             ▪ Leaders have clear responsibilities and delegate often
             ▪ The code has a modular structure
             ▪ Establish a parallel release policy: stable and experimental
             ▪ Give credit to non-source contributions, e.g. documentation
             ▪ Communicate clear rules and norms for community online
             ▪ Use simple and reliable communication tools




Building Software Faster
         Consolidate Best Practices

         ▪ Javascript frameworks starting to converge
             ▪ Many adopting jQuery’s selector syntax
             ▪ Significant benchmarks emerging
         ▪ Web frameworks push idioms into project structure
             ▪ What would be the Rails/Django equivalent for data storage?
             ▪ Reusable components also nice, e.g. log structured merge trees
             ▪ Compare work on BOOM, RodentStore
         ▪ Debian distributes release note writing responsibility via “beats”




Complications of Open Source
         ▪ Intellectual property
             ▪ Trademark, Copyright, Patent, and Trade Secret
             ▪ Litigation history
         ▪ Business models and foundations to ensure long-term support
             ▪ Direct support: Red Hat, MySQL
             ▪ Indirect support: LLVM, GSoC
             ▪ Foundations: Apache, Python, Django
         ▪ Diversity of licenses
             ▪ Licenses form communities
             ▪ Licenses change over time (cf. Rambus BSD incident)




How Software is Adopted
         Choosing the Right Tool for the Job

         ▪ Must be aware that a software project exists
             ▪ Tools like GitHub, Ohloh, Launchpad
             ▪ Sites like Reddit and Hacker News
         ▪ Existing example use cases are critical
             ▪ At Facebook, we studied motivations for content production
             ▪ Especially effective: Bandura’s “Social Learning Theory”
             ▪ Hadoop being run in production at Yahoo! and Facebook
         ▪ Active user communities and great documentation




Open Learning
         Open Data, Hypotheses and Workflows

         ▪ In science, data is generated once and analyzed many times
             ▪ IceCube
             ▪ LHC
         ▪ Lots of places where data and visualizations get shared
             ▪ data.gov, Many Eyes, Swivel, theinfo.org, InfoChimps, iCharts
         ▪ Record which hypotheses and workflows have been applied
         ▪ Increase diversity of questions asked and applications built
         ▪ Analysis skills unevenly distributed; send skills to the data!




The Future of Data Processing
         Hadoop, the Browser, and Collaboration

         ▪ “The Unreasonable Effectiveness of Data”, “MAD Skills”
         ▪ Single namespace for your organization’s bits
         ▪ Single engine for distributed data processing
         ▪ Materialization of structured subsets into optimized stores
         ▪ Browser as client interface with focus on user experience
         ▪ The system gets better over time using workload information
         ▪ Cloning and sharing of common libraries and workflows
         ▪ Global metadata store driving collection, analysis, and reporting
         ▪ Version control within and between sites, cf. Orchestra



(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0




Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

20090422 Www

  • 2. Socializing Big Data Lessons from the Hadoop Community Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera April 22, 2009 Wednesday, April 22, 2009
  • 3. My Background: Thanks for Asking
      ▪ hammer@cloudera.com
      ▪ Studied Mathematics at Harvard
      ▪ Worked as a Quant on Wall Street
      ▪ Conceived, built, and led the Data team at Facebook
          ▪ Nearly 30 amazing engineers and data scientists
          ▪ Released Hive and Cassandra as open source projects
          ▪ Published research at conferences: SIGMOD, CHI, ICWSM
      ▪ Founder of Cloudera
          ▪ Building tools to make learning go faster, starting with Hadoop
  • 4. Presentation Outline
      ▪ What is Hadoop?
      ▪ Hadoop at Facebook
          ▪ Brief history of the Facebook Data team
          ▪ Summary of how we used Hadoop
          ▪ Reasons for choosing Hadoop
      ▪ How is software built and adopted?
          ▪ “Laboratory Life”
          ▪ Social Learning Theory
          ▪ Organizations and tools in open source development
      ▪ Moving from the “Age of Data” to the “Age of Learning”
  • 5. The Hadoop community is producing innovative, world-class software for web-scale data management and analysis. By studying how software is built and adopted, we can enhance the rate at which data processing technologies evolve. The Hadoop community is open to everyone and will play a central role in this evolution. You should join us!
  • 6. What is Hadoop? Not Just a Stuffed Elephant
      ▪ Open source project, written mostly in Java
      ▪ Most active Apache Software Foundation project
      ▪ Inspired by Google infrastructure
      ▪ Over one hundred production deployments
      ▪ Project structure
          ▪ Hadoop Distributed File System (HDFS)
          ▪ Hadoop MapReduce
          ▪ Hadoop Core: client libraries and management tools
          ▪ Other subprojects: Avro, HBase, Hive, Pig, Zookeeper
  • 7. Anatomy of a Hadoop Cluster
      ▪ Commodity servers
          ▪ 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
      ▪ Typically arranged in 2 level architecture
          ▪ 40 nodes per rack
      ▪ Inexpensive to acquire and maintain
      [Figure: Commodity Hardware Cluster. Nodes are commodity Linux PCs.]
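The per-node figures on this slide can be rolled up into per-rack totals. The sketch below is only back-of-the-envelope arithmetic: the node specification and the 40 nodes per rack come from the slide, while the replication factor of three and the names `rack_totals` and `ASSUMED_REPLICATION` are assumptions introduced for illustration.

```python
# Per-node figures are taken from the slide; the replication factor is an assumption.
NODES_PER_RACK = 40
CORES_PER_NODE = 2 * 4      # 2 CPUs x 4 cores each
RAM_GB_PER_NODE = 8
DISK_TB_PER_NODE = 4 * 1    # 4 x 1 TB SATA drives
ASSUMED_REPLICATION = 3     # assumption: every block is stored three times


def rack_totals(nodes=NODES_PER_RACK, replication=ASSUMED_REPLICATION):
    """Return aggregate resources for one rack of identical nodes."""
    raw_tb = nodes * DISK_TB_PER_NODE
    return {
        "cores": nodes * CORES_PER_NODE,
        "ram_gb": nodes * RAM_GB_PER_NODE,
        "raw_disk_tb": raw_tb,
        # Usable space shrinks by the replication factor.
        "usable_disk_tb": raw_tb / replication,
    }


if __name__ == "__main__":
    for name, value in rack_totals().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions one rack holds 160 TB of raw disk, of which roughly a third is usable once each block is replicated.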
  • 8. HDFS
      ▪ Pool commodity servers into a single hierarchical namespace
      ▪ Break files into 128 MB blocks and replicate blocks
      ▪ Designed for large files written once but read many times
      ▪ Two main daemons: NameNode and DataNode
          ▪ NameNode manages filesystem metadata
          ▪ DataNode manages data using local filesystem
      ▪ HDFS manages checksumming, replication, and compression
      ▪ Throughput scales nearly linearly with cluster size
      ▪ Access from Java, C, command line, FUSE, or Thrift
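To make "break files into 128 MB blocks" concrete, here is a minimal sketch of the splitting arithmetic. It is not HDFS code; the function name `split_into_blocks` is hypothetical, and only the 128 MB block size comes from the slide.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The final block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    mb = 1024 * 1024
    for offset, length in split_into_blocks(300 * mb):
        print(f"offset={offset // mb} MB length={length // mb} MB")
```

A 300 MB file, for example, yields two full 128 MB blocks and one 44 MB tail block.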
  • 9. HDFS
      HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.
      [Figure: HDFS distributes file blocks among servers]
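A toy sketch of the block distribution this slide illustrates, assuming three replicas per block on three different servers. The round-robin placement here is a deliberate simplification and not HDFS's actual placement policy; `place_replicas` and `blocks_lost` are hypothetical names.

```python
def place_replicas(num_blocks, servers, replication=3):
    """Assign each block index to `replication` distinct servers, round-robin."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for block in range(num_blocks):
        # Consecutive positions modulo len(servers) are distinct
        # because replication <= len(servers).
        placement[block] = [
            servers[(block + i) % len(servers)] for i in range(replication)
        ]
    return placement


def blocks_lost(placement, failed_servers):
    """Return the blocks whose every replica lived on a failed server."""
    failed = set(failed_servers)
    return [b for b, holders in placement.items() if set(holders) <= failed]


if __name__ == "__main__":
    servers = ["s1", "s2", "s3", "s4", "s5"]
    placement = place_replicas(5, servers)
    for block, holders in placement.items():
        print(block, holders)
    print("lost after s1 and s2 fail:", blocks_lost(placement, ["s1", "s2"]))
```

With three replicas, a block becomes unavailable only when all three servers holding it fail, which is why losing any two servers in this example loses no data.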
  • 10. Hadoop MapReduce
      ▪ Fault-tolerant execution layer and API for parallel data processing
      ▪ Can target multiple storage systems
      ▪ Key/value data model
      ▪ Two main daemons: JobTracker and TaskTracker
      ▪ Three main phases: Map, Shuffle, and Reduce
      ▪ Growing sophistication for job and task scheduling
      ▪ Many client interfaces
          ▪ Java, C++, Streaming
          ▪ Pig, SQL (Hive QL)
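The three phases and the key/value data model can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not the Hadoop API; every function name below is hypothetical, and word counting is used simply because it is the customary example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply mapper to each record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key, as the shuffle does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call reducer once per key with all of that key's values."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines, word_mapper)), sum_reducer)


if __name__ == "__main__":
    print(word_count(["Hadoop at Yahoo", "Hadoop at Facebook"]))
```

In a real cluster the map and reduce calls run as separate tasks on many machines; the sketch keeps only the data flow between the phases.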
  • 11. MapReduce
      Hadoop takes advantage of HDFS' data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
      [Figure: MapReduce pushes work out to the data]
  • 12. Hadoop Subprojects
      ▪ Avro
          ▪ Cross-language framework for RPC
      ▪ HBase
          ▪ Table storage above HDFS, modeled after Google’s BigTable
      ▪ Hive
          ▪ SQL interface to structured data stored in HDFS
      ▪ Pig
          ▪ Language for data flow programming
      ▪ Zookeeper
          ▪ Coordination service for distributed systems
  • 13. Hadoop at Yahoo!
      ▪ Jan 2006: Hired Doug Cutting
      ▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
      ▪ March 2008: Hadoop Summit attracted several hundred attendees
      ▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
      ▪ Aug 2008: Deployed 4,000 node Hadoop cluster
      ▪ Data Points
          ▪ Over 20,000 nodes running Hadoop
          ▪ Hundreds of thousands of jobs per day
          ▪ Typical HDFS cluster: 1,400 nodes, 2 PB capacity
          ▪ Largest shuffle is 450 TB
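The two sort results on this slide are easier to compare as throughput. The sketch below does that arithmetic; the data sizes, node counts, and durations come from the slide, while the decimal unit conversion (1 TB = 1,000,000 MB) and the name `sort_throughput` are assumptions.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units

# Figures quoted on the slide.
RUN_2006 = {"terabytes": 1.9, "nodes": 188, "seconds": 47 * 3600}
RUN_2008 = {"terabytes": 1.0, "nodes": 910, "seconds": 209}


def sort_throughput(terabytes, nodes, seconds):
    """Return (cluster MB/s, per-node MB/s) for one sort run."""
    cluster_mb_per_s = terabytes * MB_PER_TB / seconds
    return cluster_mb_per_s, cluster_mb_per_s / nodes


if __name__ == "__main__":
    for label, run in (("Apr 2006", RUN_2006), ("Apr 2008", RUN_2008)):
        cluster, per_node = sort_throughput(**run)
        print(f"{label}: {cluster:.1f} MB/s cluster, {per_node:.3f} MB/s per node")
```

Under these assumptions the 2008 run sorts at several thousand MB/s across the cluster versus roughly 11 MB/s in 2006, so the gain comes from per-node throughput as well as from the larger node count.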
  • 14. Facebook Before Hadoop: Early 2006, The First Research Scientist
      ▪ Source data living on horizontally partitioned MySQL tier
      ▪ Intensive historical analysis difficult
      ▪ No way to assess impact of changes to the site
      ▪ First try: Python scripts pull data into MySQL
      ▪ Second try: Python scripts pull data into Oracle
      ▪ ...and then we turned on impression logging
  • 15. Facebook Data Infrastructure, 2007 [Diagram labels: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
  • 16. Facebook Data Infrastructure, 2008 [Diagram labels: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]
  • 17. Facebook Workloads
      ▪ Data collection
          ▪ server logs
          ▪ application databases
          ▪ web crawls
      ▪ Thousands of multi-stage processing pipelines
          ▪ Summaries consumed by external users
          ▪ Summaries for internal reporting
          ▪ Ad optimization pipeline
          ▪ Experimentation platform pipeline
      ▪ Ad hoc analyses
  • 18. Facebook Hadoop Statistics
      ▪ Over 700 servers running Hadoop in one data center
      ▪ 2.5 PB in largest Hadoop cluster
      ▪ 15 TB loaded into Hadoop cluster each day
      ▪ 4,000 MapReduce jobs with 800,000 tasks run per day
      ▪ 55 TB of data processed per day
      ▪ 15 TB of additional data produced from cluster activity per day
      ▪ Hadoop cluster not retiring data!
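A few averages follow directly from these daily figures. The sketch below is plain arithmetic on the numbers quoted on the slide; it assumes decimal units (1 TB = 1,000 GB) and that the 15 TB loaded and the 15 TB produced each day do not overlap, neither of which the slide states. The name `daily_averages` is hypothetical.

```python
# Daily figures quoted on the slide.
JOBS_PER_DAY = 4_000
TASKS_PER_DAY = 800_000
TB_PROCESSED_PER_DAY = 55
TB_LOADED_PER_DAY = 15
TB_PRODUCED_PER_DAY = 15

GB_PER_TB = 1_000  # assumption: decimal units


def daily_averages():
    """Derive per-job averages and daily storage growth from the slide's figures."""
    return {
        "tasks_per_job": TASKS_PER_DAY / JOBS_PER_DAY,
        "gb_processed_per_job": TB_PROCESSED_PER_DAY * GB_PER_TB / JOBS_PER_DAY,
        # Assumption: loaded and produced data are disjoint, and nothing is
        # retired, so both add to the cluster every day.
        "tb_added_per_day": TB_LOADED_PER_DAY + TB_PRODUCED_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in daily_averages().items():
        print(f"{name}: {value}")
```

On these assumptions an average job runs 200 tasks over about 13.75 GB, and the cluster grows by about 30 TB per day.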
  • 19. Why Did Facebook Choose Hadoop?
      1. Demonstrated effectiveness for primary workload
      2. Proven ability to scale past any commercial vendor
      3. Easy provisioning and capacity planning with commodity nodes
      4. Data access for engineers and business analysts
      5. Single system to manage XML, JSON, text, and relational data
      6. No schemas enabled data collection without involving Data team
      7. Cost of software: zero dollars
      8. Deep commitment to continued development from Yahoo!
      9. Active user and developer community
      10. Apache-licensed open source code
  • 20. Hadoop Community Support: People Build Technology
      ▪ Most active Apache mailing lists
      ▪ Detailed official documentation per release
      ▪ Three books this year: O’Reilly, Apress, Manning
      ▪ Free training videos online
      ▪ Regular user group meetings in many cities
      ▪ University courses across the world
      ▪ Growing consultant and systems integrator expertise
      ▪ Commercial training and support from Cloudera
  • 21. How Software is Built: Methodological Reflexivity
      ▪ Latour and Woolgar’s “Laboratory Life”
          ▪ Study scientists doing science
          ▪ Use “thick descriptions” and focus on “microconcerns”
      ▪ Some studies of closed and open source development exist
          ▪ “Mythical Man Month”, “Cathedral and the Bazaar”
          ▪ Hertel et al. surveyed 141 Linux kernel developers
      ▪ Focus on the people creating code
          ▪ Less religion, more empirical analyses
          ▪ Build tools to facilitate interaction and output
  • 22. Building Open Source Software: Structural Conditions for Success
      ▪ Moon and Sproull proposed some rules for successful projects
          ▪ Authority comes from competence
          ▪ Leaders have clear responsibilities and delegate often
          ▪ The code has a modular structure
          ▪ Establish a parallel release policy: stable and experimental
          ▪ Give credit to non-source contributions, e.g. documentation
          ▪ Communicate clear rules and norms for community online
          ▪ Use simple and reliable communication tools
  • 23. Building Software Faster: Consolidate Best Practices
      ▪ JavaScript frameworks starting to converge
          ▪ Many adopting jQuery’s selector syntax
          ▪ Significant benchmarks emerging
      ▪ Web frameworks push idioms into project structure
          ▪ What would be the Rails/Django equivalent for data storage?
      ▪ Reusable components also nice, e.g. log structured merge trees
          ▪ Compare work on BOOM, RodentStore
      ▪ Debian distributes release note writing responsibility via “beats”
  • 24. Complications of Open Source
      ▪ Intellectual property
          ▪ Trademark, Copyright, Patent, and Trade Secret
          ▪ Litigation history
      ▪ Business models and foundations to ensure long-term support
          ▪ Direct support: Red Hat, MySQL
          ▪ Indirect support: LLVM, GSoC
          ▪ Foundations: Apache, Python, Django
      ▪ Diversity of licenses
          ▪ Licenses form communities
          ▪ Licenses change over time (cf. Rambus BSD incident)
  • 25. How Software is Adopted: Choosing the Right Tool for the Job
      ▪ Must be aware that a software project exists
          ▪ Tools like GitHub, Ohloh, Launchpad
          ▪ Sites like Reddit and Hacker News
      ▪ Existing example use cases are critical
          ▪ At Facebook, we studied motivations for content production
          ▪ Especially effective: Bandura’s “Social Learning Theory”
          ▪ Hadoop being run in production at Yahoo! and Facebook
          ▪ Active user communities and great documentation
  • 26. Open Learning: Open Data, Hypotheses and Workflows
      ▪ In science, data is generated once and analyzed many times
          ▪ IceCube
          ▪ LHC
      ▪ Lots of places where data and visualizations get shared
          ▪ data.gov, Many Eyes, Swivel, theinfo.org, InfoChimps, iCharts
      ▪ Record which hypotheses and workflows have been applied
      ▪ Increase diversity of questions asked and applications built
      ▪ Analysis skills unevenly distributed; send skills to the data!
  • 27. The Future of Data Processing: Hadoop, the Browser, and Collaboration
      ▪ “The Unreasonable Effectiveness of Data”, “MAD Skills”
      ▪ Single namespace for your organization’s bits
      ▪ Single engine for distributed data processing
      ▪ Materialization of structured subsets into optimized stores
      ▪ Browser as client interface with focus on user experience
      ▪ The system gets better over time using workload information
      ▪ Cloning and sharing of common libraries and workflows
      ▪ Global metadata store driving collection, analysis, and reporting
      ▪ Version control within and between sites, cf. Orchestra
  • 28. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0