DevOps: Empowering Developers with Infrastructure

SXSW 2013 – Tuesday, March 12

            Go here: http://infochim.ps/15INnv8

                    Nathan Eliot - @temujin9
                    Ryan Miller - @rmiller107
                Amanda McGuckin-Hager - @shoogie
                    Tim Gasper - @timgasper

3/12/2013          #ironfan   #devops   #sxsw   #bigdata   #chef   1
Agenda

             http://infochim.ps/15INnv8
1. Intros - Housekeeping                                          (15 min – 15 total)
2. Initial Setup                                                  (30 min – 45 total)
3. Debug Initial Setup                                            (30-45 min – 1:15 total)
4. Standing Up a Simple Cluster                                   (30-60 min – 2:15 total)
5. Hadoop!                                                        (30-60 min – 3:15 total)
6. General Q&A                                                    (30-60 min – 4:00 total)




Key Ironfan Contributors
• Flip Kromer, @mrflip
  – CTO of Infochimps
• Nathaniel Eliot, @temujin9
  – Ops Engineer of Infochimps
• Chris Howe
  – System Architect at Civitas Learning
Infochimps Enterprise Cloud for Big Data
CUSTOMER APPLICATIONS
• Custom Applications (Java, Python, etc.)
• Business Intelligence (Cognos, BOBJ, MicroStrategy)
• Packaged Apps (ERP, CRM, etc.)
Why We Love Chef
• Infrastructure as Code
      – Version Control
      – Shareable
      – Testable
      – Recapitulable




Why We Love Chef

[Diagram: MySQL, Nginx, SOLR; My Application]
Why We Don’t Love Chef
• Anything is possible
• Nothing is simple
• There’s too much repetition (not DRY)
Why We Don’t Love Chef




            Too much is hard-coded at development/upload time!

Why We Don’t Love Chef




            How do we make @server_ips dynamic?




Why We Wrote Ironfan
• Simplify, unify, and
  standardize our usage
  of the Chef toolset
• Build further
  abstractions on top of
  Chef
• Give us superpowers
  that Chef doesn’t have
  yet
            http://github.com/infochimps-labs/ironfan
What Does Ironfan Do




Simple helpers in the silverware cookbook abstract common Chef patterns and keep things DRY.

[Diagram labels: Ironfan, Chef]
What Does Ironfan Do

Dynamic service discovery:
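
One way to picture this, and to answer the earlier question about making @server_ips dynamic, is a registry that nodes write to and that configuration code reads from at run time. The sketch below is a toy, in-memory illustration only. ServiceRegistry, announce, and discover are hypothetical names chosen for this example and should not be read as the actual Ironfan or silverware API; the IP addresses are made up.

```ruby
# Toy in-memory registry (an assumption for illustration); a real setup
# would presumably ask a shared store such as the Chef server instead.
class ServiceRegistry
  def initialize
    @services = Hash.new { |hash, key| hash[key] = [] }
  end

  # A node records that it provides a service at the given IP.
  # Announcing the same IP twice has no further effect.
  def announce(service, ip)
    @services[service] << ip unless @services[service].include?(ip)
    self
  end

  # Return a copy of the IPs announced for a service (empty if none).
  def discover(service)
    @services.key?(service) ? @services[service].dup : []
  end
end

registry = ServiceRegistry.new
registry.announce(:mysql, "10.0.0.11")
registry.announce(:mysql, "10.0.0.12")

# The list is looked up when the configuration is rendered,
# rather than typed into a file at development/upload time.
server_ips = registry.discover(:mysql)
puts server_ips.join(", ")
```

The point of the design is that adding a third database node only requires that node to announce itself; nothing that consumes the list has to be edited and re-uploaded.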




What Does Ironfan Do

A simple DSL for defining clusters of machines.
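
To give a feel for what a cluster-definition DSL can look like, here is a minimal, self-contained Ruby sketch. It is not Ironfan's syntax: the cluster, facet, instances, and role methods and the machine-naming scheme are assumptions made up for this example. It only shows the general technique of evaluating a block against a builder object.

```ruby
# Hypothetical mini-DSL; names are illustrative, not Ironfan's actual syntax.
class Facet
  attr_reader :name, :roles

  def initialize(name)
    @name = name
    @count = 1
    @roles = []
  end

  # With an argument: set the machine count. Without: read it.
  def instances(count = nil)
    @count = count unless count.nil?
    @count
  end

  # Attach a role name to every machine in this facet.
  def role(role_name)
    @roles << role_name
  end
end

class Cluster
  attr_reader :name, :facets

  def initialize(name, &block)
    @name = name
    @facets = []
    instance_eval(&block) if block
  end

  # Define a group of identical machines inside the cluster.
  def facet(facet_name, &block)
    new_facet = Facet.new(facet_name)
    new_facet.instance_eval(&block) if block
    @facets << new_facet
    new_facet
  end

  # Expand the definition into one machine name per instance.
  def machine_names
    @facets.flat_map do |f|
      (0...f.instances).map { |index| "#{name}-#{f.name}-#{index}" }
    end
  end
end

def cluster(name, &block)
  Cluster.new(name, &block)
end

demo = cluster(:demo) do
  facet :web do
    instances 2
    role :nginx
  end
  facet :db do
    instances 1
    role :mysql
  end
end

puts demo.machine_names
```

The description stays declarative: the block says what the cluster should contain, and a separate step (here, machine_names) turns that into concrete machines.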




Big Data for Chimps




            May 2013



As we walk through Ironfan…
• Shortlink: http://infochim.ps/15INnv8

FYI
• We are hiring! (We have offices in Austin & SF.)
      – careers@infochimps.com
      – infochimps.com/careers
• Learn more about our enterprise product:
      – sales@infochimps.com

SXSWi Workshop: DevOps - Infrastructure as Code


Editor's notes

  1. Table of Contents: Part I. Big Data for Chimps
     1. Hello, Early Releasers
        – My Questions for You
        – Probable Contents
        – Not Contents
        – Feedback
     2. About
        – What this book covers
        – Who this book is for
        – Who this book is not for
        – How this book is being written
     3. Hello, Reviewers
        – Controversials
        – Style Nits
     4. First Exploration (ch. A)
        – Where is Barbecue?
        – First Steps
        – Why?
        – Plot of this story
        – Exemplars and Touchstones
        – Data and features
        – Summarize every page on Wikipedia
        – Summarize every page on Wikipedia
        – Bin by Location
        – A pause, to think
        – Pulling signal from noise
        – Takeaways
     5. The Stream (ch. B)
        – Exercises
        – Exercise 1.1: Running time
        – Exercise 1.2: A Petabyte-scale wc command
     6. Reshape Steps (ch. C)
        – Locality of Reference
        – Locality: Examples
        – The Hadoop Haiku
     7. Chimpanzee and Elephant Save Christmas (ch. D)
        – A Non-scalable approach
        – Letters to Toy Requests
        – Order Delivery
        – Toy Assembly
        – Why it’s efficient
        – Sorted Batches
        – The Map-Reduce Haiku
        – The Reducer Guarantee
        – Partition Key and Sort Key
     8. Geo Data
        – Spatial Data
        – Geographic Data Model
        – Geospatial JOIN using quadtiles
        – The Quadtile Grid System
        – Patterns in UFO Sightings
        – Mapper: dispatch objects to rendezvous at quadtiles
        – Reducer: combine objects on each quadtile
        – Comparing Distributions
        – Data Model
        – GeoJSON
        – Quadtile Practicalities
        – Converting points to quadkeys (quadtile indexes)
        – Exploration
        – Interesting quadtile properties
        – Quadtile Ready Reference
        – Working with paths
        – Calculating Distances
        – Distributing Boundaries and Regions to Grid Cells
        – Adaptive Grid Size
        – Tree structure of Quadtile indexing
        – Map Polygons to Grid Tiles
        – Weather Near You
        – Find the Voronoi Polygon for each Weather Station
        – Break polygons on quadtiles
        – Map Observations to Grid Cells
        – K-means clustering to summarize
        – Keep Exploring
        – Exercises
        – References
     9. Log Processing
        – Data Model
        – Simple Log Parsing
        – Parser script
        – Histograms
        – User Paths through the site (“Sessionizing”)
        – Page-Page similarity
        – Geo-IP Matching
        – Range Queries
        – Using Hadoop for website stress testing (“Benign DDoS”)
     10. Why Hadoop Works
        – Disk is the new tape
        – Hadoop is Secretly Fun
        – Economics
        – Notes
     11. Sampling
        – Consistent Random Sampling
        – Random Sampling using strides
        – Constant-Memory “Reservoir” Sampling
        – References
     12. Hadoop Execution in Detail
        – Launch
        – Split
        – Mappers
        – Choosing a file size
        – Jobs with Map and Reduce
        – Mapper-only jobs
     13. Pathology of Tuning (aka “when you should touch that dial”)
        – Mapper
        – A few map tasks take noticeably longer than all the rest
        – Tons of tiny little mappers
        – Many non-local mappers
        – Map tasks “spill” multiple times
        – Job output files that are each slightly larger than an HDFS block
        – Reducer
        – Tons of data to a few reducers (high skew)
        – Reducer merge (sort+shuffle) is longer than Reducer processing
        – Output Commit phase is longer than Reducer processing
        – Way more total data to reducers than cumulative cluster RAM
        – System
        – Excessive Swapping
        – Out of Memory / No C+B reserve
        – Stop-the-world (STW) Garbage collections
        – Checklist
        – Other
        – Basic Checks
     14. Hadoop Metrics
        – The USE Method applied to Hadoop
        – Look for the Bounding Resource
        – Resource List
        – See What’s Happening
        – JMX (Java Monitoring Extensions)
        – Rough notes
     15. Data Formats and Schemata
        – Good Format 1: TSV (It’s simple)
        – Good Format 2: JSON (It’s Generic and Ubiquitous)
        – structured to model.
        – Good Format #3: Avro (It does everything right)
        – Other reasonable choices: tagged net strings and null-delimited documents
        – Crap format #1: XML
        – Writing XML
        – Crap Format #2: N3 triples
        – Crap Format #3: Flat format
        – Web log and Regexpable
        – Glyphing (string encoding), Unicode, UTF-8
        – ICSS
        – Schema.org Types
        – Munging
     16. HBase Data Model
        – Row Key, Column Family, Column Qualifier, Timestamp, Value
        – Keep it Stupidly Simple
        – Help HBase be Lazy
        – Row Locality and Compression
        – Simple Table
        – Airport Metadata
        – Airport Timezone
        – Range Lookup
        – Geographic Data
        – Multi-scale indexing
        – Wikipedia: Corpus and Graph
        – Graph Data
        – Web Logs: Rows-As-Columns
        – Column Families
        – Atomic Counters
        – Most-Frequent URLs
        – Most-Recent URLs
        – Rollup columns
        – Row Locality
        – adjacency is good
        – adjacency is bad
        – Vertical Partitioning (Column Families)
        – Feature Set review
        – “Design for Reads”
        – References
     17. Semi-Structured Data
        – Wikipedia Metadata
        – Wikipedia Pageview Stats (importing TSV)
        – Assembling the namespace join table
        – Getting file metadata in a Wukong (or any Hadoop streaming) Script
        – Wikipedia Article Metadata (importing a SQL Dump)
        – Necessary Bullcrap #76: Bad encoding
        – Wikipedia Page Graph
        – Target Domain Models
        – XML Data (Wikipedia Corpus)
        – Extract, Translate, Canonicalize