DevOps: Empowering Developers
           with Infrastructure

               SXSW 2013 – Tuesday, March 12

            Go here: http://infochim.ps/15INnv8

                    Nathan Eliot - @temujin9
                    Ryan Miller - @rmiller107
                Amanda McGuckin-Hager - @shoogie
                    Tim Gasper - @timgasper

3/12/2013          #ironfan   #devops   #sxsw   #bigdata   #chef   1
Agenda

             http://infochim.ps/15INnv8
1. Intros - Housekeeping                                          (15 min – 15 total)
2. Initial Setup                                                  (30 min – 45 total)
3. Debug Initial Setup                                            (30-45 min – 1:15 total)
4. Standing Up a Simple Cluster                                   (30-60 min – 2:15 total)
5. Hadoop!                                                        (30-60 min – 3:15 total)
6. General Q&A                                                    (30-60 min – 4:00 total)




Key Ironfan Contributors
• Flip Kromer, @mrflip
  – CTO of Infochimps
• Nathaniel Eliot, @temujin9
  – Ops Engineer at Infochimps
• Chris Howe
  – System Architect at Civitas Learning
Infochimps Enterprise Cloud for Big Data
                    CUSTOMER APPLICATIONS

            Custom Applications             Business Intelligence              Packaged Apps
               (Java, Python, etc.)          (Cognos, BOBJ, Microstrategy)        (ERP, CRM, etc.)




Why We Love Chef
• Infrastructure as Code
      – Version Control
      – Shareable
      – Testable
      – Recapitulable
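
To make "infrastructure as code" concrete, here is a minimal plain-Ruby sketch. It is not Chef itself: the `DesiredFile` struct and the `converge` method are hypothetical names chosen for illustration. Desired state is declared as data, and the converge step changes the system only where it differs from the declaration, so a second run does nothing.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a declarative resource: a file that should
# exist with exactly this content.
DesiredFile = Struct.new(:path, :content)

# Apply one resource. Returns :updated if a change was made, or
# :unchanged if the system already matched the declaration.
def converge(resource)
  if File.exist?(resource.path) && File.read(resource.path) == resource.content
    :unchanged
  else
    File.write(resource.path, resource.content)
    :updated
  end
end

Dir.mktmpdir do |dir|
  motd = DesiredFile.new(File.join(dir, 'motd'), "Welcome to the cluster\n")
  first  = converge(motd)   # file is missing, so it gets written
  second = converge(motd)   # already matches, so nothing happens
  puts "first run: #{first}, second run: #{second}"
end
```

Because the declaration is ordinary code, it can be committed, shared, and exercised in a test like any other source file.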




            [Diagram labels: MySQL, Nginx, SOLR; My Application]
Why We Don’t Love Chef
• Anything is possible
• Nothing is simple
• There’s too much
  repetition (not DRY)




            Too much is hard-coded at development/upload time!





            How do we make @server_ips dynamic?
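
The problem can be sketched in plain Ruby with ERB, the templating library in Ruby's standard library. This is an illustration, not the code from the slide: the template text, the `ConfigContext` class, the `INVENTORY` hash, and the `lookup_server_ips` helper are all hypothetical. In the first rendering the addresses are fixed when the code is written; in the second they come from a lookup made at render time.

```ruby
require 'erb'

# A one-line config template that reads the @server_ips instance variable.
TEMPLATE = ERB.new("servers = <%= @server_ips.join(',') %>\n")

# Hypothetical rendering context that supplies @server_ips to the template.
class ConfigContext
  def initialize(server_ips)
    @server_ips = server_ips
  end

  def render(template)
    template.result(binding)
  end
end

# Hard-coded: the list is frozen when this file is written or uploaded.
hard_coded = ConfigContext.new(['10.0.0.1', '10.0.0.2'])

# Dynamic: the list is looked up each time the config is rendered.
# INVENTORY stands in for whatever live source of truth exists.
INVENTORY = { 'web' => ['10.0.0.1', '10.0.0.2'] }

def lookup_server_ips(role)
  INVENTORY.fetch(role, [])
end

puts hard_coded.render(TEMPLATE)
INVENTORY['web'] << '10.0.0.3' # a new machine joins
puts ConfigContext.new(lookup_server_ips('web')).render(TEMPLATE)
```

The hard-coded context never sees the third machine unless someone edits and re-uploads the file; the lookup picks it up on the next render.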




Why We Wrote Ironfan
• Simplify, unify, and
  standardize our usage
  of the Chef toolset
• Build further
  abstractions on top of
  Chef
• Give us superpowers
  that Chef doesn’t have
  yet
            http://github.com/infochimps-labs/ironfan
What Does Ironfan Do?




         [Diagram: Ironfan layered on top of Chef]

         Simple helpers in the silverware cookbook
         abstract common Chef patterns
         and keep things DRY.

Dynamic service discovery:
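
As a rough sketch of the idea, here is a toy in-memory registry. It is not the silverware cookbook's actual implementation; the `ServiceRegistry` class and its method names are assumptions made for illustration. Each node announces the services it provides, and other nodes look up providers by name at run time instead of hard-coding addresses.

```ruby
# Toy in-memory registry. A real deployment would keep this state in a
# shared store rather than inside one process.
class ServiceRegistry
  def initialize
    @providers = Hash.new { |hash, key| hash[key] = [] }
  end

  # A node declares that it provides a component of a system.
  def announce(system, component, ip)
    key = [system, component]
    @providers[key] << ip unless @providers[key].include?(ip)
  end

  # Return every address currently announced for that component.
  def discover_all(system, component)
    @providers.fetch([system, component], []).dup
  end

  # Return one provider, or nil if none has announced yet.
  def discover(system, component)
    discover_all(system, component).first
  end
end

registry = ServiceRegistry.new
registry.announce(:hadoop, :namenode, '10.0.0.5')
registry.announce(:hadoop, :datanode, '10.0.0.6')
registry.announce(:hadoop, :datanode, '10.0.0.7')

puts registry.discover(:hadoop, :namenode)
puts registry.discover_all(:hadoop, :datanode).join(', ')
```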





A simple DSL for defining clusters of machines.
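
For readers new to Ruby DSLs, the sketch below shows how a block-based cluster definition can be built with `instance_eval`. It is a toy written for this walkthrough, not Ironfan's actual DSL; `ClusterSketch`, `FacetSketch`, and the `define_cluster` helper are hypothetical names.

```ruby
# Hypothetical group of identical machines within a cluster.
class FacetSketch
  attr_reader :name, :roles

  def initialize(name)
    @name = name
    @instance_count = 1
    @roles = []
  end

  # With an argument, set the count; without one, read it.
  def instances(count = nil)
    @instance_count = count if count
    @instance_count
  end

  def role(role_name)
    @roles << role_name
  end
end

# Hypothetical cluster: a named collection of facets.
class ClusterSketch
  attr_reader :name, :facets

  def initialize(name)
    @name = name
    @facets = []
  end

  def facet(facet_name, &block)
    new_facet = FacetSketch.new(facet_name)
    new_facet.instance_eval(&block) if block
    @facets << new_facet
    new_facet
  end

  # Expand the definition into one machine name per instance.
  def machine_names
    @facets.flat_map do |f|
      (0...f.instances).map { |i| "#{name}-#{f.name}-#{i}" }
    end
  end
end

def define_cluster(name, &block)
  cluster = ClusterSketch.new(name)
  cluster.instance_eval(&block) if block
  cluster
end

demo = define_cluster('demo') do
  facet :master do
    instances 1
    role 'namenode'
  end
  facet :worker do
    instances 2
    role 'datanode'
  end
end

puts demo.machine_names.inspect
```

The definition reads as a description of the cluster, while the classes behind it turn that description into a concrete list of machines.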




Big Data for Chimps




            May 2013



As we walk through Ironfan…
• Shortlink: http://infochim.ps/15INnv8

FYI
• We are hiring! (We have offices in Austin & SF.)
      – careers@infochimps.com
      – infochimps.com/careers
• Learn more about our enterprise product:
      – sales@infochimps.com


SXSWi Workshop: DevOps - Infrastructure as Code


Editor's notes

  1. Part I. Big Data for Chimps (table of contents)
     1. Hello, Early Releasers
        My Questions for You; Probable Contents; Not Contents; Feedback
     2. About
        What this book covers; Who this book is for; Who this book is not for; How this book is being written
     3. Hello, Reviewers
        Controversials; Style Nits
     4. First Exploration (ch. A)
        Where is Barbecue?; First Steps; Why?; Plot of this story; Exemplars and Touchstones; Data and features; Summarize every page on Wikipedia; Bin by Location; A pause, to think; Pulling signal from noise; Takeaways
     5. The Stream (ch. B)
        Exercises; Exercise 1.1: Running time; Exercise 1.2: A Petabyte-scale wc command
     6. Reshape Steps (ch. C)
        Locality of Reference; Locality: Examples; The Hadoop Haiku
     7. Chimpanzee and Elephant Save Christmas (ch. D)
        A Non-scalable approach; Letters to Toy Requests; Order Delivery; Toy Assembly; Why it’s efficient; Sorted Batches; The Map-Reduce Haiku; The Reducer Guarantee; Partition Key and Sort Key
     8. Geo Data
        Spatial Data; Geographic Data Model; Geospatial JOIN using quadtiles; The Quadtile Grid System; Patterns in UFO Sightings; Mapper: dispatch objects to rendezvous at quadtiles; Reducer: combine objects on each quadtile; Comparing Distributions; Data Model; GeoJSON; Quadtile Practicalities; Converting points to quadkeys (quadtile indexes); Exploration; Interesting quadtile properties; Quadtile Ready Reference; Working with paths; Calculating Distances; Distributing Boundaries and Regions to Grid Cells; Adaptive Grid Size; Tree structure of Quadtile indexing; Map Polygons to Grid Tiles; Weather Near You; Find the Voronoi Polygon for each Weather Station; Break polygons on quadtiles; Map Observations to Grid Cells; K-means clustering to summarize; Keep Exploring; Exercises; References
     9. Log Processing
        Data Model; Simple Log Parsing; Parser script; Histograms; User Paths through the site (“Sessionizing”); Page-Page similarity; Geo-IP Matching; Range Queries; Using Hadoop for website stress testing (“Benign DDoS”)
     10. Why Hadoop Works
        Disk is the new tape; Hadoop is Secretly Fun; Economics; Notes
     11. Sampling
        Consistent Random Sampling; Random Sampling using strides; Constant-Memory “Reservoir” Sampling; References
     12. Hadoop Execution in Detail
        Launch; Split; Mappers; Choosing a file size; Jobs with Map and Reduce; Mapper-only jobs
     13. Pathology of Tuning (aka “when you should touch that dial”)
        Mapper; A few map tasks take noticeably longer than all the rest; Tons of tiny little mappers; Many non-local mappers; Map tasks “spill” multiple times; Job output files that are each slightly larger than an HDFS block; Reducer; Tons of data to a few reducers (high skew); Reducer merge (sort+shuffle) is longer than Reducer processing; Output Commit phase is longer than Reducer processing; Way more total data to reducers than cumulative cluster RAM; System; Excessive Swapping; Out of Memory / No C+B reserve; Stop-the-world (STW) Garbage collections; Checklist; Other; Basic Checks
     14. Hadoop Metrics
        The USE Method applied to Hadoop; Look for the Bounding Resource; Resource List; See What’s Happening; JMX (Java Monitoring Extensions); Rough notes
     15. Data Formats and Schemata
        Good Format 1: TSV (It’s simple); Good Format 2: JSON (It’s Generic and Ubiquitous); structured to model.; Good Format #3: Avro (It does everything right); Other reasonable choices: tagged net strings and null-delimited documents; Crap format #1: XML; Writing XML; Crap Format #2: N3 triples; Crap Format #3: Flat format; Web log and Regexpable; Glyphing (string encoding), Unicode, UTF-8; ICSS; Schema.org Types; Munging
     16. HBase Data Model
        Row Key, Column Family, Column Qualifier, Timestamp, Value; Keep it Stupidly Simple; Help HBase be Lazy; Row Locality and Compression; Simple Table; Airport Metadata; Airport Timezone; Range Lookup; Geographic Data; Multi-scale indexing; Wikipedia: Corpus and Graph; Graph Data; Web Logs: Rows-As-Columns; Column Families; Atomic Counters; Most-Frequent URLs; Most-Recent URLs; Rollup columns; Row Locality; adjacency is good; adjacency is bad; Vertical Partitioning (Column Families); Feature Set review; “Design for Reads”; References
     17. Semi-Structured Data
        Wikipedia Metadata; Wikipedia Pageview Stats (importing TSV); Assembling the namespace join table; Getting file metadata in a Wukong (or any Hadoop streaming) Script; Wikipedia Article Metadata (importing a SQL Dump); Necessary Bullcrap #76: Bad encoding; Wikipedia Page Graph; Target Domain Models; XML Data (Wikipedia Corpus); Extract, Translate, Canonicalize