Configuring, deploying, and managing Big Data infrastructure, Hadoop in particular, is time-consuming and expensive. Infochimps’ Ironfan is an open source systems configuration suite for the cloud that quickly and easily orchestrates an entire Big Data stack, including data ingestion, scraping, storage, computation, and monitoring. With Ironfan, you can spin up clusters when you need them and turn them off when you don’t, so you can spend your time, money, and engineering focus on finding insights and creating value rather than on getting your machines ready. These are the slides from the SXSWi workshop, where attendees learned how to go from a single development machine to a full-stack cloud deployment.
SXSWi Workshop: DevOps - Infrastructure as Code
1. DevOps: Empowering Developers with Infrastructure
SXSW 2013 – Tuesday, March 12
Go here: http://infochim.ps/15INnv8
Nathan Eliot - @temujin9
Ryan Miller - @rmiller107
Amanda McGuckin-Hager - @shoogie
Tim Gasper - @timgasper
3/12/2013 #ironfan #devops #sxsw #bigdata #chef 1
2. Agenda
http://infochim.ps/15INnv8
1. Intros - Housekeeping (15 min – 15 total)
2. Initial Setup (30 min – 45 total)
3. Debug Initial Set Up (30-45 min – 1:15 total)
4. Standing Up a Simple Cluster (30-60 min – 2:15 total)
5. Hadoop! (30-60 min – 3:15 total)
6. General Q&A (30-60 min – 4:00 total)
3. Key Ironfan Contributors
• Flip Kromer, @mrflip
– CTO of Infochimps
• Nathaniel Eliot, @temujin9
– Ops Engineer of Infochimps
• Chris Howe
– System Architect at Civitas Learning
4. Infochimps Enterprise Cloud for Big Data
CUSTOMER APPLICATIONS
• Custom Applications (Java, Python, etc.)
• Business Intelligence (Cognos, BOBJ, Microstrategy)
• Packaged Apps (ERP, CRM, etc.)
5. Why We Love Chef
• Infrastructure as Code
– Version Control
– Shareable
– Testable
– Recapitulable
6. Why We Love Chef
MySQL Nginx SOLR
My Application
7. Why We Love Chef
8. Why We Don’t Love Chef
• Anything is possible
• Nothing is simple
• There’s too much repetition (not DRY)
9. Why We Don’t Love Chef
Too much is hard-coded at development/upload time!
10. Why We Don’t Love Chef
How do we make @server_ips dynamic?
11. Why We Wrote Ironfan
• Simplify, unify, and standardize our usage of the Chef toolset
• Build further abstractions on top of Chef
• Give us superpowers that Chef doesn’t have yet
http://github.com/infochimps-labs/ironfan
12. What Does Ironfan Do
Ironfan
Simple helpers in the silverware cookbook
abstract common Chef patterns
and keep things DRY.
Chef
13. What Does Ironfan Do
Dynamic service discovery:
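The code shown on this slide is not preserved in the transcript. As a rough illustration of the idea only, the sketch below replaces a hard-coded list such as @server_ips with a lookup at configuration time. It is written in Python rather than Chef's Ruby, and every name in it (Registry, announce, discover_all, render_upstreams, the "my_app"/"web" labels, port 8080) is a hypothetical stand-in, not Ironfan's or Chef's actual API.

```python
class Registry:
    """In-memory stand-in for a shared node index (hypothetical, for illustration)."""

    def __init__(self):
        self._services = {}

    def announce(self, system, component, ip):
        """Record that the node at `ip` provides system/component."""
        self._services.setdefault((system, component), []).append(ip)

    def discover_all(self, system, component):
        """Return every announced address for system/component, in announce order."""
        return list(self._services.get((system, component), []))


def render_upstreams(registry):
    """Build config lines from discovered addresses instead of a hard-coded list."""
    server_ips = registry.discover_all("my_app", "web")
    return [f"server {ip}:8080;" for ip in server_ips]


registry = Registry()
# Each web node announces itself when it is configured.
registry.announce("my_app", "web", "10.0.0.11")
registry.announce("my_app", "web", "10.0.0.12")

# The load balancer's config is rendered from whatever has been announced so far.
print("\n".join(render_upstreams(registry)))
```

Adding a third web node then only requires that node to announce itself; the consumer's template does not change.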
14. What Does Ironfan Do
A simple DSL for defining clusters of machines.
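The DSL example from this slide did not survive in the transcript, and Ironfan's real DSL is Ruby. The Python sketch below only illustrates the general shape of a declarative cluster definition: a cluster is described as named groups of identical machines, and the concrete machine list is derived from that description. The class names, the facet/roles vocabulary, and the "cluster-facet-index" naming scheme are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Facet:
    """A group of identical machines within a cluster (hypothetical model)."""
    name: str
    instances: int
    roles: list = field(default_factory=list)


@dataclass
class Cluster:
    """A named collection of facets (hypothetical model)."""
    name: str
    facets: list = field(default_factory=list)

    def facet(self, name, instances, roles=()):
        """Declare a facet; returns self so declarations can be chained."""
        self.facets.append(Facet(name, instances, list(roles)))
        return self

    def machines(self):
        """Expand the declaration into one name per machine."""
        return [
            f"{self.name}-{f.name}-{i}"
            for f in self.facets
            for i in range(f.instances)
        ]


demo = (
    Cluster("demo")
    .facet("web", instances=2, roles=["nginx"])
    .facet("db", instances=1, roles=["mysql"])
)
print(demo.machines())  # ['demo-web-0', 'demo-web-1', 'demo-db-0']
```

Scaling the web tier in this model means changing one number in the declaration rather than editing a list of hosts.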
15. Big Data for Chimps
May 2013
16. As we walk through Ironfan…
• Shortlink: http://infochim.ps/15INnv8
FYI
• We are hiring! (we have offices in Austin & SF)
– careers@infochimps.com
– infochimps.com/careers
• Learn more about our enterprise product:
– sales@infochimps.com
Editor's notes
Part I. Big Data for Chimps
1. Hello, Early Releasers
• My Questions for You
• Probable Contents
• Not Contents
• Feedback
2. About
• What this book covers
• Who this book is for
• Who this book is not for
• How this book is being written
3. Hello, Reviewers
• Controversials
• Style Nits
4. First Exploration (ch. A)
• Where is Barbecue?
• First Steps
• Why?
• Plot of this story
• Exemplars and Touchstones
• Data and features
• Summarize every page on Wikipedia
• Summarize every page on Wikipedia
• Bin by Location
• A pause, to think
• Pulling signal from noise
• Takeaways
5. The Stream (ch. B)
• Exercises
• Exercise 1.1: Running time
• Exercise 1.2: A Petabyte-scale wc command
6. Reshape Steps (ch. C)
• Locality of Reference
• Locality: Examples
• The Hadoop Haiku
7. Chimpanzee and Elephant Save Christmas (ch. D)
• A Non-scalable approach
• Letters to Toy Requests
• Order Delivery
• Toy Assembly
• Why it’s efficient
• Sorted Batches
• The Map-Reduce Haiku
• The Reducer Guarantee
• Partition Key and Sort Key
8. Geo Data
• Spatial Data
• Geographic Data Model
• Geospatial JOIN using quadtiles
• The Quadtile Grid System
• Patterns in UFO Sightings
• Mapper: dispatch objects to rendezvous at quadtiles
• Reducer: combine objects on each quadtile
• Comparing Distributions
• Data Model
• GeoJSON
• Quadtile Practicalities
• Converting points to quadkeys (quadtile indexes)
• Exploration
• Interesting quadtile properties
• Quadtile Ready Reference
• Working with paths
• Calculating Distances
• Distributing Boundaries and Regions to Grid Cells
• Adaptive Grid Size
• Tree structure of Quadtile indexing
• Map Polygons to Grid Tiles
• Weather Near You
• Find the Voronoi Polygon for each Weather Station
• Break polygons on quadtiles
• Map Observations to Grid Cells
• K-means clustering to summarize
• Keep Exploring
• Exercises
• References
9. Log Processing
• Data Model
• Simple Log Parsing
• Parser script
• Histograms
• User Paths through the site (“Sessionizing”)
• Page-Page similarity
• Geo-IP Matching
• Range Queries
• Using Hadoop for website stress testing (“Benign DDos”)
10. Why Hadoop Works
• Disk is the new tape
• Hadoop is Secretly Fun
• Economics
• Notes
11. Sampling
• Consistent Random Sampling
• Random Sampling using strides
• Constant-Memory “Reservoir” Sampling
• References
12. Hadoop Execution in Detail
• Launch
• Split
• Mappers
• Choosing a file size
• Jobs with Map and Reduce
• Mapper-only jobs
13. Pathology of Tuning (aka “when you should touch that dial”)
• Mapper
• A few map tasks take noticeably longer than all the rest
• Tons of tiny little mappers
• Many non-local mappers
• Map tasks “spill” multiple times
• Job output files that are each slightly larger than an HDFS block
• Reducer
• Tons of data to a few reducers (high skew)
• Reducer merge (sort+shuffle) is longer than Reducer processing
• Output Commit phase is longer than Reducer processing
• Way more total data to reducers than cumulative cluster RAM
• System
• Excessive Swapping
• Out of Memory / No C+B reserve
• Stop-the-world (STW) Garbage collections
• Checklist
• Other
• Basic Checks
14. Hadoop Metrics
• The USE Method applied to Hadoop
• Look for the Bounding Resource
• Resource List
• See What’s Happening
• JMX (Java Monitoring Extensions)
• Rough notes
15. Data Formats and Schemata
• Good Format 1: TSV (It’s simple)
• Good Format 2: JSON (It’s Generic and Ubiquitous)
• structured to model.
• Good Format #3: Avro (It does everything right)
• Other reasonable choices: tagged net strings and null-delimited documents
• Crap format #1: XML
• Writing XML
• Crap Format #2: N3 triples
• Crap Format #3: Flat format
• Web log and Regexpable
• Glyphing (string encoding), Unicode, UTF-8
• ICSS
• Schema.org Types
• Munging
16. HBase Data Model
• Row Key, Column Family, Column Qualifier, Timestamp, Value
• Keep it Stupidly Simple
• Help HBase be Lazy
• Row Locality and Compression
• Simple Table
• Airport Metadata
• Airport Timezone
• Range Lookup
• Geographic Data
• Multi-scale indexing
• Wikipedia: Corpus and Graph
• Graph Data
• Web Logs: Rows-As-Columns
• Column Families
• Atomic Counters
• Most-Frequent URLs
• Most-Recent URLs
• Rollup columns
• Row Locality
• adjacency is good
• adjacency is bad
• Vertical Partitioning (Column Families)
• Feature Set review
• “Design for Reads”
• References
17. Semi-Structured Data
• Wikipedia Metadata
• Wikipedia Pageview Stats (importing TSV)
• Assembling the namespace join table
• Getting file metadata in a Wukong (or any Hadoop streaming) Script
• Wikipedia Article Metadata (importing a SQL Dump)
• Necessary Bullcrap #76: Bad encoding
• Wikipedia Page Graph
• Target Domain Models
• XML Data (Wikipedia Corpus)
• Extract, Translate, Canonicalize