Getting Started with Hadoop
Josh Devins
Nokia


Berlin Expert Days
April 8, 2011
Berlin, Germany
http://www.flickr.com/photos/haiko/154105048/

how did we get here?
* Google crawls the web, surfaces the "big data" problem
* big data problem defined: so much data that it cannot be processed by one individual machine
* (also defined as: so much data that you need a team of people to manage it)
* solve it: use multiple machines
http://www.flickr.com/photos/jamisonjudd/2433102356/
http://www.flickr.com/photos/torkildr/3462607995/
http://www.flickr.com/photos/torkildr/3462606643/

* since 1999, Google engineers wrote complex distributed programs to analyze crawled data
* too complex, not accessible
* requirement: must be easy for engineers with little to no distributed computing and large data processing experience
  * fault tolerance
  * scaling
  * simple coding experience
  * easy to teach
  * visibility/monitorability
• Google implements MapReduce and GFS
• GFS paper published (Ghemawat, et al)
basic history of MapReduce at Google

* 2003 Google implements MapReduce and GFS
   * to support large-scale, distributed computing on large data sets using commodity hardware
* basically to make data crunching a reality for "regular" Google engineers
* 2003 GFS paper published by Sanjay Ghemawat, et al
• MapReduce paper published (Jeffrey Dean and Sanjay Ghemawat)
• MapReduce patent application (2004 applied, 2010 approved)



* 2004 MapReduce paper published by Jeffrey Dean and Sanjay Ghemawat
   * http://labs.google.com/papers/mapreduce.html
* MR is patented by Google (2004 applied, 2010 approved), but Google supports Hadoop completely and uses the patent defensively only (to ensure that everyone can use it)
* http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331
• 2004 Doug Cutting and Mike Cafarella create an implementation for Nutch
• 2006 Doug Cutting joins Yahoo!
• 2006 Hadoop split out from Nutch
• 2006 Yahoo! search index building powered by Hadoop
• 2007 Yahoo! runs 2x 1,000-node R&D clusters
• 2008 Hadoop wins the 1TB sort benchmark in 209s on 900 nodes
• 2008 Cloudera founded by ex-Oracle, Yahoo! and Facebook employees
• 2009 Cutting leaves Yahoo! for Cloudera

evolution into Hadoop, a natural continuation of the Google work, in the public domain

* implemented for Nutch's index creation, relying on NDFS (the Nutch distributed filesystem)
* Nutch is a web crawler and search engine based on Lucene
[diagram: book → map1 … mapn → reduce → summary]
so what the hell is it already?
“a distributed batch processing system”
the non-technical example, courtesy of Matt Biddulph: give n people a book to read and get reports back from them
the map/reduce parts can be parallelized (the section in the outer box of the diagram)
map(String key, String value):
  // key: document name
  // value: one row/line from the document
  for each w in value:
    EmitIntermediate(w, 1);

sortAndGroup(List<Pair<String, Integer>> mapOut)

reduce(String key, Iterator<Integer> values):
  // key: a word
  // values: a list of counts
  Integer count = 0;
  for each v in values:
    count += v;
  Emit(key, count);

similar to the previous example of reports
(simplified) canonical example of word counting
* give those same n people, or mappers, each a line from the document and have them write down a ‘1’ for every word they see
* the collector is responsible for summing up all the ‘1’s per word
* not a ‘pure function’ (the ‘emit’ methods have side-effects; the implementation in Hadoop has side-effects)
* based on, but not exactly, ‘map’ and ‘reduce’ in the strict functional sense

map function takes:
- key as document name
- value as the line from the document

map function emits:
- key as the word
- value as the number 1 (I’ve seen this word one time)

reduce function takes:
 - key as the word
 - list of values as a list of 1’s -- one for each time the word was seen by a mapper

reduce function emits: the word, the sum of the number of times the word was encountered by a mapper
map input:
(doc1,start of the first document)
(doc1,the document is super interesting)
(doc1,end of the first document)

map output:
(start,1) (of,1) (the,1) (first,1) (document,1)
(the,1) (document,1) (is,1) (super,1) (interesting,1)
(end,1) (of,1) (the,1) (first,1) (document,1)

sort:
(start,1) (of,1)(of,1) (the,1)(the,1)(the,1)
(first,1)(first,1) (document,1)(document,1)(document,1)
(is,1) (super,1) (interesting,1) (end,1)

group (reduce input):
(start,{1}) (of,{1,1}) (the,{1,1,1}) (first,{1,1})
(document,{1,1,1})
(is,{1}) (super,{1}) (interesting,{1}) (end,{1})

reduce output:
(start,1) (of,2) (the,3) (first,2) (document,3)
(is,1) (super,1) (interesting,1) (end,1)
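the whole trace above can be simulated in a few lines of plain Python (no Hadoop required); map_fn, sort_and_group and reduce_fn are illustrative stand-ins for the framework's phases, not Hadoop APIs:

```python
from itertools import groupby

def map_fn(doc, line):
    # emit an intermediate (word, 1) pair for every word on the line
    return [(w, 1) for w in line.split()]

def sort_and_group(pairs):
    # shuffle/sort: bring all values for the same key together
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=lambda kv: kv[0])]

def reduce_fn(word, values):
    # sum the per-mapper counts for one word
    return (word, sum(values))

lines = [
    "start of the first document",
    "the document is super interesting",
    "end of the first document",
]
intermediate = [kv for line in lines for kv in map_fn("doc1", line)]
counts = dict(reduce_fn(k, vs) for k, vs in sort_and_group(intermediate))
# counts matches the reduce output above, e.g. the -> 3, document -> 3, of -> 2
```

the real framework runs map_fn on many machines in parallel and does the sort/group during the shuffle; the logic per record is exactly this simple.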
HDFS

logical file view

HDFS primer
* block structure
* standard block size (64MB by default at the time)
* replicated blocks, 3x by default
* input task per block
* data locality
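a back-of-the-envelope sketch of what those defaults imply for one file; the function name is illustrative, and 64MB/3x are the common defaults of that era, not universal constants:

```python
import math

BLOCK_SIZE = 64 * 1024 ** 2  # 64MB, the common default HDFS block size at the time
REPLICATION = 3              # default replication factor

def hdfs_footprint(file_size_bytes):
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION,  # physical blocks stored cluster-wide
        "map_tasks": blocks,  # by default, one input split (and one map task) per block
    }

layout = hdfs_footprint(1 * 1024 ** 3)  # a 1GB input file -> 16 blocks
```

so a 1GB file costs roughly 3GB of raw disk across the cluster and fans out into 16 map tasks, which is where the data locality story starts.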
[diagram: HDFS write path, steps 1–4]

* high-level, physical view of HDFS
* walk through the write operation steps
[diagram: MapReduce job run, steps 1–3]

* job run
* data/processing locality (best-effort attempt)
* can’t always achieve data-local processing though
* stats will show how many data-local map tasks were run
Nomenclature Review

• HDFS
   • NameNode: metadata, coordination
   • DataNode: storage, retrieval, replication
• MapReduce
   • JobTracker: job coordination
   • TaskTracker: task management (map and reduce)

* saw all of these pieces in the previous slides
Hadoop ecosystem
Yahoo!
Facebook
Cloudera
* Avro started at Yahoo! by Doug Cutting, continues work at Cloudera
LinkedIn
Other (Amazon: AWS Elastic MapReduce; Chris Wensel: Cascading; Infochimps: Wukong; Google: Protocol Buffers)
Diving In

   • Cloudera training VM, CDH3b3

   • github.com/joshdevins/talks-hadoop-getting-started

   • Exercise:

      • analyse Apache access logs from mac-geeks.de

      • use raw Java MapReduce API, MRUnit

      • use Pig, PigUnit

      • simple visualization/dashboard


* Cloudera VM, pre-installed with CDH (Cloudera Distribution for Hadoop): http://cloudera-vm.s3.amazonaws.com/cloudera-demo-0.3.5.tar.bz2?downloads (username/password: cloudera/cloudera)
* thanks @maxheadroom, mac-geeks.de
* throughput analysis
* Pig is a high-level abstraction on MR providing a ‘data flow’ language, with constructs similar to SQL
1.2.3.4 - - [30/Sep/2010:15:07:53 -0400] "GET /foo HTTP/1.1" 200 3190
1.2.3.4 - - [30/Sep/2010:15:07:53 -0400] "GET /bar HTTP/1.1" 404 3190
1.2.3.4 - - [30/Sep/2010:15:07:54 -0400] "GET /foo HTTP/1.1" 200 3190
1.2.3.4 - - [30/Sep/2010:15:07:54 -0400] "GET /foo HTTP/1.1" 200 3190




group by second:
(30/Sep/2010:15:07:53, 1)
(30/Sep/2010:15:07:54, 2)

group by hour:
(30/Sep/2010:15:00:00, {(30/Sep/2010:15:07:53, 1),
                        (30/Sep/2010:15:07:54, 2)})

count, find max:
(30/Sep/2010:15:00:00, 3, 2)

general approach
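before committing to a Pig script, the same grouping can be sketched in plain Python against those four log lines; the regex and variable names are illustrative, and note that only the 200 responses are counted (which is how the hour rolls up to a total of 3 with a peak of 2/sec):

```python
import re
from collections import Counter

LOG_RE = re.compile(r'\[(?P<ts>[^ ]+) [^\]]+\] "[^"]+" (?P<status>\d{3})')

logs = [
    '1.2.3.4 - - [30/Sep/2010:15:07:53 -0400] "GET /foo HTTP/1.1" 200 3190',
    '1.2.3.4 - - [30/Sep/2010:15:07:53 -0400] "GET /bar HTTP/1.1" 404 3190',
    '1.2.3.4 - - [30/Sep/2010:15:07:54 -0400] "GET /foo HTTP/1.1" 200 3190',
    '1.2.3.4 - - [30/Sep/2010:15:07:54 -0400] "GET /foo HTTP/1.1" 200 3190',
]

# group by second, keeping only successful (200) requests
per_second = Counter()
for line in logs:
    m = LOG_RE.search(line)
    if m and m.group("status") == "200":
        per_second[m.group("ts")] += 1

# roll the seconds up into hours: total count and peak requests/second
per_hour = {}
for ts, n in per_second.items():
    hour = ts[:ts.index(":") + 3] + ":00:00"  # 30/Sep/2010:15:07:53 -> 30/Sep/2010:15:00:00
    total, peak = per_hour.get(hour, (0, 0))
    per_hour[hour] = (total + n, max(peak, n))
```

in Pig the two rollups become two GROUP BY statements with COUNT and MAX, and the framework handles distributing the grouping across the cluster.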
Code




github.com/joshdevins/talks-hadoop-getting-started
Hadoop at Nokia




* Nokia Berlin - location based services
Global Architecture




* remote DCs: Singapore, Beijing, Atlanta, Mumbai
* central DC: Slough/London
* R&D DCs and Hadoop clusters: Berlin, Boston
Hardware

       DC    LONDON        BERLIN
    cores    12x (w/ HT)   4x 2.00 GHz (w/ HT)
      RAM    48GB          16GB
    disks    12x 2TB       4x 1TB
  storage    24TB          4TB
      LAN    1Gb           2x 1Gb (bonded)
http://www.flickr.com/photos/torkildr/3462607995/in/photostream/

BERLIN
* HP DL160 G6
* 1x Quad-core Intel Xeon E5504 @ 2.00 GHz (4-cores total)
* 16GB DDR3 RAM
* 4x 1TB 7200 RPM SATA
* 2x 1Gb LAN
* iLO Lights-Out 100 Advanced
Meaning?

• Size
   • Berlin: 2 master nodes, 13 data nodes, ~17TB HDFS
   • London: “large enough to handle a year’s worth of activity log data, with plans for rapid expansion”
• Scribe
   • 250,000 1KB msg/sec
   • 244MB/sec, 14.3GB/hr, 343GB/day

http://www.flickr.com/photos/torkildr/3462607995/in/photostream/
Reporting




operational - access logs, throughput, general usage, dashboards
business reporting - what are all of the products doing, how do they compare to other months
ad-hoc - random business queries

* almost all of this goes through Pig at some point
* pipelines with Oozie
* sometimes parsing and decoding in a Java MR job, then Pig for the heavy lifting
* mostly goes into an RDBMS using Sqoop for display and querying in other tools
* Tableau for some dashboards and quick visualizations
* many JS libs for good visualization/dashboarding
* sometimes roll your own with image libraries in Python, Ruby, etc.
IKEA!




other than reporting, we also occasionally do some data exploration, which can be quite fun
any guesses what this is a plot of?
geo-searches for Ikea!
[map: Ikea geo-searches in Berlin; labels: Prenzl Berg Yuppies, Ikea Spandau, Ikea Tempelhof, Ikea Schoenefeld]
Ikea geo-searches bounded to Berlin
can we make any assumptions about what the actual locations are?
kind of, but not much data here
clearly there is a Tempelhof cluster but the others are not very evident
certainly shows the relative popularity of all the locations
Ikea Lichtenberg was not open yet during this time frame
[map: Ikea geo-searches in London; labels: Ikea Edmonton, Ikea Wembley, Ikea Lakeside, Ikea Croydon]
Ikea geo-searches bounded to London
can we make any assumptions about what the actual locations are?
turns out we can!
using a clustering algorithm like K-Means (maybe from Mahout) we probably could guess

> this is considering search location, what about time?
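a rough sketch of what that clustering step could look like; this is a toy K-Means (Mahout would do this at scale), and the coordinates are made-up stand-ins for two well-separated blobs of search locations:

```python
import random

def kmeans(points, k, iters=20, seed=1):
    # toy Lloyd's algorithm: assign to nearest center, then move centers to cluster means
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

# two artificial blobs of (lat, lon) search coordinates (illustrative numbers only)
blob_berlin = [(52.53 + i * 0.001, 13.20 + i * 0.001) for i in range(10)]
blob_london = [(51.61 + i * 0.001, -0.28 + i * 0.001) for i in range(10)]
centers = kmeans(blob_berlin + blob_london, k=2)
# centers ends up near the two blob means, i.e. near the two store locations
```

with real search data you would feed the raw coordinates in and read the store locations back out of the cluster centers.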
Berlin
distribution of searches over days of the week and hours of the day
certainly can make some comments about the hours that Berliners are awake
can we make assumptions about average opening hours?
Berlin
upwards trend a couple hours before opening
can also clearly make some statements about the best time to visit Ikea in Berlin - Sat night!

BERLIN
 * Mon-Fri 10am-9pm
 * Saturday 10am-10pm
London
more data points again so we get smoother results
London
LONDON
 * Mon-Fri 10am-10pm
 * Saturday 9am-10pm
 * Sunday 11am-5pm

> potential revenue stream?
> what to do with this data or data like this?
Productizing
Berlin




another example of something that can be productized

Berlin
 * traffic sensors
 * map tiles
Los Angeles




LA
 * traffic sensors
 * map tiles
Berlin   Los Angeles
Join Us

• Nokia is hiring in Berlin!

• software engineers

• operations engineers

• josh.devins@nokia.com

• www.nokia.com/careers
Thanks!
Josh Devins
www.joshdevins.net
info@joshdevins.net
@joshdevins

code: github.com/joshdevins/talks-hadoop-getting-started
