Large-scale data processing
  at SARA and BiG Grid
   with Apache Hadoop




         Evert Lammerts
      April 10, 2012, SZTAKI
First off...

                  About me
 Consultant for SARA's eScience & Cloud Services
    Technical lead for LifeWatch Netherlands
            Lead Hadoop infrastructure


                  About you
Who uses large-scale computing as a supporting tool?
 For whom is large-scale computing core business?
In this talk
Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
Large-scale data processing?
         Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
            Hadoop @ SARA & BiG Grid
Three observations




I: Data is easier to collect
(Jimmy Lin, University of Maryland / Twitter, 2011)
More business is done on-line
   Mobile devices are more sophisticated
       Governments collect more data
 Sensing devices are becoming a commodity
  Technology advanced: DNA sequencers!
Enormous funding for research infrastructures
                And so on...

     Lesson: everybody collects data

       Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016
Three observations




II: Data is easier to store
Storage price decreases




            http://www.mkomo.com/cost-per-gigabyte
Storage capacity increases




        http://en.wikipedia.org/wiki/File:Hard_drive_capacity_over_time.svg
Three observations




III: Quantity beats quality
(IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
s/knowledge/data/g




             Jimmy Lin, University of Maryland / Twitter, 2011
How are these observations addressed?


   We collect data, we store data, we have the
  knowledge to interpret data. What tools do we
        have that bring these together?


Pioneers: HPC centers, universities, and in recent
  years, Internet companies. (Lots of knowledge
              exchange, by the way.)
Some background (bear with me...) 1/3




                             Amdahl's Law
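(Not on the original slide, but for reference: the usual statement of Amdahl's Law, with p the parallelizable fraction of the work and N the number of workers.)

$$S(N) = \frac{1}{(1 - p) + \frac{p}{N}} \quad\Longrightarrow\quad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}$$

However many machines are added, the serial fraction caps the achievable speedup.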
Some background (bear with me...) 2/3




        (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
Some background (bear with me...) 3/3

Nodes (x2000):
8GB DRAM
4 x 1TB disks

Rack:
40 nodes
1Gbps switch

Datacenter:
8Gbps rack-to-cluster
 switch connection




                         (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
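(A back-of-the-envelope aggregate derived only from the figures above, not taken from the source.)

$$2000 \times 8\,\text{GB} = 16\,\text{TB of DRAM}, \qquad 2000 \times 4 \times 1\,\text{TB} = 8\,\text{PB of disk}$$

Each rack offers 40 × 1 Gbps to its nodes but only an 8 Gbps uplink, roughly 5:1 oversubscription, which is one reason to move computation to the data rather than the other way around.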
(NYT, 14/06/2006)
Large-scale data processing?
         Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
            Hadoop @ SARA & BiG Grid
SARA
    the national center for scientific computing




Facilitating Science in The Netherlands with Equipment for
 and Expertise on Large-Scale Computing, Large-Scale
 Data Storage, High-Performance Networking,
       eScience, and Visualization
Large-scale data != new
Compute @ SARA
Case Study: Virtual Knowledge Studio


How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)

2.7 TB raw text, single file

Java application, searches for categories in Wiki markup, like [[Category:NAME]] (a minimal sketch follows below)

Executed on the Grid




               http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
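(Illustration only, not the VKS code: a minimal Java sketch of pulling [[Category:NAME]] tags out of a revision's wiki markup; class and method names are invented.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal sketch: extract category names from one revision's wiki markup. */
public class CategoryExtractor {
    // Matches [[Category:NAME]] and [[Category:NAME|sortkey]]
    private static final Pattern CATEGORY =
            Pattern.compile("\\[\\[Category:([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    public static List<String> extract(String wikiMarkup) {
        List<String> categories = new ArrayList<>();
        Matcher m = CATEGORY.matcher(wikiMarkup);
        while (m.find()) {
            categories.add(m.group(1).trim());
        }
        return categories;
    }

    public static void main(String[] args) {
        String revision = "Text... [[Category:Bridges]] [[Category:Netherlands|NL]]";
        System.out.println(extract(revision)); // prints [Bridges, Netherlands]
    }
}
```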
Case Study: Virtual Knowledge Studio
Method
Take an article, including history, as input
Extract categories and links for each revision
Output all links for each category, per revision
Aggregate all links for each category, per revision
Generate graph linking all categories on links, per revision
Case Study: Virtual Knowledge Studio




1.1) Copy file from local machine to Grid storage
2.1) Stream file from Grid storage to a single machine
2.2) Cut into pieces of 10 GB
2.3) Stream the pieces back to Grid storage
3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
Large-scale data processing?
         Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
            Hadoop @ SARA & BiG Grid
A bit of history




2002: Nutch*
2004: MR/GFS**
2006: Hadoop




                *  http://nutch.apache.org/
                ** http://labs.google.com/papers/mapreduce.html
                   http://labs.google.com/papers/gfs.html
2010 - 2012: A Hype in Production




http://wiki.apache.org/hadoop/PoweredBy
What's different about Hadoop?


No more do-it-yourself parallelism – it's hard!
 But rather linearly scalable data parallelism

    Separating the what from the how




                      (The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
Core principles
Scale out, not up
Move processing to the data
Process data sequentially, avoid random reads
Seamless scalability




                                 (Jimmy Lin, University of Maryland / Twitter, 2011)
A typical data-parallel problem in abstraction
Iterate over a large number of records
Extract something of interest
Create an ordering in intermediate results
Aggregate intermediate results
Generate output


  MapReduce: functional abstraction of step 2 & step 4




                                   (Jimmy Lin, University of Maryland / Twitter, 2011)
MapReduce
Programmer specifies two functions
map(k, v) → <k', v'>*
reduce(k', v') → <k', v'>*
All values associated with a single key are sent to the same
  reducer


              The framework handles the rest
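(A minimal, hedged sketch of those two functions, using the classic word count in the Hadoop Java API rather than code from this deck; class names are invented.)

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** map(k, v) -> <k', v'>* : emit (word, 1) for every word on a line. */
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // k' = word, v' = 1
        }
    }
}

/** reduce(k', [v']) -> <k', v''> : all counts for one word arrive at the same reducer. */
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```

All (word, 1) pairs for the same word end up at the same reducer, which is exactly the guarantee stated above.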
The rest?




Scheduling, data distribution, ordering,
  synchronization, error handling...
Case Study: Virtual Knowledge Studio
                    This is how it would be done with Hadoop




1) Load the file into HDFS
2) Submit the code to MapReduce (a job-submission sketch follows below)

Automatic distribution of data, parallelism based on data, automatic ordering of intermediate results
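(A hedged sketch of step 2 as a Hadoop driver class, reusing the mapper and reducer from the earlier word-count sketch; class names and paths are invented.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Step 2 in code: configure a job and hand it to the framework. */
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);      // optional local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // data already in HDFS (step 1)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Scheduling, distribution, ordering, retries: handled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Step 1 would be something like `hadoop fs -put` of the input file into HDFS, after which the packaged job is submitted with `hadoop jar`.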
The ecosystem




The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012
Large-scale data processing?
         Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
            Hadoop @ SARA & BiG Grid
Timeline
2009: Piloting Hadoop on Cloud
2010: Test cluster available for scientists
      6 machines * 4 cores / 24 TB storage / 16 GB RAM
      Just me!
2011: Funding granted for production service
2012: Production cluster available (~March)
      72 machines * 8 cores / 8 TB storage / 64 GB RAM
      Integration with Kerberos for secure multi-tenancy
Architecture
Components




Hadoop, Hive, Pig, HBase, HCatalog - others?
What are scientists doing?
Information Retrieval
Natural Language Processing
Machine Learning
Econometrics
Bioinformatics
Computational Ecology / Ecoinformatics
Machine learning: Infrawatch, Hollandse Brug
Structural health monitoring




145 sensors × 100 Hz × 60 seconds × 60 minutes × 24 hours × 365 days = large data



                                    (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
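(Worked out, my arithmetic rather than the slide's:)

$$145 \times 100 \times 60 \times 60 \times 24 \times 365 \approx 4.6 \times 10^{11}\ \text{sensor readings per year}$$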
And others: NLP & IR
e.g. ClueWeb: a ~13.4 TB webcrawl
e.g. Twitter gardenhose data
e.g. Wikipedia dumps
e.g. del.icio.us & flickr tags
Finding named entities: [person company place] names
Creating inverted indexes (a sketch follows below)
Piloting real-time search
Personalization
Semantic web
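(For one of the tasks listed above, building an inverted index, a hedged MapReduce sketch; using the file name as document id and the class names are assumptions, and duplicate postings are kept for brevity.)

```java
import java.io.IOException;
import java.util.StringJoiner;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Map: emit (term, documentId) for every term on a line. */
class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text term = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumption: the input file name serves as the document id.
        docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
        for (String token : line.toString().toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            term.set(token);
            context.write(term, docId);
        }
    }
}

/** Reduce: collect every document id seen for a term (no de-duplication here). */
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        StringJoiner postings = new StringJoiner(",");
        for (Text id : docIds) {
            postings.add(id.toString());
        }
        context.write(term, new Text(postings.toString()));
    }
}
```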
Interest from industry




We're opening up shop.
Experiences: Data Science
DevOps · Programming algorithms · Domain knowledge
Experience: How we embrace Hadoop
Parallelism has never been easy… so we teach!
   December 2010: hackathon (~50 participants - full)
   April 2011: Workshop for Bioinformaticians
   November 2011: 2 day PhD course (~60 participants – full)
   June 2012: 1 day PhD course

The data scientist is still in school... so we fill the gap!
   DevOps maintain the system, fix bugs, develop new functionality
   Technical consultants learn how to efficiently implement
     algorithms
http://www.nlhug.org/
Final thoughts
Hadoop is the first to provide large-scale computing on commodity hardware
   Hadoop is not the only option
   Hadoop is probably not the best option
   Hadoop has momentum
What degree of diversification of infrastructure should we
 embrace?
   MapReduce fits surprisingly well as a programming model for
    data-parallelism
Where is the data scientist?
   Teach. A lot. And work together.