SlideShare une entreprise Scribd logo
1  sur  14
Télécharger pour lire hors ligne
Hadoop: Playing with data, at scale


If you have lot of data to process….
           What should you know?

Mahesh Tiyyagura

25th November, Bangalore
Mahesh Tiyyagura

Email: tmahesh@gmail.com
http://www.twitter.com/tmahesh

Work on large scale crawling and extraction of structured data from the web.
Used Hadoop at Yahoo! to run machine learning algorithms and analyzing click logs
Hadoop
•    Massively scalable storage and batch data processing system

•    Its all about Scale……
       –  Scaling hardware infrastructure (horizontal scaling)
       –  Scaling operations and maintenance (handling failures)
       –  Scaling developer productivity (keep it simple)
Numbers you should know…
•    You can store say, 10TB of data per node
•    1 Disk: 75MB/sec (sequential read)
•    Say, you want to process, 200GB of data
•    That’s is ~ 1 hour to just read the data!!
•    Processing data (CPU) is much much faster (say, 10x)
•    To remove the bottleneck, we need to read data in parallel
•    Read from 100 Disks in parallel: 7.5GB/sec!!
•    Insight: Move computation, NOT data

•    Oh! BTW, Data should not (and cannot) reside on only one node
•    In a 1000 node cluster, you can expect ~10 failures per week
•    For peace of mind, Reliability should be handled by software


Hadoop is designed to address these issues.
The Platform, in brief…
•    HDFS: Storing Data
      –  Data spilt into multiple blocks across nodes
      –  Replication protects data from failures
      –  A master node orchestrates the read/write requests (without being a bottleneck!!)
      –  Scales linearly… 4TB of raw disk translates to ~ 1TB of storage (tunable)

•    MapReduce (MR): Processing Data
      –  A beautiful abstraction; asks user to implement just 2 functions (Map and Reduce)
      –  You don’t need no knowledge of network IO, node failures, checkpoints, distributed
         what??
      –  Most of the data processing jobs can be mapped into MapReduce Abstraction
      –  Data processed locally, in parallel. Reliability is implicit.
      –  A giant merge sort infrastructure does the magic

      Will revisit this slide. Something's are better understood in retrospect.
HDFS
MR: Programming Model
•    Map function: (key, value) -> (key1, value1) list
•    Reduce function: (key1, value1 list) -> key1, output

•    Examples:
      –  map(k, v) -> emit (k, v.toUpper())
      –  map(k, v) -> foreach c in v; do emit(k, c) done;

      –  reduce(k, vals) -> foreach v in vals; do sum+= v; done emit(k, sum)
MAPREDUCE
Thinking in MapReduce ….
•    Word Count Example
      –  map(docid, text) -> foreach word in text.split(); do emit(word, 1); done
      –  reduce(word, counts list) -> foreach count in counts; do sum+=count; done; emit(word,
         sum)




•    Document search index (Inverted index)
      –  map(docid, html) -> foreach term in getTerms(html); do emit(term, docid); done
      –  reduce(term, docid list) -> emit(term, docid list);.
Thinking in MapReduce ….

•    All the anchor text to a page
      –  map(docid, html) -> foreach link in getLinks(html); do emit(link, anchorText); done
      –  Reduce(link, anchorText list) -> emit(link, anchorText list);




•    Image resize
      –  Map(imgid, image) -> emit(imgid, image.resize());
      –  No need for reduce
Hadoop Streaming Demo
•    cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile

•    Each line in InputFile written to stdin of shellMapper.sh
•    A line on stdout of shellMapper.sh; split into key, value (before first tab is key, rest is value)
•    Each key, value pair is fed as line into stdin of shellReducer.sh
•    A line on stdout of shellReducer written to outputFile

•    hadoop jar hadoop-streaming.jar 
       -input myInputDirs 
      -output myOutputDir 
      -mapper /bin/cat 
      -reducer /bin/wc

•    More Info: http://hadoop.apache.org/common/docs/r0.18.2/streaming.html

•    Will share the terminal session now… for DEMO
Brief intro to PIG
•    Adhoc data analysis
•    An abstraction over mapreduce
•    Think of it as a stdlib for mapreduce
•    Supports common data processing operators (join, group by)
•    A high level language for data processing

 PIG demo – switch to terminal

•    Also try… HIVE, exposes a SQL like interface over HDFS data

HIVE demo – switch to terminal
HadoopThe Hadoop Java Software Framework

Contenu connexe

Tendances

Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Zekeriya Besiroglu
 
Hadoop 101 - Big Data Technology
Hadoop 101 - Big Data TechnologyHadoop 101 - Big Data Technology
Hadoop 101 - Big Data TechnologyFirman Gautama
 
Geek camp
Geek campGeek camp
Geek campjdhok
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at myliferesponseteam
 
Scalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopScalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopDenis Shestakov
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceDenis Shestakov
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
Hadoop hbase introduction
Hadoop hbase introductionHadoop hbase introduction
Hadoop hbase introductionJakub Stransky
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 

Tendances (20)

Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
 
Hadoop 101 - Big Data Technology
Hadoop 101 - Big Data TechnologyHadoop 101 - Big Data Technology
Hadoop 101 - Big Data Technology
 
Hadoop - How It Works
Hadoop - How It WorksHadoop - How It Works
Hadoop - How It Works
 
Geek camp
Geek campGeek camp
Geek camp
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Hadoop at datasift
Hadoop at datasiftHadoop at datasift
Hadoop at datasift
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
 
Scalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopScalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with Hadoop
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Hadoop at datasift
Hadoop at datasiftHadoop at datasift
Hadoop at datasift
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Hadoop 101 v2
Hadoop 101 v2Hadoop 101 v2
Hadoop 101 v2
 
JOSA TechTalks - Big Data on Hadoop
JOSA TechTalks - Big Data on HadoopJOSA TechTalks - Big Data on Hadoop
JOSA TechTalks - Big Data on Hadoop
 
Hadoop hbase introduction
Hadoop hbase introductionHadoop hbase introduction
Hadoop hbase introduction
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
מיכאל
מיכאלמיכאל
מיכאל
 
Cache and Drupal
Cache and DrupalCache and Drupal
Cache and Drupal
 

Similaire à HadoopThe Hadoop Java Software Framework

Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerIke Ellis
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Whirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveWhirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveEdward Capriolo
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014Hassan Islamov
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 

Similaire à HadoopThe Hadoop Java Software Framework (20)

Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Whirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveWhirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIve
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 

Plus de ThoughtWorks

Online and Publishing casestudies
Online and Publishing casestudiesOnline and Publishing casestudies
Online and Publishing casestudiesThoughtWorks
 
Insurecom Case Study
Insurecom Case StudyInsurecom Case Study
Insurecom Case StudyThoughtWorks
 
Grameen Case Study
Grameen Case StudyGrameen Case Study
Grameen Case StudyThoughtWorks
 
Construction Techniques For Domain Specific Languages
Construction Techniques For Domain Specific LanguagesConstruction Techniques For Domain Specific Languages
Construction Techniques For Domain Specific LanguagesThoughtWorks
 
Concurrency patterns in Ruby
Concurrency patterns in RubyConcurrency patterns in Ruby
Concurrency patterns in RubyThoughtWorks
 
Concurrency patterns in Ruby
Concurrency patterns in RubyConcurrency patterns in Ruby
Concurrency patterns in RubyThoughtWorks
 
Lets build-ruby-app-server: Vineet tyagi
Lets build-ruby-app-server: Vineet tyagiLets build-ruby-app-server: Vineet tyagi
Lets build-ruby-app-server: Vineet tyagiThoughtWorks
 
Ruby on Rails versus Django - A newbie Web Developer's Perspective -Shreyank...
 Ruby on Rails versus Django - A newbie Web Developer's Perspective -Shreyank... Ruby on Rails versus Django - A newbie Web Developer's Perspective -Shreyank...
Ruby on Rails versus Django - A newbie Web Developer's Perspective -Shreyank...ThoughtWorks
 
Nick Sieger-Exploring Rails 3 Through Choices
Nick Sieger-Exploring Rails 3 Through Choices Nick Sieger-Exploring Rails 3 Through Choices
Nick Sieger-Exploring Rails 3 Through Choices ThoughtWorks
 
Present and Future of Programming Languages - ola bini
Present and Future of Programming Languages - ola biniPresent and Future of Programming Languages - ola bini
Present and Future of Programming Languages - ola biniThoughtWorks
 
The ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThe ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThoughtWorks
 
Ruby 124C41+ - Matz
Ruby 124C41+  - MatzRuby 124C41+  - Matz
Ruby 124C41+ - MatzThoughtWorks
 
Mac ruby to the max - Brendan G. Lim
Mac ruby to the max - Brendan G. LimMac ruby to the max - Brendan G. Lim
Mac ruby to the max - Brendan G. LimThoughtWorks
 
Project Fedena and Why Ruby on Rails - ArvindArvind G S
Project Fedena and Why Ruby on Rails - ArvindArvind G SProject Fedena and Why Ruby on Rails - ArvindArvind G S
Project Fedena and Why Ruby on Rails - ArvindArvind G SThoughtWorks
 
Glass fish rubyconf-india-2010-Arun gupta
Glass fish rubyconf-india-2010-Arun gupta Glass fish rubyconf-india-2010-Arun gupta
Glass fish rubyconf-india-2010-Arun gupta ThoughtWorks
 
Aman kingrubyoo pnew
Aman kingrubyoo pnew Aman kingrubyoo pnew
Aman kingrubyoo pnew ThoughtWorks
 
Bootstrapping iPhone Development
Bootstrapping iPhone DevelopmentBootstrapping iPhone Development
Bootstrapping iPhone DevelopmentThoughtWorks
 
DSL Construction rith Ruby
DSL Construction rith RubyDSL Construction rith Ruby
DSL Construction rith RubyThoughtWorks
 

Plus de ThoughtWorks (20)

Online and Publishing casestudies
Online and Publishing casestudiesOnline and Publishing casestudies
Online and Publishing casestudies
 
Insurecom Case Study
Insurecom Case StudyInsurecom Case Study
Insurecom Case Study
 
Grameen Case Study
Grameen Case StudyGrameen Case Study
Grameen Case Study
 
BFSI Case Sudies
BFSI Case SudiesBFSI Case Sudies
BFSI Case Sudies
 
Construction Techniques For Domain Specific Languages
Construction Techniques For Domain Specific LanguagesConstruction Techniques For Domain Specific Languages
Construction Techniques For Domain Specific Languages
 
Concurrency patterns in Ruby
Concurrency patterns in RubyConcurrency patterns in Ruby
Concurrency patterns in Ruby
 
Concurrency patterns in Ruby
Concurrency patterns in RubyConcurrency patterns in Ruby
Concurrency patterns in Ruby
 
Lets build-ruby-app-server: Vineet tyagi
Lets build-ruby-app-server: Vineet tyagiLets build-ruby-app-server: Vineet tyagi
Lets build-ruby-app-server: Vineet tyagi
 
Ruby on Rails versus Django - A newbie Web Developer's Perspective -Shreyank...
 Ruby on Rails versus Django - A newbie Web Developer's Perspective -Shreyank... Ruby on Rails versus Django - A newbie Web Developer's Perspective -Shreyank...
Ruby on Rails versus Django - A newbie Web Developer's Perspective -Shreyank...
 
Nick Sieger-Exploring Rails 3 Through Choices
Nick Sieger-Exploring Rails 3 Through Choices Nick Sieger-Exploring Rails 3 Through Choices
Nick Sieger-Exploring Rails 3 Through Choices
 
Present and Future of Programming Languages - ola bini
Present and Future of Programming Languages - ola biniPresent and Future of Programming Languages - ola bini
Present and Future of Programming Languages - ola bini
 
The ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj KumarThe ruby on rails i18n core api-Neeraj Kumar
The ruby on rails i18n core api-Neeraj Kumar
 
Ruby 124C41+ - Matz
Ruby 124C41+  - MatzRuby 124C41+  - Matz
Ruby 124C41+ - Matz
 
Mac ruby to the max - Brendan G. Lim
Mac ruby to the max - Brendan G. LimMac ruby to the max - Brendan G. Lim
Mac ruby to the max - Brendan G. Lim
 
Project Fedena and Why Ruby on Rails - ArvindArvind G S
Project Fedena and Why Ruby on Rails - ArvindArvind G SProject Fedena and Why Ruby on Rails - ArvindArvind G S
Project Fedena and Why Ruby on Rails - ArvindArvind G S
 
Glass fish rubyconf-india-2010-Arun gupta
Glass fish rubyconf-india-2010-Arun gupta Glass fish rubyconf-india-2010-Arun gupta
Glass fish rubyconf-india-2010-Arun gupta
 
Aman kingrubyoo pnew
Aman kingrubyoo pnew Aman kingrubyoo pnew
Aman kingrubyoo pnew
 
Bootstrapping iPhone Development
Bootstrapping iPhone DevelopmentBootstrapping iPhone Development
Bootstrapping iPhone Development
 
DSL Construction rith Ruby
DSL Construction rith RubyDSL Construction rith Ruby
DSL Construction rith Ruby
 
Cloud Computing
Cloud  ComputingCloud  Computing
Cloud Computing
 

Dernier

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 

Dernier (20)

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

HadoopThe Hadoop Java Software Framework

  • 1.
  • 2. Hadoop: Playing with data, at scale If you have lot of data to process…. What should you know? Mahesh Tiyyagura 25th November, Bangalore
  • 3. Mahesh Tiyyagura Email: tmahesh@gmail.com http://www.twitter.com/tmahesh Work on large scale crawling and extraction of structured data from the web. Used Hadoop at Yahoo! to run machine learning algorithms and analyzing click logs
  • 4. Hadoop •  Massively scalable storage and batch data processing system •  Its all about Scale…… –  Scaling hardware infrastructure (horizontal scaling) –  Scaling operations and maintenance (handling failures) –  Scaling developer productivity (keep it simple)
  • 5. Numbers you should know… •  You can store say, 10TB of data per node •  1 Disk: 75MB/sec (sequential read) •  Say, you want to process, 200GB of data •  That’s is ~ 1 hour to just read the data!! •  Processing data (CPU) is much much faster (say, 10x) •  To remove the bottleneck, we need to read data in parallel •  Read from 100 Disks in parallel: 7.5GB/sec!! •  Insight: Move computation, NOT data •  Oh! BTW, Data should not (and cannot) reside on only one node •  In a 1000 node cluster, you can expect ~10 failures per week •  For peace of mind, Reliability should be handled by software Hadoop is designed to address these issues.
  • 6. The Platform, in brief… •  HDFS: Storing Data –  Data spilt into multiple blocks across nodes –  Replication protects data from failures –  A master node orchestrates the read/write requests (without being a bottleneck!!) –  Scales linearly… 4TB of raw disk translates to ~ 1TB of storage (tunable) •  MapReduce (MR): Processing Data –  A beautiful abstraction; asks user to implement just 2 functions (Map and Reduce) –  You don’t need no knowledge of network IO, node failures, checkpoints, distributed what?? –  Most of the data processing jobs can be mapped into MapReduce Abstraction –  Data processed locally, in parallel. Reliability is implicit. –  A giant merge sort infrastructure does the magic Will revisit this slide. Something's are better understood in retrospect.
  • 8. MR: Programming Model •  Map function: (key, value) -> (key1, value1) list •  Reduce function: (key1, value1 list) -> key1, output •  Examples: –  map(k, v) -> emit (k, v.toUpper()) –  map(k, v) -> foreach c in v; do emit(k, c) done; –  reduce(k, vals) -> foreach v in vals; do sum+= v; done emit(k, sum)
  • 10. Thinking in MapReduce …. •  Word Count Example –  map(docid, text) -> foreach word in text.split(); do emit(word, 1); done –  reduce(word, counts list) -> foreach count in counts; do sum+=count; done; emit(word, sum) •  Document search index (Inverted index) –  map(docid, html) -> foreach term in getTerms(html); do emit(term, docid); done –  reduce(term, docid list) -> emit(term, docid list);.
  • 11. Thinking in MapReduce …. •  All the anchor text to a page –  map(docid, html) -> foreach link in getLinks(html); do emit(link, anchorText); done –  Reduce(link, anchorText list) -> emit(link, anchorText list); •  Image resize –  Map(imgid, image) -> emit(imgid, image.resize()); –  No need for reduce
  • 12. Hadoop Streaming Demo •  cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile •  Each line in InputFile written to stdin of shellMapper.sh •  A line on stdout of shellMapper.sh; split into key, value (before first tab is key, rest is value) •  Each key, value pair is fed as line into stdin of shellReducer.sh •  A line on stdout of shellReducer written to outputFile •  hadoop jar hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc •  More Info: http://hadoop.apache.org/common/docs/r0.18.2/streaming.html •  Will share the terminal session now… for DEMO
  • 13. Brief intro to PIG •  Adhoc data analysis •  An abstraction over mapreduce •  Think of it as a stdlib for mapreduce •  Supports common data processing operators (join, group by) •  A high level language for data processing PIG demo – switch to terminal •  Also try… HIVE, exposes a SQL like interface over HDFS data HIVE demo – switch to terminal