Hadoop - Large scale data analysis
Abhijit Sharma
9/8/2011
Big Data Trends
  - Unprecedented growth in data set size: Facebook 21+ PB data warehouse, 12+ TB/day
  - Unstructured and semi-structured data: logs, documents, graphs
  - Connected data: web, tags, graphs
  - Relevant to enterprises: logs, social media, machine-generated data, breaking of silos
Putting Big Data to work
  - Data-driven organization: decision support, new offerings
  - Analytics on large data sets (FB Insights: Page, App, etc. stats)
  - Data mining: clustering (Google News articles)
  - Search: Google
Problem characteristics and examples
  - Embarrassingly data-parallel problems
  - Data chunked and distributed across the cluster
  - Parallel processing with data locality: the task is dispatched to where the data is
  - Horizontal (linear) scaling approach using commodity hardware
  - Write once, read many
  - Examples
      - Distributed logs: grep, number of accesses per URL
      - Search: term vector generation, reverse links
What is Hadoop?
  - Open source system for large-scale, batch, distributed computing on big data
  - Map Reduce programming paradigm and framework
  - Map Reduce infrastructure
  - Distributed file system (HDFS)
  - Endorsed/used extensively by web giants: Google, FB, Yahoo!
Map Reduce - Definition
  - MapReduce is a programming model, and an implementation of it, for parallel processing of large data sets
  - Map processes each logical record in an input split to generate a set of intermediate key/value pairs
  - Reduce merges all intermediate values associated with the same intermediate key
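The definition above can be sketched as a small in-memory engine. This is only an illustration of the model, not Hadoop's implementation: `run_mapreduce`, `url_mapper`, `count_reducer`, and the log format are hypothetical names and assumptions introduced here.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more intermediate (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: merge all values that share the same intermediate key.
    return {key: reducer(key, values) for key, values in groups.items()}

def url_mapper(log_line):
    # Assumed (hypothetical) log format: "<client> <url>"
    _client, url = log_line.split()
    yield url, 1

def count_reducer(_url, counts):
    return sum(counts)

access_log = [
    "10.0.0.1 /index.html",
    "10.0.0.2 /about.html",
    "10.0.0.3 /index.html",
]
hits = run_mapreduce(access_log, url_mapper, count_reducer)
print(hits)  # {'/index.html': 2, '/about.html': 1}
```

In a real cluster the grouping step is a distributed shuffle between machines; here a dictionary stands in for it.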
Map Reduce - Functional Programming Origins
  - Map: apply a function to each list member (parallelizable)
        [1, 2, 3].collect { it * it }
        Output: [1, 2, 3] -> Map (Square): [1, 4, 9]
  - Reduce: fold the list into a single value by applying a function to an accumulator and each list member in turn
        [1, 2, 3].inject(0) { sum, item -> sum + item }
        Output: [1, 2, 3] -> Reduce (Sum): 6
  - Map & Reduce
        [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
        Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum): 14
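For readers who do not use Groovy, the same three expressions can be written with Python's built-in `map` and `functools.reduce`. This is a direct translation of the slide's examples; the variable names are chosen here for illustration.

```python
from functools import reduce

numbers = [1, 2, 3]

# Map: apply a function to each list member.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the list into one value, starting from the accumulator 0.
total = reduce(lambda acc, item: acc + item, numbers, 0)

# Map & Reduce: square each member, then sum the squares.
sum_of_squares = reduce(lambda acc, item: acc + item,
                        map(lambda x: x * x, numbers), 0)

print(squares)         # [1, 4, 9]
print(total)           # 6
print(sum_of_squares)  # 14
```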
Word Count - Shell
    cat *  |  grep  |  sort            |  uniq -c
    input  |  map   |  shuffle & sort  |  reduce
Word Count - Map Reduce
Word Count - Pseudo code
    mapper (filename, file-contents):
      for each word in file-contents:
        emit (word, 1)   // single count for a word, e.g. ("the", 1) for each occurrence of "the"

    reducer (word, values):   // values iterates over the list of counts for a word, e.g. ("the", [1, 1, ...])
      sum = 0
      for each value in values:
        sum = sum + value
      emit (word, sum)
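A runnable sketch of the pseudocode above, in plain Python rather than the Hadoop API. It mirrors the shell analogy by sorting the intermediate pairs before grouping them. Whitespace tokenization, the `word_count` driver, and the sample files are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def mapper(filename, contents):
    # Emit (word, 1) for every occurrence of a word.
    for word in contents.split():
        yield word, 1

def reducer(word, values):
    # Sum the list of counts collected for one word.
    return word, sum(values)

def word_count(files):
    # Map: run the mapper over every (filename, contents) record.
    pairs = [pair for name, text in files.items() for pair in mapper(name, text)]
    # Shuffle & sort: bring identical keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: one reducer call per distinct word.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

files = {"a.txt": "the quick fox", "b.txt": "the lazy dog saw the fox"}
counts = word_count(files)
print(counts["the"], counts["fox"], counts["dog"])  # 3 2 1
```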
Examples - Map Reduce Definitions
  - Word count / distributed log search for number of accesses to various URLs
      - Map: emits (word or URL, 1) for each doc/log split
      - Reduce: sums the counts for a specific word or URL
  - Term vector generation: term -> [doc-id]
      - Map: emits (term, doc-id) for each doc split
      - Reduce: identity reducer; accumulates (term, [doc-id, doc-id, ...])
  - Reverse links: source -> target becomes target -> source
      - Map: emits (target, source) for each doc split
      - Reduce: identity reducer; accumulates (target, [source, source, ...])
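The term vector and reverse links examples both rely on an identity reducer: the grouping done by the shuffle is already the answer. A minimal sketch, assuming a hypothetical in-memory `shuffle` helper and made-up sample documents and links:

```python
from collections import defaultdict

def shuffle(pairs):
    """Group values by key; with an identity reducer this is the final output."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def term_mapper(doc_id, text):
    # Emit (term, doc-id) once per distinct term in the document.
    for term in set(text.split()):
        yield term, doc_id

def link_mapper(source, targets):
    # Emit (target, source) to invert each source -> target link.
    for target in targets:
        yield target, source

docs = {"d1": "hadoop map reduce", "d2": "hadoop hdfs"}
links = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}

term_index = shuffle(
    pair for doc_id, text in docs.items() for pair in term_mapper(doc_id, text)
)
reverse_links = shuffle(
    pair for source, targets in links.items() for pair in link_mapper(source, targets)
)

print(term_index["hadoop"])    # ['d1', 'd2']
print(reverse_links["c.html"])  # ['a.html', 'b.html']
```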
Map Reduce - Hadoop Implementation
  - Hides the complexity of distributed computing
  - Automatic parallelization of jobs
  - Automatic data chunking and distribution (via HDFS)
  - Data locality: MR tasks are dispatched to where the data is
  - Fault tolerant to server, storage, and network failures
  - Network and disk transfer optimization
  - Load balancing
Hadoop Map Reduce Architecture
HDFS Characteristics
  - Very large files: block size 64 MB/128 MB
  - Data access pattern: write once, read many
  - Writes are large; create and append only
  - Reads are large and streaming
  - Commodity hardware
  - Tolerant to failure: server, storage, network
  - Highly available through transparent replication
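To make the block sizes concrete, the arithmetic below shows how many blocks a file occupies and how much raw storage replication consumes. The slide gives no replication factor; a factor of 3 is assumed here purely for illustration, and `block_count` and `raw_storage` are hypothetical helper names.

```python
MB = 1024 * 1024
GB = 1024 * MB

def block_count(file_size, block_size=64 * MB):
    """Blocks needed for a file: ceiling division (0 for an empty file)."""
    return -(-file_size // block_size)

def raw_storage(file_size, replication=3):
    """Bytes stored across the cluster when every block is replicated (factor assumed)."""
    return file_size * replication

one_gb = 1 * GB
print(block_count(one_gb))            # 16 blocks of 64 MB
print(block_count(one_gb, 128 * MB))  # 8 blocks of 128 MB
print(block_count(200 * MB))          # 4 blocks, the last one only partly filled
print(raw_storage(one_gb) // GB)      # 3 GB of raw storage at the assumed factor
```

Larger blocks mean fewer entries for the file system to track per file, which suits the large streaming reads listed above.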
HDFS Architecture
Thanks
Backup Slides
Map & Reduce Functions
Job Configuration
Hadoop Map Reduce Components
  - Job Tracker: tracks MR jobs; runs on the master node
  - Task Tracker
      - Runs on data nodes and tracks the Mapper and Reducer tasks assigned to the node
      - Sends heartbeats to the Job Tracker
      - Maintains and picks up tasks from a queue
HDFS
  - Name Node
      - Manages the file system namespace and regulates client access to files; stores metadata
      - Maps blocks to Data Nodes and replicas
      - Manages replication
      - Executes file system namespace operations such as opening, closing, and renaming files and directories
  - Data Node
      - One per node; manages the local storage attached to that node
      - Internally, a file is split into one or more blocks, and these blocks are stored across a set of Data Nodes
      - Serves read and write requests from the file system's clients
      - Performs block creation, deletion, and replication upon instruction from the Name Node
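The division of labour above can be sketched with a toy model of the Name Node's metadata: a namespace mapping paths to blocks, and a block map recording which Data Nodes hold each replica. `ToyNameNode` is a hypothetical class, and the round-robin placement and replication factor of 3 are simplifying assumptions, not HDFS's actual placement policy.

```python
class ToyNameNode:
    """Holds metadata only; block contents would live on the Data Nodes."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        self.replication = min(replication, len(self.data_nodes))
        self.namespace = {}   # path -> list of block ids
        self.block_map = {}   # block id -> Data Nodes holding a replica
        self._next_block = 0

    def create(self, path, num_blocks):
        """Allocate blocks for a new file and choose replica locations."""
        block_ids = []
        for _ in range(num_blocks):
            block_id = self._next_block
            self._next_block += 1
            # Assumed round-robin placement, for illustration only.
            start = block_id % len(self.data_nodes)
            self.block_map[block_id] = [
                self.data_nodes[(start + i) % len(self.data_nodes)]
                for i in range(self.replication)
            ]
            block_ids.append(block_id)
        self.namespace[path] = block_ids
        return block_ids

    def rename(self, old_path, new_path):
        """A namespace operation: only metadata changes, no block moves."""
        self.namespace[new_path] = self.namespace.pop(old_path)

    def locate(self, path):
        """Tell a client which Data Nodes to contact for each block."""
        return [(b, self.block_map[b]) for b in self.namespace[path]]

name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
name_node.create("/logs/a.log", num_blocks=2)
name_node.rename("/logs/a.log", "/logs/b.log")
locations = name_node.locate("/logs/b.log")
print(locations)
# [(0, ['dn1', 'dn2', 'dn3']), (1, ['dn2', 'dn3', 'dn4'])]
```

The point of the sketch is that a rename touches only the namespace dictionary, while reads and writes of block data would go directly to the Data Nodes returned by `locate`.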


An introduction to Hadoop for large scale data analysis

  • 1. Hadoop – Large scale data analysis Abhijit Sharma Page 1 | 9/8/2011
  • 2. Unprecedented growth in Data set size - Facebook 21+ PB data warehouse, 12+ TB/day Un(semi)-structured data – logs, documents, graphs Connected data web, tags, graphs Relevant to enterprises – logs, social media, machine generated data, breaking of silos Page 2 | 9/8/2011 Big Data Trends
  • 3. Page 3 | 9/8/2011 Putting Big Data to work Data driven Org – decision support, new offerings Analytics on large data sets (FB Insights – Page, App etc stats), Data Mining – Clustering - Google News articles Search - Google
  • 4. Problem characteristics and examples: embarrassingly data-parallel problems; data chunked and distributed across the cluster; parallel processing with data locality (each task is dispatched to where its data is); horizontal/linear scaling on commodity hardware; write once, read many. Examples: distributed logs (grep, number of accesses per URL); search (term vector generation, reverse links).
  • 5. What is Hadoop? An open source system for large scale batch distributed computing on big data. Components: Map Reduce programming paradigm and framework; Map Reduce infrastructure; distributed file system (HDFS). Endorsed/used extensively by web giants (Google, FB, Yahoo!).
  • 6. Map Reduce - Definition: MapReduce is a programming model and an implementation for parallel processing of large data sets. Map processes each logical record per input split to generate a set of intermediate key/value pairs. Reduce merges all intermediate values associated with the same intermediate key.
  • 7. Map Reduce - Functional Programming Origins. Map: apply a function to each list member (parallelizable): [1, 2, 3].collect { it * it } gives [1, 2, 3] -> Map (Square) : [1, 4, 9]. Reduce: apply a function and an accumulator to each list member: [1, 2, 3].inject(0) { sum, item -> sum + item } gives [1, 2, 3] -> Reduce (Sum) : 6. Map & Reduce: [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item } gives [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) : 14.
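The Groovy one-liners on this slide carry over to most languages. As a rough equivalent in Python (chosen here only for illustration; it is not the deck's language), the built-in `map` plays the role of `collect` and `functools.reduce` the role of `inject`:

```python
from functools import reduce

numbers = [1, 2, 3]

# Map: apply a function to each element independently (parallelizable).
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the list into a single value, with an accumulator starting at 0.
total = reduce(lambda acc, item: acc + item, numbers, 0)

# Map followed by reduce: the sum of the squares.
sum_of_squares = reduce(lambda acc, item: acc + item, squares, 0)

print(squares, total, sum_of_squares)
```

Because the map step touches each element in isolation, it can be split across machines; only the reduce step needs to see values together.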
  • 8. Word Count - Shell: the pipeline cat * | grep | sort | uniq -c corresponds stage by stage to input | map | shuffle & sort | reduce.
  • 9. Word Count - Map Reduce
  • 10. Word Count - Pseudo code
      mapper (filename, file-contents):
        for each word in file-contents:
          emit (word, 1)  // single count for a word, e.g. ("the", 1) for each occurrence of "the"
      reducer (word, Iterator values):  // Iterator over the list of counts for a word, e.g. ("the", [1, 1, ..])
        sum = 0
        for each value in values:
          sum = sum + value
        emit (word, sum)
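The word count pseudo code can be turned into a small runnable sketch. The version below is a single-process Python illustration, not Hadoop code: the `shuffle` function stands in for the grouping the framework performs between the map and reduce phases, and lower-casing words is an added assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(filename: str, contents: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every occurrence of a word."""
    for word in contents.split():
        yield word.lower(), 1  # lower-casing is an assumption, not in the slide


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word: str, values: Iterable[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return word, sum(values)


def word_count(files: Dict[str, str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over in-memory 'files'."""
    intermediate = [pair for name, text in files.items() for pair in mapper(name, text)]
    grouped = shuffle(intermediate)
    return dict(reducer(word, counts) for word, counts in sorted(grouped.items()))


docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = word_count(docs)
print(counts)
```

In a real job each call to `mapper` could run on a different node, and each key's list of values would be routed to whichever node runs the reducer for that key.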
  • 11. Examples - Map Reduce Definition. Word count / distributed log search for number of accesses to various URLs: Map emits (word/URL, 1) for each doc/log split; Reduce sums up the counts for a specific word/URL. Term vector generation (term -> [doc-id]): Map emits (term, doc-id) for each doc split; Reduce is an identity reducer that accumulates (term, [doc-id, doc-id ..]). Reverse links (source -> target becomes target -> source): Map emits (target, source) for each doc split; Reduce is an identity reducer that accumulates (target, [source, source ..]).
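The reverse links example can be sketched in a few lines. This is an in-memory Python illustration under the assumption that each document is represented as a list of outgoing link targets; the names `map_links` and `reverse_links` are hypothetical. The map step swaps each pair, and the grouping by key does the rest, which is why the reducer has nothing left to compute.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_links(source: str, targets: List[str]) -> Iterator[Tuple[str, str]]:
    """For each outgoing link source -> target, emit (target, source)."""
    for target in targets:
        yield target, source


def reverse_links(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group the emitted pairs by target; the reducer passes the list through."""
    incoming: Dict[str, List[str]] = defaultdict(list)
    for source, targets in graph.items():
        for target, src in map_links(source, targets):
            incoming[target].append(src)
    # Sorting is only for stable, readable output.
    return {target: sorted(sources) for target, sources in incoming.items()}


web = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"], "c.html": ["a.html"]}
inbound = reverse_links(web)
print(inbound)
```

Term vector generation follows the same shape, with (term, doc-id) pairs in place of (target, source).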
  • 12. Map Reduce - Hadoop Implementation: hides the complexity of distributed computing; automatic parallelization of jobs; automatic data chunking and distribution (via HDFS); data locality (MR task dispatched where the data is); fault tolerant to server, storage and network failures; network and disk transfer optimization; load balancing.
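To make the data locality and load balancing points concrete, here is a toy Python scheduler. It is an illustrative sketch, not Hadoop's actual scheduling algorithm: the function `assign_tasks`, the block names, and the node names are all hypothetical. Each map task goes to the least-loaded live node that already holds a replica of its block, and falls back to any live node when no holder is available.

```python
from typing import Dict, List


def assign_tasks(block_locations: Dict[str, List[str]], nodes: List[str]) -> Dict[str, str]:
    """Assign one map task per block, preferring nodes that store the block."""
    load = {node: 0 for node in nodes}
    assignment: Dict[str, str] = {}
    for block, holders in block_locations.items():
        # Prefer live nodes that store the block locally; otherwise use any live node.
        candidates = [n for n in holders if n in load] or nodes
        chosen = min(candidates, key=lambda n: load[n])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node1", "node3"],
    "blk_3": ["node4"],  # node4 is not in the live list, so this task runs remotely
}
live_nodes = ["node1", "node2", "node3"]
plan = assign_tasks(locations, live_nodes)
print(plan)
```

The fallback branch is where fault tolerance and locality trade off: the task still runs, but its input has to cross the network.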
  • 13. Hadoop Map Reduce Architecture
  • 14.
  • 15. HDFS Architecture
  • 16. Thanks
  • 17. Backup Slides
  • 18. Map & Reduce Functions
  • 19. Job Configuration
  • 20. Hadoop Map Reduce Components. Job Tracker: tracks MR jobs; runs on the master node. Task Tracker: runs on data nodes and tracks the Mapper and Reducer tasks assigned to the node; sends heartbeats to the Job Tracker; maintains and picks up tasks from a queue.
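The interaction between the two components can be sketched as a toy simulation. The Python class below is a simplified illustration, not Hadoop's implementation: the `heartbeat` signature, the task names, and the tracker names are assumptions. A tracker reports its finished tasks and free slots, and the reply carries new tasks taken from the pending queue.

```python
from collections import deque
from typing import Deque, Dict, List


class JobTracker:
    """Toy master: holds a queue of pending tasks and hands them out on heartbeat."""

    def __init__(self, tasks: List[str]) -> None:
        self.pending: Deque[str] = deque(tasks)
        self.running: Dict[str, str] = {}  # task -> tracker currently running it

    def heartbeat(self, tracker: str, free_slots: int, finished: List[str]) -> List[str]:
        # Record the tasks this tracker reports as completed.
        for task in finished:
            self.running.pop(task, None)
        # Reply with up to `free_slots` new tasks from the queue.
        assigned: List[str] = []
        while self.pending and len(assigned) < free_slots:
            task = self.pending.popleft()
            self.running[task] = tracker
            assigned.append(task)
        return assigned


jt = JobTracker(["map-0", "map-1", "map-2", "reduce-0"])
first = jt.heartbeat("tracker-a", free_slots=2, finished=[])
second = jt.heartbeat("tracker-b", free_slots=1, finished=[])
third = jt.heartbeat("tracker-a", free_slots=2, finished=first)
print(first, second, third, jt.running)
```

A fuller model would also requeue the tasks of a tracker whose heartbeats stop arriving, which is how a master can detect and recover from node failures.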
  • 21. HDFS. Name Node: manages the file system namespace and regulates access to files by clients; stores metadata; maintains the mapping of blocks to Data Nodes and replicas; manages replication; executes file system namespace operations such as opening, closing, and renaming files and directories. Data Node: one per node, managing the local storage attached to that node; internally, a file is split into one or more blocks, and these blocks are stored in a set of Data Nodes; serves read and write requests from the file system's clients; performs block creation, deletion, and replication upon instruction from the Name Node.
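The block mapping kept by the Name Node can be illustrated with a short Python sketch. Everything here is an assumption made for the example: the 64 MB block size, the replication factor of 3, the node names, and above all the round-robin placement, which is a stand-in and not the placement policy HDFS actually uses.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size in bytes of each block for a file of `file_size` bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, data_nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Round-robin placement; replicas are distinct if replication <= len(data_nodes)."""
    placement: Dict[int, List[str]] = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


MB = 1024 * 1024
blocks = split_into_blocks(file_size=150 * MB, block_size=64 * MB)
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
print([b // MB for b in blocks], block_map)
```

The resulting `block_map` is the kind of metadata a client would consult before reading: it asks where a block lives, then fetches the bytes directly from one of the listed Data Nodes.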