Map Reduce along with Amazon EMR
Sampath Rachakonda & Siva Krishna Battu
Bigdata Analytics on Cloud Meetup
14th March 2015
http://www.meetup.com/abctalks
Agenda
• Introduction to BigData and Hadoop MapReduce
• Core Hadoop and its Ecosystem
• Use Cases
• Hadoop Installation on Windows/Ubuntu
• Workflow on MapReduce
• M-R on EMR
• Real-time examples on EMR
• What's Next? Hadoop 2.0!
Introduction to BigData
• Big data is the latest buzzword, but the data itself is also an opportunity
BigData – What & How
• Extremely large datasets that are hard to deal with using relational databases
    • Storage/Cost
    • Search/Performance
    • Analytics and Visualization
• Need for parallel processing on hundreds of machines
• ETL cannot complete within a reasonable amount of time
    • Once a daily load takes more than 24 hours, it never catches up
Solution to handle BigData
• Distributed File System
• System shall manage and heal itself
    • Automatically route around failure
    • Speculatively execute redundant tasks based on performance
• Performance scales linearly
    • Proportional change in capacity with resource change
• Compute should move to data
    • Lower latency, lower bandwidth
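The self-healing bullet above mentions speculatively executing redundant tasks. A minimal Python sketch of that idea follows. The `Task` fields, the `slowdown` threshold, and the selection rule are illustrative assumptions, not Hadoop's actual scheduler heuristic.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Task:
    """A running task as a scheduler might see it (hypothetical fields)."""
    task_id: str
    progress: float   # fraction of work done, 0.0 to 1.0
    elapsed_s: float  # seconds since the task started


def progress_rate(task: Task) -> float:
    """Progress per second; higher is faster."""
    return task.progress / task.elapsed_s


def pick_speculative(tasks: list[Task], slowdown: float = 0.5) -> list[str]:
    """Return ids of tasks running slower than `slowdown` x the mean rate.

    A scheduler would launch a duplicate of each returned task on another
    node and keep whichever copy finishes first.
    """
    running = [t for t in tasks if t.progress < 1.0 and t.elapsed_s > 0]
    if len(running) < 2:
        return []  # nothing to compare against
    avg = mean(progress_rate(t) for t in running)
    return [t.task_id for t in running if progress_rate(t) < slowdown * avg]


tasks = [
    Task("map-0", progress=0.90, elapsed_s=60),
    Task("map-1", progress=0.85, elapsed_s=60),
    Task("map-2", progress=0.20, elapsed_s=60),  # straggler
]
stragglers = pick_speculative(tasks)
print(stragglers)
```

Only the straggler is flagged here, so a single slow or failing machine does not hold back the whole job.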
Introduction to Apache Hadoop
• What is Hadoop?
  A scalable, fault-tolerant grid operating system for data storage and processing.
• Open Source, Apache License
• Works with Structured and Unstructured Data
• HDFS: Fault-tolerant, high-bandwidth clustered storage
• Commodity Hardware
• Master (NameNode) – Slave Architecture
• MapReduce: Distributed Data Processing
Hadoop Cluster
• A set of “cheap” commodity hardware
• Networked together
• Resides in one location, in a set of racks in a data centre
• No supercomputers; use unreliable commodity hardware
Hadoop System Principles
• Scale out rather than scale up
• Bring code to data rather than data to code
• Deal with failures; they are common
• Abstract the complexity of distributed and concurrent applications
RDBMS Vs Hadoop
• Before Hadoop, many applications used an RDBMS such as Oracle, MySQL, or Sybase for batch processing
• Hadoop doesn't fully replace the RDBMS in the architecture
• RDBMS products scale up rather than scale out, with limitations at 100s of terabytes
• Structured vs unstructured data
• Offline batch (Hadoop) vs online transactions (RDBMS)
Hadoop + RDBMS Complement Each Other
• For example, a small website with a small number of users generating a large amount of audit logs:
• WebServer (1) --> RDBMS --> (2) & (4) --> Hadoop (3)
• (1) Use the RDBMS for a rich user interface and to enforce data integrity
• (2) The RDBMS generates lots of audit logs; the logs are moved periodically to the Hadoop cluster
• (3) All logs are kept and processed in Hadoop for various analytics
• (4) Results from the Hadoop cluster are stored back in the RDBMS to be used by the web server, e.g. suggestions based on audit history
Hadoop Ecosystem
• Hadoop mainly comprises two core components:
    • HDFS (Hadoop Distributed File System) to store data
    • MapReduce (distributed data processing framework) to process data
HDFS (Hadoop Distributed File System)
• A scalable, fault-tolerant, high-performance distributed file system
• Asynchronous Replication
• Write-Once Read-Many (WORM)
• Hadoop cluster with 3 DataNodes minimum
• Data divided into 64 MB (Hadoop 1.x default) or 128 MB (Hadoop 2.x default) blocks, each block replicated 3 times by default
• No RAID required for DataNode
• Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP
• NameNode holds the file system metadata
• Files are broken up and spread over DataNodes
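To make the block size and replication numbers above concrete, here is a small Python sketch of the arithmetic. The function name `hdfs_footprint` is made up for illustration; the 64 MB / 128 MB block sizes and the replication factor of 3 come from the slide.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_bytes: int, block_mb: int = 64, replication: int = 3):
    """Return (number of blocks, number of block replicas, raw bytes stored).

    HDFS does not pad the last block, so raw storage is simply the file
    size times the replication factor.
    """
    block_bytes = block_mb * MB
    blocks = math.ceil(file_bytes / block_bytes) if file_bytes else 0
    return blocks, blocks * replication, file_bytes * replication


one_gb = 1024 * MB
print(hdfs_footprint(one_gb, block_mb=64))    # (16, 48, 3221225472)
print(hdfs_footprint(one_gb, block_mb=128))   # (8, 24, 3221225472)
```

A 1 GB file therefore becomes 16 blocks (48 replicas) at 64 MB, or 8 blocks (24 replicas) at 128 MB, and occupies 3 GB of raw disk either way. Larger blocks mean fewer entries for the NameNode to track.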
MapReduce (Distributed data processing framework)
• Software framework for distributed computation
• Input | map() | copy/sort | reduce() | Output
• JobTracker schedules and manages jobs
• TaskTracker executes individual map() and reduce() tasks on each cluster node
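The Input | map() | copy/sort | reduce() | Output pipeline can be imitated on a single machine. The Python sketch below is only a teaching aid: `run_job`, the sample log lines, and the two functions are hypothetical, and the real framework spreads each phase across many nodes.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run map -> copy/sort -> reduce over an in-memory list of records."""
    # Map: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Copy/sort: group values by key and order the keys, which is what the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: one call per key with all of that key's values.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


# Hypothetical audit-log lines in the form "user action".
logs = ["alice login", "bob login", "alice view", "alice logout", "bob view"]


def map_user(line):
    user, _action = line.split()
    yield user, 1


def reduce_count(user, ones):
    return user, sum(ones)


result = run_job(logs, map_user, reduce_count)
print(result)   # [('alice', 3), ('bob', 2)]
```

The programmer writes only `map_user` and `reduce_count`; everything inside `run_job` is what the framework provides.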
HDFS – Read File
HDFS – Write File
MapReduce - Executing File
• The client program is copied to each node
• The JobTracker determines the number of splits from the input path, then selects TaskTrackers based on their network proximity to the data sources
• The JobTracker sends task requests to the selected TaskTrackers
• Each TaskTracker starts the map phase by extracting the input data from its splits
• Once a map task completes, the TaskTracker notifies the JobTracker
• When all TaskTrackers have completed the map phase, the JobTracker notifies the selected TaskTrackers to start the reduce phase
• Each TaskTracker reads the region files remotely and invokes the reduce function, which collects the key/aggregated value into the output file (one per reducer node)
• After both the map and reduce phases are completed, the JobTracker unblocks the client program
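The split-assignment step above can be sketched as follows. This is a simplified Python illustration with made-up node names and a made-up `assign_splits` helper. A real JobTracker hands out tasks in response to TaskTracker heartbeats and also takes rack locality into account, which this sketch ignores.

```python
def assign_splits(split_locations, trackers):
    """Assign each input split to a TaskTracker, preferring data-local nodes.

    split_locations maps a split id to the nodes holding a replica of it.
    Ties are broken by current load, then by node name.
    """
    load = {t: 0 for t in trackers}
    assignment = {}
    for split, replicas in sorted(split_locations.items()):
        local = [t for t in trackers if t in replicas]
        candidates = local or trackers  # fall back to any node if none is local
        chosen = min(candidates, key=lambda t: (load[t], t))
        assignment[split] = chosen
        load[chosen] += 1
    return assignment


trackers = ["node-a", "node-b", "node-c"]
split_locations = {
    "split-0": {"node-a", "node-b"},
    "split-1": {"node-a", "node-c"},
    "split-2": {"node-b", "node-c"},
    "split-3": {"node-d"},  # replica lives on a node with no TaskTracker
}
plan = assign_splits(split_locations, trackers)
for split, node in plan.items():
    print(split, "->", node)
```

The first three splits run on a node that already holds their data. Only `split-3` has to be read over the network, which is the cost that "bring code to data" tries to avoid.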
Java MapReduce Example
• Let us go with the basic word count example, which makes the workflow easy to understand
• Let us now dive into the word count demo and see how the mapper and reducer functions work
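The Java code from the live demo is not part of these slides. As a stand-in, here is a minimal word count sketch in Python, written in the style of Hadoop Streaming (tab-separated key/value lines). The sample text is invented, and `sorted()` plays the role of the framework's shuffle/sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit a 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


text = ["the quick brown fox", "the lazy dog", "The fox"]
# sorted() stands in for the framework's shuffle/sort between the phases.
counts = list(reducer(sorted(mapper(text))))
print("\n".join(counts))
```

The reducer relies on its input being sorted by key, so all lines for one word arrive together. That ordering is exactly what the copy/sort phase guarantees.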
Introduction to Amazon AWS & EMR
AWS is a cloud infrastructure that provides
• Elastic Capacity
• Quick and Easy Deployment
• No CapEx, No initial investment
• Pay as you go, for what you use
• Automation & Reusable components
Amazon EMR: Hadoop in the Cloud
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open Source
• Ecosystem of tools
• Batch and real-time analytics
• Amazon EMR is an easy way to run Hadoop in the cloud
• Now let us run the same example we did on a single-node cluster on EMR and see how feasible it is
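One way to submit such a job to EMR from code is to describe it as a step. The Python sketch below only builds the step definition. The bucket, script names, and the `streaming_step` and `submit` helpers are hypothetical. It also assumes an EMR release that ships `command-runner.jar` (4.x or later), which is newer than this talk, so the demo itself may have been launched differently.

```python
def streaming_step(name, mapper, reducer, input_uri, output_uri, files):
    """Build one EMR step that runs a Hadoop Streaming job.

    The dict follows the shape of an entry in the `Steps` list accepted by
    boto3's EMR client (`run_job_flow` / `add_job_flow_steps`).
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", ",".join(files),
                "-mapper", mapper,
                "-reducer", reducer,
                "-input", input_uri,
                "-output", output_uri,
            ],
        },
    }


# Hypothetical bucket and script names, for illustration only.
step = streaming_step(
    name="word-count",
    mapper="mapper.py",
    reducer="reducer.py",
    input_uri="s3://example-bucket/input/",
    output_uri="s3://example-bucket/output/",
    files=["s3://example-bucket/code/mapper.py",
           "s3://example-bucket/code/reducer.py"],
)


def submit(cluster_id, steps):
    """Submit steps to a running cluster; needs boto3 and AWS credentials."""
    import boto3  # imported here so the sketch runs without boto3 installed
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)


print(step["HadoopJarStep"]["Args"])
```

`submit` is defined but not called, because it would need a real cluster id and credentials. Input and output live in S3 rather than HDFS, so the cluster can be shut down once the step finishes.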
Thank You !!
https://www.facebook.com/abctalks

More Related Content

What's hot

Apache Spark™ is here to stay
Apache Spark™ is here to stayApache Spark™ is here to stay
Apache Spark™ is here to stayGiovanna Roda
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigDataThanusha154
 
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersScalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersDatabricks
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit
 
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetupSteve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetupbigdatalondon
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Petr Zapletal
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceBhupesh Chawda
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixJeff Magnusson
 
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...Cloudera, Inc.
 
Hadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveHadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveJoydeep Sen Sarma
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016
Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016
Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016Sergio Fernández
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2Aswini Ashu
 
Filtering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkFiltering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkDatabricks
 

What's hot (20)

Apache Spark™ is here to stay
Apache Spark™ is here to stayApache Spark™ is here to stay
Apache Spark™ is here to stay
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigData
 
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersScalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier Aguedes
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetupSteve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
F07-Cloud-Hadoop-BAM
F07-Cloud-Hadoop-BAMF07-Cloud-Hadoop-BAM
F07-Cloud-Hadoop-BAM
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J...
 
Hadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveHadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspective
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016
Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016
Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
Filtering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkFiltering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache Spark
 

Similar to Map Reduce along with Amazon EMR

CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingPalani Kumar
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayJosef Adersberger
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...redpel dot com
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...Yahoo Developer Network
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationKnoldus Inc.
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationKnoldus Inc.
 

Similar to Map Reduce along with Amazon EMR (20)

CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_Computing
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Ruby in the Clouds
Ruby in the CloudsRuby in the Clouds
Ruby in the Clouds
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
 

Recently uploaded

ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Recently uploaded (20)

ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Map Reduce along with Amazon EMR

  • 1. Map Reduce along with Amazon EMR Sampath Rachakonda & Siva Krishna Battu Bigdata Analytics on Cloud Meetup 14th March 2015 http://www.meetup.com/abctalks
  • 2. Agenda http://www.meetup.com/abctalks  Introduction to BigData and Hadoop MapReduce  Core Hadoop and its Ecosystem  Use Cases  Hadoop Installation on Windows/Ubuntu  Work flow on Map Reduce  M-R on EMR  Real-time examples on EMR  What's Next ? Hadoop 2.0!
  • 3. Introduction to BigData
      • Big Data is the latest buzzword, but the data itself is also an opportunity
  • 4. BigData – What & How
  • 5. Contd..
      • Extremely large datasets that are hard to deal with using relational databases
      • Storage/Cost
      • Search/Performance
      • Analytics and Visualization
      • Need for parallel processing on hundreds of machines
      • ETL cannot complete within a reasonable amount of time
      • Once a daily run takes longer than 24 hours, it can never catch up
  • 7. Solution to handle BigData
      • Distributed file system
      • The system shall manage and heal itself
      • Automatically route around failure
      • Speculatively execute redundant tasks based on performance
      • Performance scales linearly
      • Proportional change in capacity with resource change
      • Compute should move to data
      • Lower latency, lower bandwidth
  • 8. Introduction to Apache Hadoop
      • What is Hadoop? A scalable, fault-tolerant grid operating system for data storage and processing.
      • Open Source, Apache License
      • Works with structured and unstructured data
      • HDFS: fault-tolerant, high-bandwidth clustered storage
      • Commodity hardware
      • Master (name-node) – Slave architecture
      • MapReduce: distributed data processing
  • 9. Hadoop Cluster
      • A set of “cheap” commodity hardware
      • Networked together
      • Resides in the same location, in a set of racks in a data centre
      • No supercomputers; uses unreliable commodity hardware
  • 10. Hadoop System Principles
      • Scale out rather than scale up
      • Bring code to data rather than data to code
      • Deal with failures; they are common
      • Abstract the complexity of distributed and concurrent applications
  • 11. RDBMS Vs Hadoop
      • Before Hadoop, many applications used an RDBMS such as Oracle, MySQL or Sybase for batch processing
      • Hadoop doesn't fully replace the RDBMS in the architecture
      • RDBMS products scale up rather than scale out, with limits in the 100s of terabytes
      • Structured vs. unstructured data
      • Offline batch (Hadoop) vs. online transactions (RDBMS)
  • 12. Hadoop + RDBMS Complement Each Other
      • Example: a small website with a small number of users that generates a large amount of audit logs
      • Web server (1) --> RDBMS --> (2) & (4) --> Hadoop (3)
      • (1) Use the RDBMS for a rich user interface and to enforce data integrity
      • (2) The RDBMS generates lots of audit logs; the logs are moved periodically to the Hadoop cluster
      • (3) All logs are kept and processed in Hadoop for various analytics
      • (4) Results from the Hadoop cluster are stored back in the RDBMS to be used by the web server, e.g. suggestions based on audit history
  • 13. Hadoop Ecosystem
      • Hadoop mainly comprises two core components:
      • HDFS (Hadoop Distributed File System) to store data
      • MapReduce (distributed data processing framework) to process data
  • 14. HDFS (Hadoop Distributed File System)
      • A scalable, fault-tolerant, high-performance distributed file system
      • Asynchronous replication
      • Write-Once Read-Many (WORM)
      • Hadoop cluster with a minimum of 3 DataNodes
      • Data is divided into 64 MB (default) or 128 MB blocks; each block is replicated 3 times by default
      • No RAID required for DataNodes
      • Interfaces: Java, Thrift, C library, FUSE, WebDAV, HTTP, FTP
      • The NameNode holds the file system metadata
      • Files are broken up and spread over DataNodes
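The block size and replication factor on this slide translate directly into a block count and a raw storage figure. A minimal sketch of that arithmetic in plain Python follows; the function name `hdfs_footprint` is hypothetical and not part of any Hadoop API, and the defaults are simply the 64 MB block size and replication factor of 3 quoted above.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, raw MB stored across the cluster) for one file."""
    # The file is cut into fixed-size blocks; the last block may be partly filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partly filled last block only occupies its actual size on disk,
    # so raw usage is the file size multiplied by the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

# A 1 GB file with the defaults from the slide: 16 blocks, 3072 MB of raw storage.
blocks, raw_mb = hdfs_footprint(1024)
print(blocks, raw_mb)
```

With 128 MB blocks the same file needs only 8 blocks, which means fewer entries for the NameNode to track while the raw storage stays the same.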
  • 15. MapReduce (Distributed data processing framework)
      • Software framework for distributed computation
      • Input | map() | Copy/Sort | reduce() | Output
      • The JobTracker schedules and manages jobs
      • A TaskTracker executes individual map() and reduce() tasks on each cluster node
  • 16. HDFS – Read File
  • 17. HDFS - Write File
  • 18. MapReduce - Executing File
      • The client program is copied to each node
      • The JobTracker determines the number of splits from the input path, then selects TaskTrackers based on their network proximity to the data sources
      • The JobTracker sends task requests to the selected TaskTrackers
      • Each TaskTracker starts the map phase by extracting the input data from its splits
      • Once a map task completes, the TaskTracker notifies the JobTracker
      • When all TaskTrackers have completed the map phase, the JobTracker notifies the TaskTrackers selected for the reduce phase
      • Each of these TaskTrackers reads the region files remotely and invokes the reduce function, which collects the key/aggregated value pairs into the output file (one per reducer node)
      • After both the map and reduce phases are completed, the JobTracker unblocks the client program
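The "network proximity" step above can be illustrated with a small sketch. This is a deliberately simplified, greedy model in plain Python, not the actual JobTracker scheduler: the function `assign_map_tasks`, the node names and the one-slot-per-tracker rule are all assumptions made for illustration.

```python
def assign_map_tasks(split_locations, free_trackers):
    """Assign each input split to a tracker, preferring one that holds the data.

    split_locations: dict of split id -> list of nodes holding a replica
    free_trackers:   list of node names with a free map slot (one slot each here)
    """
    free = list(free_trackers)
    assignment = {}
    for split, replicas in split_locations.items():
        # Prefer a free tracker that already stores a replica of this split.
        local = [node for node in replicas if node in free]
        chosen = local[0] if local else (free[0] if free else None)
        if chosen is None:
            break  # no free slots left; remaining splits wait for a later round
        assignment[split] = chosen
        free.remove(chosen)
    return assignment

# Hypothetical cluster: three splits, each replicated on three nodes.
splits = {
    "split-0": ["node1", "node2", "node3"],
    "split-1": ["node2", "node3", "node4"],
    "split-2": ["node1", "node4", "node5"],
}
plan = assign_map_tasks(splits, ["node2", "node4", "node6"])
print(plan)
```

Here the first two splits run on a node that already holds their data, while the third falls back to a remote tracker and has to read its input over the network.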
  • 19. Java MapReduce Example
      • Let us start with the basic word count example, which makes the workflow easy to understand
      • Let us now dive into the word count demo and see how the mapper and reducer functions work, and more
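The Java demo itself is not part of this transcript. As a stand-in, the word count flow (Input | map() | Copy/Sort | reduce() | Output) can be simulated in a few lines of plain Python. This sketch runs in a single process and does not use the Hadoop API; the function names `map_phase`, `shuffle_sort` and `reduce_phase` are hypothetical labels for the three stages.

```python
from collections import defaultdict

def map_phase(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle_sort(pairs):
    """Copy/Sort: group all values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """reduce(): aggregate the list of counts for one word."""
    return key, sum(values)

def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle_sort(mapped))

counts = word_count(["big data on hadoop", "hadoop on the cloud"])
print(counts)
```

In a real cluster the map calls run in parallel on different nodes and the grouping step moves data across the network, but the contract of each stage is the same as in this sketch.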
  • 20. Introduction to Amazon AWS & EMR
      • AWS is a cloud infrastructure which provides:
      • Elastic capacity
      • Quick and easy deployment
      • No CapEx, no initial investment
      • Pay as you go, for what you use
      • Automation & reusable components
  • 21. Amazon EMR: Hadoop in the Cloud
      • Scalable and fault tolerant
      • Flexibility for multiple languages and data formats
      • Open source
      • Ecosystem of tools
      • Batch and real-time analytics
      • Amazon EMR is the easiest way to run Hadoop in the cloud
      • Now let us run the same example we did on the single-node cluster on EMR and look at the feasibility of doing it
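The EMR demo is not included in the transcript. One way to run a word count on EMR is as a Hadoop Streaming step, and the sketch below only builds the step definition as a Python dict. The assumptions are significant: the bucket name, the `code/`, `input/` and `output/` layout, and the script names are hypothetical; the dict follows the shape accepted in the `Steps` list of the boto3 EMR client calls `run_job_flow` and `add_job_flow_steps`; and `command-runner.jar` is available on EMR release 4.x and later, which post-dates this talk. Nothing is submitted to AWS.

```python
def streaming_step(name, bucket, mapper="mapper.py", reducer="reducer.py"):
    """Build one EMR step definition for a Hadoop Streaming word count job."""
    base = f"s3://{bucket}"  # hypothetical bucket layout: code/, input/, output/
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                # Ship the mapper and reducer scripts to every task.
                "-files", f"{base}/code/{mapper},{base}/code/{reducer}",
                "-mapper", mapper,
                "-reducer", reducer,
                "-input", f"{base}/input/",
                "-output", f"{base}/output/",
            ],
        },
    }

step = streaming_step("word-count", "example-bucket")
print(step["HadoopJarStep"]["Args"])
```

Submitting the step would additionally need AWS credentials and a cluster definition, and the output location must not exist before the job runs.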

Editor's Notes

  1. http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/