Apache Hadoop
- Large Scale Data Processing
Sharath Bandaru & Sai Dinesh Koppuravuri
Advanced Topics Presentation
ISYE 582 :Engineering Information Systems
Overview
• Understanding Big Data
• Structured/Unstructured Data
• Limitations Of Existing Data Analytics Architecture
• Apache Hadoop
• Hadoop Architecture
• HDFS
• Map Reduce
• Conclusions
• References
Understanding Big Data
Big Data is creating large and growing files
Measured in:
Terabytes (10^12 bytes)
Petabytes (10^15 bytes)
Which are largely unstructured
Structured/Unstructured Data
Why now? Data Growth
[Chart: data growth, 1980 to 2013]
Structured data: 20%
Unstructured data: 80%
Source: Cloudera, 2013
Challenges posed by Big Data
Velocity:
• 400 million tweets a day on Twitter
• 1 million transactions by Wal-Mart every hour
Volume:
• 2.5 petabytes created by Wal-Mart transactions in an hour
Variety:
• Videos, photos, text messages, images, audio, documents, emails, etc.
Limitations Of Existing Data Analytics Architecture
[Diagram: data pipeline layers, top to bottom]
BI Reports + Interactive Apps
RDBMS (aggregated data)
ETL Compute Grid
Storage Only Grid (original raw data)
Collection
Instrumentation
Limitations:
• Moving data to compute doesn’t scale
• Can’t explore original high-fidelity raw data
• Archiving = premature data death
So What is Apache Hadoop?
• A set of tools that supports running applications on big data.
• Core Hadoop has two main systems:
- HDFS: self-healing, high-bandwidth clustered storage.
- Map Reduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
History
Source : Cloudera, 2013
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place, which transforms the data into the database's internal structure.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
Pros:
• Read is fast
• Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
Pros:
• Load is fast
• Flexibility/Agility
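The schema-on-read idea above can be illustrated with a small sketch. It is not Hadoop or Hive code: the raw records, the column names, and the helper `read_with_schema` are all hypothetical, and Python's standard `csv` module stands in for a SerDe. The point it tries to show is that the stored data never changes; only the schema applied at read time does.

```python
import csv

# Hypothetical raw records, stored exactly as they arrived (no load-time schema).
RAW_STORE = [
    "2013-04-13,alice,login",
    "2013-04-13,bob,purchase,19.99",  # a newer record carrying an extra field
]

def read_with_schema(raw_lines, columns):
    """Apply a schema at read time (late binding).

    Fields are named by position; records shorter than the schema are
    padded with None, and fields beyond the schema are ignored.
    """
    rows = []
    for record in csv.reader(raw_lines):
        padded = record + [None] * (len(columns) - len(record))
        rows.append(dict(zip(columns, padded)))
    return rows

# The old "SerDe" knows three columns; the updated one knows four.
v1 = read_with_schema(RAW_STORE, ["date", "user", "event"])
v2 = read_with_schema(RAW_STORE, ["date", "user", "event", "amount"])
```

With the updated column list, the `amount` field of the second record becomes visible retroactively, without reloading or rewriting anything in `RAW_STORE`.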
Use The Right Tool For The Right Job
Relational Databases:
Use when:
• Interactive OLAP Analytics (< 1 sec)
• Multistep ACID transactions
• 100% SQL compliance
Hadoop:
Use when:
• Structured or Not (Flexibility)
• Scalability of Storage/Compute
• Complex Data Processing
Traditional Approach
[Diagram labels: Big Data, Powerful Computer, Processing limit]
Enterprise Approach:
Hadoop Architecture
[Diagram: one master node and several slave nodes]
Master: Job Tracker, Name Node, Task Tracker, Data Node
Slaves: each runs a Task Tracker and a Data Node
Map Reduce layer: Job Tracker and Task Trackers
HDFS layer: Name Node and Data Nodes
Hadoop Architecture
[Diagram: the same cluster, with an Application added]
Job Tracker
[Diagram: the same cluster with an Application, shown over two slides]
HDFS: Hadoop Distributed File System
• A given file is broken into blocks (default = 64 MB), then the blocks are replicated across the cluster (default = 3 replicas).
[Diagram: a file split into blocks 1 to 5; HDFS stores the replicas on five nodes]
Node 1: blocks 3, 4, 5
Node 2: blocks 1, 2, 5
Node 3: blocks 1, 3, 4
Node 4: blocks 2, 4, 5
Node 5: blocks 1, 2, 3
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
Block replication for:
• Durability
• Availability
• Throughput
Block replicas are distributed across servers and racks
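As a rough illustration of the block-and-replica idea, the sketch below splits a file size into 64 MB blocks and assigns each block to three distinct nodes. The round-robin placement, the node names, and the function names are assumptions made for this example; real HDFS placement is decided by the Name Node and takes racks into account.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default block size quoted on the slide
REPLICATION = 3                # default replication factor quoted on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_after_failure(layout, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(any(n != failed_node for n in replicas) for replicas in layout.values())

# Hypothetical 300 MB file on a hypothetical five-node cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4", "node5"])
```

A 300 MB file yields four full blocks plus one 44 MB block, and because every block sits on three different nodes, losing any single node leaves all blocks readable.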
Fault Tolerance for Data
[Diagram: the master/slave cluster; label: HDFS]
Fault Tolerance for Processing
[Diagram: the master/slave cluster; label: Map Reduce]
Fault Tolerance for Processing
[Diagram: the master/slave cluster; label: Tables are backed up]
Map Reduce
[Diagram: Input Data → five Map tasks → Shuffle → two Reduce tasks → Results]
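The map, shuffle, and reduce stages can be sketched in a few lines of single-process Python. This is a toy word count, not the Hadoop API: the function names and the input lines are made up for illustration, and a real job would run the map and reduce tasks on different machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the value list of one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run map over every line, shuffle the pairs, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = run_job(["big data big files", "data grows"])
```

Each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, which is what lets a framework spread them across many workers.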
Understanding the concept of Map Reduce
The Story Of Sam
[Picture labels: Mother, Sam, An Apple]
• Believed “an apple a day keeps a doctor away”
Understanding the concept of Map Reduce
• Sam thought of “drinking” the apple
• He used a knife to cut the apple and a blender to make juice.
Understanding the concept of Map Reduce
Next day
• Sam applied his invention to all the fruits he could find in the fruit basket
• (map ‘( )’)
• (reduce ‘( )’)
Classical notion of Map Reduce in functional programming:
A list of values is mapped into another list of values, which is then reduced into a single value
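The classical functional notion can be tried directly with Python's built-in `map` and `functools.reduce`. The fruit weights and the halving rule below are invented numbers used only to make the example concrete.

```python
from functools import reduce

# Hypothetical basket: weight of each fruit in grams.
fruit_weights_g = [150, 200, 180]

def to_juice_ml(weight_g):
    """Map step: turn one fruit into a juice amount (assumed: half its weight)."""
    return weight_g // 2

def pour_together(total_ml, glass_ml):
    """Reduce step: combine the running total with one more glass."""
    return total_ml + glass_ml

juice_ml = list(map(to_juice_ml, fruit_weights_g))  # a list mapped into another list
total_ml = reduce(pour_together, juice_ml, 0)       # the list reduced into one value
```

Here `juice_ml` holds one value per fruit and `total_ml` is the single value left after the reduction.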
Understanding the concept of Map Reduce
18 Years Later
• Sam got his first job at “Tropicana” for his expertise in making juices.
• Now, it’s not just one basket but a whole container of fruits
• Also, they produce a list of juice types separately
Large data and a list of values for output
• But Sam had just ONE knife and ONE blender
NOT ENOUGH!!
Wait!
Understanding the concept of Map Reduce
Brave Sam
Implemented a parallel version of his innovation
(<a, > , <o, > , <p, > , …)
Each input to a map is a list of <key, value> pairs
Each output of a map is a list of <key, value> pairs
(<a’, > , <o’, > , <p’, > , …)
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism)
e.g. <a’, ( …)>
Reduced into a list of values
Understanding the concept of Map Reduce
• Sam realized,
– To create his favorite mixed fruit juice, he can use a combiner after the reducers
– If several <key, value-list> pairs fall into the same group (based on the grouping/hashing algorithm), then use the blender (reducer) separately on each of them
– The knife (mapper) and blender (reducer) should not contain residue after use, i.e., they should be side-effect free
Source: (Ekanayake, 2010).
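Two of the points above, grouping by hash and side-effect-free reducers, can be sketched as follows. The CRC32-based `partition` function, the reducer count, and the juice amounts are assumptions chosen for the example; they are not how any particular Hadoop partitioner is implemented.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key with a deterministic hash (CRC32, illustrative)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def blend(key, values):
    """Pure reducer: the result depends only on the arguments (no residue)."""
    return key, sum(values)

def run_reducers(grouped, num_reducers):
    """Send each <key, value-list> to a reducer and blend every key separately."""
    buckets = defaultdict(list)
    for key, values in grouped.items():
        buckets[partition(key, num_reducers)].append((key, values))
    results = {}
    for items in buckets.values():
        # One reducer may receive several keys; each is reduced on its own.
        for key, values in items:
            out_key, out_value = blend(key, values)
            results[out_key] = out_value
    return results

# Hypothetical grouped map output: millilitres of juice per chopped piece.
grouped = {"apple": [2, 3], "orange": [1], "pear": [4, 4]}
juice = run_reducers(grouped, num_reducers=2)
```

Because `blend` keeps no state between calls, the totals come out the same however many reducers the keys are spread over.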
Conclusions
• The key benefits of Apache Hadoop:
1) Agility/ Flexibility (Quickest Time to Insight)
2) Complex Data Processing (Any Language, Any Problem)
3) Scalability of Storage/Compute (Freedom to Grow)
4) Economical Storage (Keep All Your Data Alive Forever)
• The key systems for Apache Hadoop are:
1) Hadoop Distributed File System : self-healing high-bandwidth
clustered storage.
2) Map Reduce : distributed fault-tolerant resource management
coupled with scalable data processing.
References
• Ekanayake, S. (2010, March). Map Reduce: The Story Of Sam. Retrieved April 13, 2013, from http://esaliya.blogspot.com/2010/03/mapreduce-explained-simply-as-story-of.html
• Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified Data Processing on Large Clusters.
• The Apache Software Foundation. (2013, April). Hadoop. Retrieved April 19, 2013, from http://hadoop.apache.org/
• Drost, I. (2010, February). Apache Hadoop: Large Scale Data Analysis Made Easy. Retrieved April 13, 2013, from http://www.youtube.com/watch?v=VFHqquABHB8
• Awadallah, A. (2011, November). Introducing Apache Hadoop: The Modern Data Operating System. Retrieved April 15, 2013, from http://www.youtube.com/watch?v=d2xeNpfzsYI
