Apache Hadoop: Large Scale Data Processing

Sharath Bandaru & Sai Dinesh Koppuravuri
Advanced Topics Presentation
ISYE 582: Engineering Information Systems

Overview
• Understanding Big Data
• Structured/Unstructured Data
• Limitations Of Existing Data Analytics Architecture
• Apache Hadoop
• Hadoop Architecture
• HDFS
• Map Reduce
• Conclusions
• References

Understanding Big Data
Big Data is creating large and growing files, measured in:
• Terabytes (10^12 bytes)
• Petabytes (10^15 bytes)
These files are largely unstructured.

Why now? Data Growth
Data growth, 1980 to 2013:
• Structured data: 20%
• Unstructured data: 80%
Source: Cloudera, 2013

Challenges posed by Big Data
Velocity, Volume, Variety
• 400 million tweets a day on Twitter
• 1 million transactions by Wal-Mart every hour
• 2.5 petabytes created by Wal-Mart transactions in an hour
• Videos, photos, text messages, images, audio, documents, emails, etc.

Limitations Of Existing Data Analytics Architecture
Layers of the existing stack:
• BI Reports + Interactive Apps
• RDBMS (aggregated data)
• ETL Compute Grid
• Storage Only Grid (original raw data)
• Collection
• Instrumentation
Limitations:
• Moving data to compute doesn't scale.
• Can't explore the original high-fidelity raw data.
• Archiving = premature data death.

So What is Apache Hadoop?
• A set of tools that supports running applications on big data.
• Core Hadoop has two main systems:
- HDFS: self-healing, high-bandwidth clustered storage.
- Map Reduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.

History
Source : Cloudera, 2013

The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• The schema must be created before any data can be loaded.
• An explicit load operation has to take place, which transforms the data into the database's internal structure.
• New columns must be added explicitly before new data for those columns can be loaded into the database.
Pros of Schema-on-Write:
• Read is fast
• Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
Pros of Schema-on-Read:
• Load is fast
• Flexibility/Agility
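The schema-on-read idea can be illustrated with a minimal Python sketch. It is not Hadoop or Hive code: the JSON-lines store, the read_with_schema function and the field names are all assumptions made up for this example. It only shows that raw records are stored untouched and that a column list is applied when reading, so a newly parsed column shows up retroactively for data that was already loaded.

```python
import json

# Raw records are appended to the "file store" untouched; no schema is enforced on load.
# (Hypothetical data, for illustration only.)
raw_store = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "country": "US"}',  # an extra field arrives early
]

def read_with_schema(lines, columns):
    """Apply a schema at read time: parse each raw line and keep only `columns`."""
    for line in lines:
        record = json.loads(line)
        # Missing fields become None instead of failing the load.
        yield {col: record.get(col) for col in columns}

# The first version of the reader only knows two columns.
v1 = list(read_with_schema(raw_store, ["user", "clicks"]))

# Later the reader is updated; "country" appears retroactively for stored data.
v2 = list(read_with_schema(raw_store, ["user", "clicks", "country"]))
```

Nothing in raw_store changed between the two reads; only the reader did, which is the late-binding behaviour the slide attributes to a SerDe.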

Use The Right Tool For The Right Job
Relational Databases. Use when:
• Interactive OLAP analytics (< 1 sec)
• Multistep ACID transactions
• 100% SQL compliance
Hadoop. Use when:
• Structured or not (flexibility)
• Scalability of storage/compute
• Complex data processing

Traditional Approach
Enterprise Approach: Big Data -> Powerful Computer -> Processing limit

Hadoop Architecture
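[Diagram: a cluster with one Master and several Slaves.]
• Master: Job Tracker, Name Node, Task Tracker, Data Node
• Slaves (three shown): each runs a Task Tracker and a Data Node
• Map Reduce layer: Job Tracker and Task Trackers
• HDFS layer: Name Node and Data Nodes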

Hadoop Architecture
[Diagram: the master/slave cluster (Job Tracker and Name Node on the Master; a Task Tracker and a Data Node on every node), shown with an Application.]

Job Tracker
[Diagram: the master/slave cluster (Job Tracker and Name Node on the Master; a Task Tracker and a Data Node on every node), shown with an Application.]

HDFS: Hadoop Distributed File System
• A given file is broken into blocks (default = 64 MB), then the blocks are replicated across the cluster (default = 3 replicas).
[Diagram: a file of blocks 1 to 5 stored in HDFS on five nodes holding blocks (3, 4, 5), (1, 2, 5), (1, 3, 4), (2, 4, 5) and (1, 2, 3), so each block has three replicas.]
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
Block replication for:
• Durability
• Availability
• Throughput
Block replicas are distributed across servers and racks.
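The block and replica arithmetic above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS code: the node names and the round-robin placement policy are assumptions, and real HDFS placement also takes racks into account. The sketch splits a 300 MB file into 64 MB blocks, places three replicas of each block on distinct nodes, and checks how many replicas survive the loss of one node.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size quoted on the slide
REPLICATION = 3                 # default replication factor quoted on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks needed to hold file_size bytes."""
    return math.ceil(file_size / block_size)

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy for illustration only; it ignores racks.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def surviving_replicas(placement, failed_node):
    """Count the replicas of each block that remain after one node fails."""
    return {
        block: len([n for n in holders if n != failed_node])
        for block, holders in placement.items()
    }

nodes = ["node1", "node2", "node3", "node4", "node5"]  # assumed node names
num_blocks = split_into_blocks(300 * 1024 * 1024)      # a 300 MB file
placement = place_replicas(num_blocks, nodes)
after_failure = surviving_replicas(placement, "node3")
```

Because every block lives on three different nodes, a single node failure leaves at least two copies of each block, which is the durability and availability argument made on the slide.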

Fault Tolerance for Data
[Diagram: the master/slave cluster (Job Tracker and Name Node on the Master; a Task Tracker and a Data Node on every node), labeled HDFS.]

Fault Tolerance for Processing
[Diagram: the master/slave cluster (Job Tracker and Name Node on the Master; a Task Tracker and a Data Node on every node), labeled Map Reduce.]

Fault Tolerance for Processing
[Diagram: the master/slave cluster (Job Tracker and Name Node on the Master; a Task Tracker and a Data Node on every node).]
Tables are backed up

Map Reduce
[Diagram: data flow through a Map Reduce job.]
Input Data -> Map (five map tasks) -> Shuffle -> Reduce (two reduce tasks) -> Results
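The three stages of this flow can be sketched in plain Python. This is a single-process illustration, not Hadoop API code; word counting is chosen here as an assumed example, and the function names map_phase, shuffle_phase and reduce_phase are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_outputs):
    """Shuffle: group all values by key across every map output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the value list for one key into a single result."""
    return key, sum(values)

# Each string stands in for one input split handled by one map task.
splits = ["apple orange apple", "pear orange", "apple pear pear"]
mapped = [map_phase(s) for s in splits]
grouped = shuffle_phase(mapped)
results = dict(reduce_phase(k, v) for k, v in grouped.items())
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network; here the same steps simply run one after another.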

Understanding the concept of Map Reduce
The Story Of Sam
[Illustration: Mother, Sam, an apple]
• Mother believed “an apple a day keeps a doctor away”

• Sam thought of “drinking” the apple
• He used a knife to cut the apple and a blender to make juice.

Next day
• Sam applied his invention to all the fruits he could find in the fruit basket
• (map '( ... ))
• (reduce '( ... ))
Classical notion of Map Reduce in functional programming: a list of values is mapped into another list of values, which is then reduced into a single value.
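The functional-programming notion described above corresponds to the built-in map and functools.reduce in Python. The sketch below is only an illustration of that idea; the basket contents, the juice amounts and the squeeze function are hypothetical.

```python
from functools import reduce

# Hypothetical basket: fruit name -> millilitres of juice it yields.
basket = {"apple": 120, "orange": 90, "pear": 100}

def squeeze(fruit):
    """Map step: one fruit in, one juice amount out."""
    return basket[fruit]

# A list of values is mapped into another list of values...
juice_per_fruit = list(map(squeeze, ["apple", "orange", "pear"]))

# ...which is then reduced into a single value.
total_juice = reduce(lambda acc, ml: acc + ml, juice_per_fruit, 0)
```

Here map preserves the length of the list (three fruits, three amounts), while reduce folds the whole list into one number.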

18 Years Later
• Sam got his first job at “Tropicana” for his expertise in making juices.
• Now it's not just one basket but a whole container of fruits (large data).
• Also, they produce a list of juice types separately (a list of values for output).
• But Sam had just ONE knife and ONE blender. NOT ENOUGH!!
Wait!

Brave Sam
Sam implemented a parallel version of his innovation:
• Each input to a map is a list of <key, value> pairs, e.g. (<a, ...>, <o, ...>, <p, ...>, …)
• Each output of a map is a list of <key, value> pairs, e.g. (<a', ...>, <o', ...>, <p', ...>, …)
• The map outputs are grouped by key.
• Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <a', ( … )>
• Each reduce input is reduced into a list of values.
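How grouped keys end up at a particular reducer can be sketched as follows. This is an assumption-laden illustration, not Hadoop's partitioner: the use of CRC32 as the hash, the number of reducers, and the names partition and group_for_reducers are all chosen for this example. The point is only that every value for a given key reaches the same reducer, and that one reducer may receive several <key, value-list> entries.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of reducers for this sketch

def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key with a stable hash (CRC32, chosen as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def group_for_reducers(map_output, num_reducers=NUM_REDUCERS):
    """Build, per reducer, a dict of key -> value-list from <key, value> pairs."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers

# Hypothetical map output: <key, value> pairs keyed by fruit initial.
map_output = [("a", "slice1"), ("o", "slice2"), ("a", "slice3"), ("p", "slice4")]
reducer_inputs = group_for_reducers(map_output)
```

A stable hash is used instead of Python's built-in hash() because the latter is randomized per process for strings, which would send the same key to different reducers on different runs.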

• Sam realized:
- To create his favorite mixed fruit juice, he can use a combiner after the reducers.
- If several <key, value-list> entries fall into the same group (based on the grouping/hashing algorithm), then the blender (reducer) is used separately on each of them.
- The knife (mapper) and blender (reducer) should not contain residue after use: they must be side-effect free.
Source: Ekanayake, 2010.

Conclusions
• The key benefits of Apache Hadoop:
1) Agility/Flexibility (Quickest Time to Insight)
2) Complex Data Processing (Any Language, Any Problem)
3) Scalability of Storage/Compute (Freedom to Grow)
4) Economical Storage (Keep All Your Data Alive Forever)
• The key systems for Apache Hadoop are:
1) Hadoop Distributed File System: self-healing, high-bandwidth clustered storage.
2) Map Reduce: distributed, fault-tolerant resource management coupled with scalable data processing.

References
• Ekanayake, S. (2010, March). Map Reduce: The Story Of Sam. Retrieved April 13, 2013, from http://esaliya.blogspot.com/2010/03/mapreduce-explained-simply-as-story-of.html
• Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified Data Processing on Large Clusters.
• The Apache Software Foundation. (2013, April). Hadoop. Retrieved April 19, 2013, from http://hadoop.apache.org/
• Drost, I. (2010, February). Apache Hadoop: Large Scale Data Analysis made Easy. Retrieved April 13, 2013, from http://www.youtube.com/watch?v=VFHqquABHB8
• Awadallah, A. (2011, November). Introducing Apache Hadoop: The Modern Data Operating System. Retrieved April 15, 2013, from http://www.youtube.com/watch?v=d2xeNpfzsYI
