1. HADOOP & DISTRIBUTED CLOUD COMPUTING
DATA PROCESSING IN THE CLOUD
Presentation by: Rajan Kumar Upadhyay || rajan24oct@gmail.com
2. CLOUD COMPUTING?
Cloud computing is a virtualized setup that includes the following:
- Delivery of computing as a service rather than as a product
- Shared resources (software, utilities, hardware) provided over a network (typically the Internet)
[Diagram: Delivery of computing · Public utilities · Shared resources]
3. DISTRIBUTED CLOUD COMPUTING
As the name suggests: distributed computing in the cloud.
Examples:
• Distributed computing is nothing more than utilizing many networked computers to partition a question or problem (splitting it into many smaller pieces) and allowing the network to solve the issue piecemeal.
• Software like Hadoop. Written in Java, Hadoop is a scalable, efficient, distributed software platform designed to process enormous amounts of data. Hadoop can scale to thousands of computers across many clusters.
• Another instance of distributed computing, for storage instead of processing power, is BitTorrent. A torrent is a file that is split into many pieces and stored on many computers around the Internet. When a local machine wants to access that file, the small pieces are retrieved and reassembled.
• P2P networks, which split communication/data packets into multiple pieces sent across multiple network routes, then reassemble them at the receiver's end.
Distributed computing on the cloud is nothing but a next-generation framework for extracting the maximum value from resources over a distributed architecture.
4. WHAT IS HADOOP
A flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.
Why Hadoop?
A common infrastructure pattern extracted from building distributed systems:
• Scale
• Incremental growth
• Cost
• Flexibility
• Distributed File System
• Distributed Processing Framework
Widely adopted:
• Apache.org open source project
• Used by Yahoo!, Facebook, Google, Fox, Amazon, IBM; the NY Times uses it for their core infrastructure
• A valuable and reusable skill set: taught at major universities, easier to hire for, easier to train on, portable across projects and groups
5. HOW IT WORKS
HDFS: Hadoop Distributed File System
A distributed file system for large data
• Your data in triplicate (one local and two remote copies)
• Built-in redundancy, resiliency to large-scale failures (automated restart and re-allocation)
• Intelligent distribution, striping across racks
• Accommodates very large data sizes on commodity hardware
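The split-and-replicate scheme above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not how HDFS is actually implemented: real HDFS uses large blocks (128 MB by default) and rack-aware placement, and the tiny block size and node names here are made up for demonstration.

```python
# Conceptual sketch of HDFS-style storage: split data into fixed-size
# blocks and place each block on 3 distinct nodes (1 "local" + 2 remote).
# Illustration only -- real HDFS uses 128 MB blocks and rack-aware placement.
import itertools

BLOCK_SIZE = 8          # bytes per block (tiny, for demonstration)
NODES = ["node1", "node2", "node3", "node4", "node5"]
REPLICATION = 3

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Chop the input into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for idx, _ in enumerate(blocks):
        placement[idx] = [next(node_cycle) for _ in range(replication)]
    return placement

data = b"hello distributed world!"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, NODES)
for idx, nodes_for_block in placement.items():
    print(f"block {idx} ({blocks[idx]!r}) -> {nodes_for_block}")
```

Because every block lives on three different nodes, losing any single machine leaves two intact copies, which is what makes the automated restart and re-allocation above possible.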
6. PROGRAMMING MODEL
There are various programming models for Hadoop development. I personally like, and have experience with, Map/Reduce.
Why Map/Reduce?
• Simple programming technique:
  • Map(anything) -> (key, value)
  • Sort, partition on key
  • Reduce(key, values) -> (key, value)
• No explicit parallel-processing / message-passing semantics required
• Programmable in Java or any other language
Continued …
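The three steps above (map, sort/partition on key, reduce) can be sketched as a single-process word count in plain Python. Hadoop runs map and reduce tasks in parallel across the cluster; here each phase is an ordinary function so the data flow is easy to follow.

```python
# Single-process sketch of the Map/Reduce flow (word count).
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map(anything) -> (key, value): emit (word, 1) for every word."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(key, values):
    """Reduce(key, values) -> (key, value): sum the counts for one word."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle step: sort/partition the mapped (key, value) pairs on key.
pairs = sorted(map_phase(lines), key=itemgetter(0))
counts = dict(
    reduce_phase(key, (v for _, v in group))
    for key, group in groupby(pairs, key=itemgetter(0))
)
print(counts)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Because every pair with the same key ends up in the same group after the sort, each reduce call is independent, which is exactly what lets Hadoop scatter the reduce work across many machines.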
7. PROGRAMMING MODEL
[Pipeline diagram:]
1. Create/allocate cluster
2. Put data into the file system: data is split into blocks, stored in triplicate across your cluster
3. Run program, moving computation to the data: your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
4. Execution: gather output of map; sort, partition on key; reduce task
5. Results of the job are stored on HDFS
8. PRACTICES
Put large data source into HDFS
Perform aggregations, transformations, normalizations on
the data
Load into RDBMS
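A toy end-to-end version of this practice, with an in-memory list standing in for data already sitting in HDFS and SQLite standing in for the target RDBMS. In a real deployment the aggregation step would be a MapReduce job and the load might use a dedicated export tool; the record values and table name here are invented for illustration.

```python
# Toy version of the pipeline above: aggregate "large" raw data, then load
# the summarized result into an RDBMS (SQLite stands in for the database).
import sqlite3
from collections import Counter

# 1. "Large data source" -- pretend these page-view records live in HDFS.
raw_records = ["home", "search", "home", "checkout", "home", "search"]

# 2. Aggregation/normalization (in Hadoop this would be a MapReduce job).
page_views = Counter(page.strip().lower() for page in raw_records)

# 3. Load the aggregated result into the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT PRIMARY KEY, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views (page, views) VALUES (?, ?)",
    page_views.items(),
)
conn.commit()

rows = conn.execute(
    "SELECT page, views FROM page_views ORDER BY views DESC"
).fetchall()
print(rows)  # [('home', 3), ('search', 2), ('checkout', 1)]
```

The point of the pattern is that only the small aggregated result reaches the RDBMS; the heavy lifting over the raw data stays in the Hadoop cluster.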
9. THANK YOU
Thank you for reading this; I hope you find it useful. Please contact me at rajan24oct@gmail.com if you have any queries or feedback. My name is Rajan Kumar Upadhyay, and I have more than 10 years of collective IT experience as a techie.
If you have anything to share, or are looking for consulting, please feel free to contact me.