1. GANDHI INSTITUTE FOR TECHNOLOGICAL
ADVANCEMENT, BHUBANESWAR
TECHNICAL SEMINAR ON
HADOOP
GUIDED BY-
PROF. KUNDAN CHANDRA PATRA
PROF. SWOGAT KUMAR JENA
PROF. SAROJ KUMAR MOHANTY
PRESENTED BY-
NAME- ABHIJEET RAJ
BRANCH- CSE(1)
REG NO.- 1301287529
2. CONTENTS -
1. INTRODUCTION TO HADOOP
2. HADOOP-HISTORY AND ORIGIN
3. BIG DATA ANALYTICS AND CHALLENGES
4. HADOOP ECOSYSTEM
5. HDFS ARCHITECTURE
6. HADOOP VS RDBMS
7. MAP REDUCE
8. PIG AND HIVE
9. CONCLUSION
Abhijeet raj, 131001
3. INTRODUCTION-
• What is Hadoop?
• Apache Hadoop is an open-source software framework for distributed storage and processing of large data sets
• Written in Java
• Based on the Google File System (GFS)
4. Continued...
• It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.
• The Hadoop framework consists of two main layers:
• HDFS
• Map Reduce
5. History and Origin
• Doug Cutting began building an open-source search engine in 2003
• Google published papers on its distributed systems, MapReduce and the Google File System (GFS), which powered the Google search engine
6. Continued...
• Doug Cutting took these ideas and started to work on them in open source
• In 2006 he joined Yahoo!, and the distributed system was named Hadoop
• Yahoo! open-sourced it through the Apache Software Foundation
7. Organizations using Hadoop
• Amazon
• Adobe
• Cloudspace
• Ebay
• Facebook
• Google
• IBM
• LinkedIn
• Yahoo!
8. Big Data analytics and challenges
• Big Data files typically start at a minimum size of about 1 terabyte.
• The four V's of Big Data:
1. VOLUME - the scale of data
2. VARIETY - the different forms of data
3. VELOCITY - the analysis of streaming data
4. VERACITY - the uncertainty of data
9. Challenges for Big Data processing
• Meeting the need for speed
• Scale
• Continuous Availability
• Displaying meaningful results
• Workload diversity
• Data security
• Cost
• Manageability
10. Hadoop vs traditional RDBMS
Factors                 Hadoop                           RDBMS
Size of data            Petabytes                        Gigabytes
Integrity of data       Low                              High
Data schema             Dynamic                          Static
Access method           Batch                            Interactive and batch
Scaling                 Linear                           Non-linear
Data structure          Structured/unstructured          Structured
Normalization of data   Not required                     Required
Query response time     Has latency (batch processing)   Can be near-immediate
12. HDFS (Hadoop Distributed File System)
• A distributed file system designed to run on commodity hardware
• Suitable for distributed storage and processing
• The built-in servers of the namenode and datanodes let users easily check the status of the cluster
• HDFS provides file permissions and authentication
13. Continued...
Namenode
• The namenode stores the filesystem metadata, i.e. which file maps to which blocks and which blocks are stored on which datanodes.
Datanode
• The datanode is where the actual data resides.
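The namenode/datanode split can be sketched in miniature. The following is an illustrative Python toy model, not HDFS's real (Java) implementation: the names `place_file` and `split_into_blocks` are hypothetical, and the 128 MB block size, replication factor of 3, and round-robin placement are simplifying assumptions.

```python
# Toy model of HDFS metadata (illustrative only; real HDFS is far richer).
# Assumptions: 128 MB blocks, replication factor 3, round-robin placement.

BLOCK_SIZE = 128 * 1024 * 1024  # bytes
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(file_size):
    """Number of fixed-size blocks a file of file_size bytes occupies."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_file(name, file_size, namenode_meta, start=0):
    """Record file -> blocks and block -> datanodes, as a namenode would."""
    blocks = []
    for i in range(split_into_blocks(file_size)):
        block_id = f"{name}_blk{i}"
        # Choose REPLICATION distinct datanodes, round-robin.
        replicas = [DATANODES[(start + i + r) % len(DATANODES)]
                    for r in range(REPLICATION)]
        namenode_meta[block_id] = replicas
        blocks.append(block_id)
    namenode_meta[name] = blocks
    return blocks

meta = {}
blocks = place_file("/logs/web.log", 300 * 1024 * 1024, meta)
print(blocks)                      # 3 block ids for a 300 MB file
print(meta["/logs/web.log_blk0"])  # the 3 datanodes holding block 0
```

The point of the sketch: the namenode only ever holds this small mapping, while the (much larger) block contents live on the datanodes.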
14. Continued...
Job tracker
• The primary function of the job tracker is resource management: tracking resource availability and managing the task life cycle.
Task tracker
• Follows the orders of the job tracker and updates it with progress status periodically.
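The periodic reporting pattern described above can be sketched as follows. This is a minimal Python illustration of the idea, not Hadoop's actual RPC protocol; the class and method names are hypothetical.

```python
# Minimal sketch of the job tracker / task tracker reporting pattern.
# Names and structure are illustrative, not Hadoop's real classes.

class JobTracker:
    def __init__(self):
        self.status = {}  # tracker id -> last reported progress (0.0..1.0)

    def heartbeat(self, tracker_id, progress):
        """Receive a periodic status update from a task tracker."""
        self.status[tracker_id] = progress
        # In real Hadoop the reply can also carry new task assignments.
        return "continue" if progress < 1.0 else "done"

class TaskTracker:
    def __init__(self, tracker_id, jobtracker):
        self.tracker_id = tracker_id
        self.jobtracker = jobtracker

    def run_task(self, steps):
        """Do the work in steps, reporting progress after each one."""
        for i in range(1, steps + 1):
            order = self.jobtracker.heartbeat(self.tracker_id, i / steps)
        return order

jt = JobTracker()
tt = TaskTracker("tracker-1", jt)
print(tt.run_task(4))          # prints "done"
print(jt.status["tracker-1"])  # 1.0
```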
17. Map Reduce
• MapReduce is a processing technique and a programming model for distributed computing, based on Java
• Map: input data is broken into tuples (key/value pairs)
• Reduce: combines the tuples into a smaller form
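The two phases can be illustrated with word count, the canonical MapReduce example, written here as a plain-Python simulation (no Hadoop cluster involved; the function names are only for illustration):

```python
from collections import defaultdict

# Plain-Python simulation of the MapReduce phases for word count.

def map_phase(lines):
    """Map: break each input line into (word, 1) tuples."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group tuples by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's tuples into a smaller form (a total)."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In real Hadoop, the map and reduce calls run in parallel on many nodes and the shuffle happens over the network, but the data flow is exactly this.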
19. Advantages of Map Reduce
• Easy to scale data processing over multiple
computing nodes.
• Parallel processing.
• Fast.
• Simple programming model
20. HBASE
• Developed by Apache software foundation
• Database for Hadoop.
• Open source
• Non-relational
22. YARN
• Yet Another Resource Negotiator
• In YARN, the job tracker's responsibilities are split up: a global ResourceManager handles cluster resources, per-application ApplicationMasters handle scheduling and monitoring, and a NodeManager on each worker node replaces the task tracker
24. PIG
• A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
• The structure of these programs is amenable to substantial parallelization
26. HIVE
• Data warehouse software that facilitates querying and managing large datasets
• Allows traditional map/reduce programmers
to plug in their custom mappers and
reducers
27. PIG VS HIVE
                     PIG                       HIVE
Type of flow         Procedural language       Declarative language
Ease of use          Complex                   Easy
Nature of usage      Efficiency in computing   Analytics area
Type of data         Variables                 Tables
Debugging facility   Debugged locally          Complex
Maintenance          More                      Less
Development time     More                      Less
Handling big data    Handles more data         Memory overflow
29. Conclusion
• Hadoop has been a very effective solution for companies dealing with data in petabytes, i.e. big data.
• It has overcome the limitations of traditional data storage.
• Being open source, it is widely accepted.