2. About Me
• Founder and Managing Director of Axeldata Systems
• 13+ years involved in designing data architectures
• Previous life at Sun, Cisco, Oracle, Google, F5 Networks
• Working with Hadoop since 2011
• Certified as Cloudera Hadoop Developer, Administrator and HBase Specialist
• Authorized Cloudera Hadoop trainer
• Perspective: Hadoop is the foundation of scalable big data platforms
© 2013. Axeldata Systems FZ-LLC
3. Why Hadoop?
• RDBMS technology has served us well for 30+ years
• Excellent for low-latency, real-time, transaction-oriented data processing
• In the age of big data, RDBMS has many limitations:
– Volume: a shared-all architecture limits linear scalability and requires fork-lift upgrades of hardware infrastructure when limits are reached
– Variety: data has to fit nicely into rows and columns with a rigid schema; suitable for structured data, but fails to handle unstructured data
– Velocity: ingesting data at speed means you can't afford the time to shape data into the clean structures of relational databases
4. A Brief History of Hadoop
[Timeline diagram, 2002–2013:]
2002 – Nutch created
2003 – Google publishes GFS paper
2004 – Google publishes MapReduce paper
2005 – Nutch re-architecture
2006 – Hadoop becomes an Apache subproject
2007 – 1000-node Yahoo! cluster
2008 – Top-level Apache project; first commercial distribution
2009 – Hive, Pig and HBase graduate
2010–2011 – Further commercial distributions
2012 – Impala, the first real-time query engine
2013 – Hadoop 2.0
5. The Birth of Hadoop
"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such."
– Doug Cutting, Creator of Hadoop
6. Hadoop: The Big Data Platform
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
7. Core Hadoop Concepts
• Applications are written in high-level code
– Developers don't need to worry about network programming and dependencies
• Minimal communication between the nodes
– Shared-nothing architecture
• Move compute to storage, not the opposite
– Computation happens locally on each machine
– No need to move data around
• Failure is accepted and tolerated
– Data is replicated multiple times across different machines
8. Hadoop Then…
• Storage – Hadoop Distributed File System (HDFS)
• Programming Framework – MapReduce (batch)
[Diagram: a layered stack showing Storage (HDFS), Resource Management, the Batch MR programming framework, and Storage Integration]
10. What is HDFS?
• Distributed file system
• Breaks large files into smaller blocks that are stored on clusters of nodes
• Master-slave architecture
• Processes:
– NameNode (Master)
– Standby NameNode (Master)
– DataNode (Slave)
[Diagram: a NameNode and a Standby NameNode managing four DataNodes]
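To make block storage concrete, here is a minimal sketch in plain Python (not actual HDFS code) of how a large file is divided into fixed-size blocks; the 128 MB block size is a common HDFS default, and the function name is ours.

```python
# Illustrative sketch only: HDFS itself does this internally.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs describing each block of a file.

    Every block is full-size except possibly the last one, which
    holds whatever remains of the file.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks and one 44 MB remainder.
print(split_into_blocks(300 * 1024 * 1024))
```

Each of these blocks would then be replicated across several DataNodes, while the NameNode keeps only the block-to-node mapping.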
12. What is MapReduce (MRv1)?
• Programming framework
• Breaks processing into 2 phases:
– Map phase
– Reduce phase
• Master-slave architecture
• Processes:
– JobTracker (Master)
– TaskTracker (Slave)
[Diagram: a JobTracker managing multiple TaskTrackers]
14. MapReduce: The Mapper
• A function that performs the map phase
• Each mapper usually operates on a single HDFS block
• Takes a key and value as input and can generate multiple keys and values as output
• <k1,v1> → list(<k2,v2>)
• The output of all mappers is then sorted by key
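The mapper contract above can be sketched in plain Python (rather than the Hadoop Java API); the word-count job used later in this deck is a natural instance:

```python
# A minimal word-count mapper sketch: one (key, value) pair in,
# a list of (key, value) pairs out.
def mapper(key, value):
    """key: byte offset of the line within the block;
    value: the line of text.
    Emits (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in value.split()]

print(mapper(0, "I will arise and go now"))
# → [('i', 1), ('will', 1), ('arise', 1), ('and', 1), ('go', 1), ('now', 1)]
```

Note the shape matches <k1,v1> → list(<k2,v2>): a single input pair yields a list of output pairs, which the framework then sorts by key.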
15. MapReduce: The Reducer
• A function that performs the reduce phase
• Each reducer operates on a portion of the output of all mappers
• Takes a key with a list of all its values as input and generates an aggregate of the values for each key
• <k2,list(v2)> → list(<k3,v3>)
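The reducer side of the contract, again as a plain-Python sketch, assuming the framework has already sorted the mapper output and grouped all values by key:

```python
# A minimal word-count reducer sketch: one key with all its values in,
# one aggregated (key, value) pair out.
def reducer(key, values):
    """key: a word; values: every count emitted for that word."""
    return (key, sum(values))

print(reducer("go", [1, 1, 1]))
# → ('go', 3)
```

This matches <k2,list(v2)> → list(<k3,v3>): the grouped values for one key are collapsed into an aggregate.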
17. HDFS & MapReduce Example: Word Count
Original File:
I will arise and go now, and go to Innisfree,
And a small cabin build there, of clay and wattles made:
Nine bean-rows will I have there, a hive for the honey-bee;
And live alone in the bee-loud glade.
And I shall have some peace there, for peace comes dropping slow,
Dropping from the veils of the morning to where the cricket sings;
There midnight's all a glimmer, and noon a purple glow,
And evening full of the linnet's wings.
I will arise and go now, for always night and day
I hear lake water lapping with low sounds by the shore;
While I stand on the roadway, or on the pavements grey,
I hear it in the deep heart's core.
[Diagram: the file is stored on HDFS in three blocks; each block is processed by its own Map task, the mapper output is sorted by key, and Reduce tasks aggregate it into the final word counts (Output)]
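The flow above can be simulated end to end in plain Python — a sketch of the map, shuffle/sort, and reduce steps, not actual Hadoop code (in a real job the framework handles the shuffle and runs the tasks in parallel across the cluster):

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for each word, stripping simple punctuation."""
    return [(w.lower().strip(".,;:'"), 1) for w in line.split()]

def shuffle(mapped):
    """Group all mapper output by key, as handed to the reducers."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Aggregate all counts emitted for one word."""
    return (key, sum(values))

lines = [
    "I will arise and go now, and go to Innisfree,",
    "And a small cabin build there, of clay and wattles made:",
]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["and"], counts["go"])
# → 4 2
```

Each Map task would run against one HDFS block, and each Reduce task would receive a disjoint subset of the keys, so the whole job scales with the number of nodes.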
19. Querying Data in Hadoop
Apache Hive
• Developed at Facebook
• Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Provides a mechanism to project structure onto data and query it using a SQL-like language called HiveQL
Apache Pig
• Developed at Yahoo!
• High-level platform for creating MapReduce programs used with Hadoop
• Has a language called Pig Latin
• Can be extended with UDFs written in Java, Python and other languages
20. Hadoop Ecosystem
• Avro: a data serialization system
• Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• HBase: a scalable, distributed database that supports structured data storage for large tables
• Mahout: a scalable machine learning and data mining library
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs
• Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
• Zookeeper: a high-performance coordination service for distributed applications
21. Yet Another Resource Negotiator (YARN)
– Also known as MapReduce v2
– New framework that facilitates writing arbitrary distributed processing frameworks and applications
– Splits up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons
– Can run applications that do not follow the MapReduce model
22. Learn Hadoop
• Download the Cloudera QuickStart VM
– http://bit.ly/1b00iZj
– Makes it easy for you to get started with Hadoop
– Cloudera Distribution including Apache Hadoop (CDH)
– With Cloudera Manager, Cloudera Impala, and Cloudera Search, this virtual machine includes everything you need
• Formal training as Developer, Administrator, Analyst and other roles
• Free courseware on Udacity: Introduction to Hadoop and MapReduce
– https://www.udacity.com/course/ud617
23. Other Hadoop Resources
Apache Project Websites
• Hadoop: http://hadoop.apache.org/
• Hive: http://hive.apache.org/
• Pig: http://pig.apache.org/
• Sqoop: http://sqoop.apache.org/
• Flume: http://flume.apache.org/
Original GFS and MapReduce Papers
• GFS: http://bit.ly/VZk9VL
• MapReduce: http://bit.ly/8VDMHO
24. Community
A community of Hadoop professionals and users in the region
meetup.com/Hadoop-User-Group-UAE/
Editor's notes
• In a nutshell, Hadoop grew out of research at Google, was adopted by the open-source community, and supported by heavyweights such as Yahoo!, Facebook and others. It had 6 years to mature.
• No, it's not Charles Darwin.
• Hadoop was named after the creator's son's toy elephant.
• So what is Hadoop?
• Brief description of the operation of HDFS. There are 3 main components (daemons) in HDFS: NameNode, DataNode, and Secondary NameNode.
• There are 2 main components (daemons) in MapReduce: JobTracker and TaskTracker.
• MapReduce is composed of Map tasks and Reduce tasks. Those tasks run in parallel and do not depend on each other's output.
• The major resources to start learning more about Hadoop.
• Also recommended is reading the research papers from Google that spurred the whole Hadoop ecosystem (by Sanjay Ghemawat and Jeff Dean).
• Q&A with the famous Hadoop elephant mascot.