Hadoop Ecosystem
Mohamed khouder
mohamedkhouder0100@Hotmail.com
mohamedkhouder0100@Gmail.com
1
Hadoop Ecosystem
2
Overview
The Hadoop Ecosystem
Hadoop core components
 HDFS
 Map Reduce
Other Hadoop ecosystem components
 Hive
 Pig
 Impala
 Sqoop
3
HDFS
Hadoop Distributed File System (HDFS) is designed to reliably
store very large files across machines in a large cluster.
It is inspired by the Google File System.
It distributes large data files into blocks.
Blocks are managed by different nodes in the cluster.
Each block is replicated on multiple nodes.
The name node stores metadata information about files and blocks.
4
Hadoop Distributed File System (HDFS)
5
Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks
- Each block is replicated N times (default = 3)
[Diagram: File F split into blocks 1–5, 64 MB each.]
HDFS Architecture
HDFS consists of a name node and data nodes.
Name node: remembers where the data is stored in the cluster.
Data node: stores the actual data in the cluster.
The name node is the master node, through which clients must initiate reads and writes.
6
Name node
Has metadata information about each file: file name, permissions, directory.
Knows which nodes contain which blocks.
A disk backup of the metadata is very important: if you lose the name node, you lose HDFS.
7
HDFS Architecture
8
HDFS: Comparing Versions
Disaster recovery: in HDFS 1.0 the name node is a single point of failure; HDFS 2.0 adds name node high availability.
Resource management: HDFS 1.0 couples the Resource Manager with MapReduce; HDFS 2.0 uses the Resource Manager with YARN.
Scalability and performance: HDFS 1.0 suffers with larger clusters; HDFS 2.0 scales well with larger clusters.
9
Fault Tolerance
HDFS was built under the premise that hardware will fail.
It ensures that when hardware fails, users can still access their data.
This is achieved by storing multiple copies of the data throughout the cluster.
10
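The replication factor behind this scheme is a cluster-wide setting. A minimal hdfs-site.xml sketch: dfs.replication is the standard property name, and the value shown is just the default mentioned above.
&lt;configuration&gt;
  &lt;property&gt;
    &lt;!-- number of copies kept for each block (default = 3) --&gt;
    &lt;name&gt;dfs.replication&lt;/name&gt;
    &lt;value&gt;3&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;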
Fault Tolerance
11
MapReduce
12
Hadoop Architecture
• Execution engine (Map Reduce)
13
What’s MapReduce?
A programming model for expressing distributed computations at a massive scale.
A patented software framework introduced by Google.
Processes 20 petabytes of data per day.
Popularized by the open-source Hadoop project.
Used at Yahoo!, Facebook, Amazon, …
14
MapReduce: High Level
[Diagram: a MapReduce job submitted by a client computer goes to the Job Tracker, which distributes work to Task Trackers, each running task instances.]
15
MapReduce core functionality (I)
 Code is usually written in Java, though it can be written in other languages with the Hadoop Streaming API.
 Two fundamental components:
• Map step:
 The master node takes a large problem and slices it into smaller sub-problems, distributing these to worker nodes.
 A worker node may do this again if necessary.
 The worker processes the smaller problem and hands the result back to the master.
• Reduce step:
 The master node takes the answers to the sub-problems and combines them in a predefined way to get the output/answer to the original problem.
16
Hadoop data flow
17
Input reader
The input reader reads a block and divides it into splits.
Each split is sent to a map function; a line is one input of a map function.
The key could be some internal number (filename – block id – line id).
The value is the content of the textual line.
Apple Orange Mongo
Orange Grapes Plum
Apple Plum Mongo
Apple Apple Plum
Block 1
Block 2
Apple Orange Mongo
Orange Grapes Plum
Apple Plum Mongo
Apple Apple Plum
Input reader
18
Mapper: map function
The mapper takes the output generated by the input reader
and outputs a list of intermediate <key, value> pairs.
Apple Orange Mongo
Orange Grapes Plum
Apple Plum Mongo
Apple Apple Plum
Apple, 1
Orange, 1
Mongo, 1
Orange, 1
Grapes, 1
Plum, 1
Apple, 1
Plum, 1
Mongo, 1
Apple, 1
Apple, 1
Plum, 1
mapper
m1
m2
m3
m4
19
Reducer: reduce function
The reducer takes the output generated by the mapper,
aggregates the values for each key,
and outputs the final result.
Apple, 1
Orange, 1
Mongo, 1
Orange, 1
Grapes, 1
Plum, 1
Apple, 1
Plum, 1
Mongo, 1
Apple, 1
Apple, 1
Plum, 1
Apple, 1
Apple, 1
Apple, 1
Apple, 1
Orange, 1
Orange, 1
Grapes, 1
Mongo, 1
Mongo, 1
Plum, 1
Plum, 1
Plum, 1
Apple, 4
Orange, 2
Grapes, 1
Mongo, 2
Plum, 3
reducer
shuffle/sort
r1
r2
r3
r4
r5
There is shuffle/sort before reducing.
20
Execute MapReduce on a
cluster of machines with HDFS
21
Execution
22
MapReduce: Execution Details
Input reader
Divide input into splits, assign each split to a Map task.
Map task
Apply the Map function to each record in the split.
Each Map function returns a list of (key, value) pairs.
Shuffle/Partition and Sort
Shuffle distributes sorting & aggregation to many reducers.
All records for key k are directed to the same reduce processor.
Sort groups the same keys together, and prepares for aggregation.
Reduce task
Apply the Reduce function to each key.
The result of the Reduce function is a list of (key, value) pairs.
23
MapReduce Phases
Deciding on what will be the key and what will be the value → the developer’s responsibility
24
MAP(k, v) → list(k′, v′)
REDUCE(k′, list(v′)) → result
MapReduce – Group AVG Example
NewYork, US, 10
LosAngeles, US,40
London, GB, 20
Berlin, DE, 60
Glasgow, GB, 10
Munich, DE, 30
…
DE,45
GB,15
US,25
(US,10)
(US,40)
(GB,20)
(GB,10)
(DE,60)
(DE,30)
(US,10)
(US,40)
(GB,20)
(GB,10)
(DE,60)
(DE,30)
Input Data Intermediate
(K,V)-Pairs
Result
25
Map-Reduce Execution Engine
(Example: Color Count)
[Diagram: input blocks on HDFS feed four map tasks; each parse-hashes its records and produces (k, v) pairs such as (color, 1); after shuffle & sorting based on k, three reduce tasks consume (k, [v]) pairs such as (color, [1,1,1,1,1,1, …]) and produce (k′, v′) results such as (color, 100).]
Users only provide the “Map” and “Reduce” functions
26
Properties of MapReduce Engine
JobTracker is the master node (runs with the namenode)
Receives the user’s job
Decides on how many tasks will run (number of mappers)
Decides on where to run each mapper (concept of locality)
• This file has 5 blocks → run 5 map tasks
• Where to run the task reading block “1”?
• Try to run it on Node 1 or Node 3
[Diagram: the file’s blocks spread across Node 1, Node 2, and Node 3.]
27
Properties of MapReduce Engine (Cont’d)
 TaskTracker is the slave node (runs on each datanode)
 Receives the task from JobTracker
 Runs the task until completion (either map or reduce task)
 Always in communication with the JobTracker reporting progress
[Diagram: four map tasks, each with a parse-hash step, feeding three reduce tasks.]
In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks.
28
Example -Word count
Hello
Cloud
TA cool
Hello
TA
cool
Input
Mapper
Mapper
Mapper
Hello [1,1]
TA [1,1]
Cloud [1]
cool [1,1] Reducer
Reducer
Hello 2
TA 2
Cloud 1
cool 2
Hello 1
TA 1
Cloud 1
Hello 1
cool 1
cool 1
TA 1
Hello 1
Hello 1
TA 1
TA 1
Cloud 1
cool 1
cool 1
Sort/Copy
Merge
Output
29
Example 1: Word Count
 Job: Count the occurrences of each word in a data set
[Diagram: map tasks feeding reduce tasks.]
30
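A minimal Java sketch of this job against the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); class and variable names are illustrative, not taken from the slides.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map step: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: after shuffle/sort, sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional; see "Word Count with Combiner"
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}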
Example 2: Color Count
Job: Count the number of each color in a data set
[Diagram: as in the execution-engine slide, four map tasks parse-hash the input blocks on HDFS and produce (color, 1) pairs; after shuffle & sorting based on k, three reduce tasks consume (color, [1,1,1,1,1,1, …]) and produce the per-color counts, written as Part0001–Part0003.]
That’s the output file; it has 3 parts, probably on 3 different machines.
31
Example 3: Color Filter
Job: Select only the blue and the green colors
[Diagram: four map tasks read input blocks from HDFS; each map task selects only the blue or green records, producing (k, v) pairs written directly to HDFS as Part0001–Part0004. No reduce phase is needed.]
That’s the output file; it has 4 parts, probably on 4 different machines.
32
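A hedged Java sketch of such a map-only filter, reusing the driver pattern from the word-count sketch above but calling job.setNumReduceTasks(0), so there is no shuffle or reduce phase; reading each record as a bare color name is an assumption about the input encoding.
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only filter: emit only blue/green records; with zero reducers,
// each mapper writes its own Part file directly to HDFS.
public class ColorFilterMapper extends Mapper<Object, Text, Text, NullWritable> {
  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String color = value.toString().trim(); // assumes one color name per record
    if (color.equals("blue") || color.equals("green")) {
      context.write(value, NullWritable.get());
    }
  }
}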
Word Count Execution
the quick
brown fox
the fox ate
the mouse
how now
brown cow
Map
Map
Map
Reduce
Reduce
brown, 2
fox, 2
how, 1
now, 1
the, 3
ate, 1
cow, 1
mouse, 1
quick, 1
the, 1
brown, 1
fox, 1
quick, 1
the, 1
fox, 1
the, 1
how, 1
now, 1
brown, 1
ate, 1
mouse, 1
cow, 1
Input Map Shuffle & Sort Reduce Output
33
Word Count with Combiner
Input Map & Combine Shuffle & Sort Reduce Output
the quick
brown fox
the fox ate
the mouse
how now
brown cow
Map
Map
Map
Reduce
Reduce
brown, 2
fox, 2
how, 1
now, 1
the, 3
ate, 1
cow, 1
mouse, 1
quick, 1
the, 1
brown, 1
fox, 1
quick, 1
the, 2
fox, 1
how, 1
now, 1
brown, 1
ate, 1
mouse, 1
cow, 1
34
Apache Hive
35
Hive:A data warehouse on Hadoop
36
Why Hive?
Problem: data, data, and more data
200 GB per day in March 2008, growing to 1 TB compressed per day today.
The Hadoop experiment
Problem: Map/Reduce (MR) is great, but not everyone is a Map/Reduce expert.
“I know SQL, and I am a Python and PHP expert.”
So what do we do? HIVE
37
What is HIVE?
• A system for querying and managing structured data, built on top of Map/Reduce and Hadoop.
• MapReduce (MR) is very low level and requires customers to write custom programs.
• HIVE supports queries expressed in an SQL-like language called HiveQL, which are compiled into MR jobs executed on Hadoop.
• Data model:
• Hive structures data into well-understood database concepts such as tables, rows, and columns.
• It supports primitive types: integers, floats, doubles, and strings.
38
Hive Components
 Shell interface: like the MySQL shell
 Driver:
 Session handles, fetch, execution
 Compiler:
 Parse, plan, optimize.
 Execution engine:
 DAG of stages; runs map or reduce jobs.
39
Hive Architecture
40
Architecture
[Diagram: Web UI, Hive CLI, and JDBC/ODBC clients (browse, query, DDL) reach Hive through the Thrift API; Hive QL passes through a parser, planner, and optimizer before executing as Map Reduce jobs over HDFS; the MetaStore holds table metadata; pluggable SerDes (CSV, Thrift, Regex), UDFs/UDAFs (substr, sum, average), file formats (TextFile, SequenceFile, RCFile), and user-defined map-reduce scripts extend the stack.]
41
Hive Metastore
 Stores Hive metadata.
 Default metastore database uses Apache Derby.
 Various configurations:
 Embedded (in-process metastore, in-process database)
 Mainly for unit tests.
 Only one process can connect to the metastore at a time.
 Local (in-process metastore, out-of-process database)
 Each Hive client connects to the metastore directly.
 Remote (out-of-process metastore, out-of-process database)
 Each Hive client connects to a metastore server, which connects to the
metadata database itself.
 Metastore server and client communicate using the Thrift protocol.
42
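For the remote configuration, clients locate the metastore server through the hive.metastore.uris property; a minimal hive-site.xml sketch, with a placeholder hostname and the conventional default port:
&lt;configuration&gt;
  &lt;property&gt;
    &lt;!-- Thrift endpoint of the remote metastore server (hostname is a placeholder) --&gt;
    &lt;name&gt;hive.metastore.uris&lt;/name&gt;
    &lt;value&gt;thrift://metastore-host:9083&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;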
Hive Warehouse
 Hive tables are stored in the Hive “warehouse”.
Default HDFS location: /user/hive/warehouse.
 Tables are stored as sub-directories in the warehouse directory.
 Partitions are subdirectories of tables.
 External tables are supported in Hive.
 The actual data is stored in flat files.
43
Hive Schemas
 Hive is schema-on-read:
o Schema is only enforced when the data is read (at query time).
o Allows greater flexibility: the same data can be read using multiple schemas.
 Contrast with an RDBMS, which is schema-on-write:
o Schema is enforced when the data is loaded.
o Speeds up queries at the expense of load times.
44
Data Hierarchy
 Hive is organised hierarchically into:
 Databases: namespaces that separate tables and other objects.
 Tables: homogeneous units of data with the same schema.
Analogous to tables in an RDBMS.
 Partitions: determine how the data is stored
Allow efficient access to subsets of the data.
 Buckets/clusters
For subsampling within a partition.
Join optimization.
45
HiveQL
 HiveQL / HQL provides the basic SQL-like operations:
 Select columns using SELECT.
 Filter rows using WHERE.
 JOIN between tables.
 Evaluate aggregates using GROUP BY.
 Store query results into another table.
 Download results to a local directory (i.e., export from HDFS).
 Manage tables and queries with CREATE, DROP, and ALTER.
46
Primitive DataTypes
Type Comments
TINYINT, SMALLINT, INT,
BIGINT
1, 2, 4 and 8-byte integers
BOOLEAN TRUE/FALSE
FLOAT, DOUBLE Single and double precision real numbers
STRING Character string
TIMESTAMP Unix-epoch offset or datetime string
DECIMAL Arbitrary-precision decimal
BINARY Byte array
47
Complex DataTypes
Type Comments
STRUCT
A collection of elements
If S is of type STRUCT {a INT, b INT}:
S.a returns element a
MAP
Key-value tuple
If M is a map from 'group' to GID:
M['group'] returns value of GID
ARRAY
Indexed list
If A is an array of elements ['a','b','c']:
A[0] returns 'a'
48
Create Table
 CREATE TABLE is a statement used to create a table in Hive.
 The syntax and example are as follows:
49
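A condensed sketch of the standard Hive DDL (optional clauses in brackets):
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  (col_name data_type [COMMENT col_comment], ...)
  [COMMENT table_comment]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path];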
Create Table Example
 Let us assume you need to create a table named employee using the CREATE
TABLE statement.
 The following table lists the fields and their data types in the employee table:
50
Sr.No Field Name Data Type
1 Eid int
2 Name string
3 Salary float
4 Designation string
Create Table Example
 The following query creates a table named employee.
51
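A plausible reconstruction of that query, matching the field list above; the delimiter choices are assumptions:
hive> CREATE TABLE IF NOT EXISTS employee (
        eid INT,
        name STRING,
        salary FLOAT,
        designation STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;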
 If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already exists.
 On successful creation of the table, Hive reports OK and the time taken.
Load Data
 Generally, after creating a table in SQL, we can insert data using the INSERT
statement. In Hive, we instead load data in bulk using the LOAD DATA statement.
 The syntax and example are as follows:
52
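A condensed sketch of the syntax (optional clauses in brackets):
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2, ...)];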
 LOCAL is an identifier to specify the local path; it is optional.
 OVERWRITE is optional, to overwrite the existing data in the table.
 PARTITION is optional.
Load Data Example
 We will insert the following data into the table:
a text file named sample.txt in the /home/user directory.
53
 The following query loads the given text into the table.
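A plausible reconstruction of that query, assuming the employee table and the sample.txt path above:
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
      OVERWRITE INTO TABLE employee;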
 On a successful load, Hive reports OK and the time taken.
HiveQL - Select-Where
 Given below is the syntax of the SELECT query:
54
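A condensed sketch of the syntax (optional clauses in brackets):
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];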
Select-Where Example
 Assume we have the employee table as given below, with fields named Id, Name,
Salary, Designation, and Dept. Generate a query to retrieve the employee details who
earn a salary of more than Rs 30000.
55
 The following query retrieves the employee details using the above scenario:
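A plausible reconstruction of that query against the employee table:
hive> SELECT * FROM employee WHERE salary > 30000;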
Select-Where Example
 On successful execution, the query returns the rows for employees earning more than Rs 30000.
56
HiveQL Limitations
HQL only supports equi-joins, outer joins, and left semi-joins.
Because it is only a layer over MapReduce, complex queries can be
hard to optimise.
Missing large parts of the full SQL specification:
 Correlated sub-queries.
 Sub-queries outside FROM clauses.
 Updatable or materialized views.
 Stored procedures.
58
External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
59
BrowsingTables And Partitions
Command Comments
SHOW TABLES; Show all the tables in the database
SHOW TABLES 'page.*'; Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view; Show the partitions of the page_view table
DESCRIBE page_view; List columns of the table
DESCRIBE EXTENDED page_view; More information on columns (useful only for debugging)
DESCRIBE page_view
PARTITION (ds='2008-10-31');
List information about a partition
60
Loading Data
Use LOAD DATA to load data from a file or directory
 Will read from HDFS unless LOCAL keyword is specified
 Will append data unless OVERWRITE specified
 PARTITION required if destination table is partitioned
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US')
61
Inserting Data
Use INSERT to load data from a Hive query
 Will append data unless OVERWRITE specified
 PARTITION required if destination table is partitioned
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid,
pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US';
62
Apache Pig
63
What is Apache Pig?
Pig is a high-level platform for creating MapReduce programs.
Pig is a tool/platform used to analyze large data sets, representing them as data flows.
Pig is made up of two components:
 Pig Latin.
 Runtime environment.
64
Why Apache Pig?
Programmers who are not fluent in Java often struggled to work with Hadoop, especially when writing MapReduce tasks.
 Apache Pig is a boon for all such programmers.
Using Pig Latin, programmers can perform MapReduce tasks easily
without having to type complex Java code.
Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you
are familiar with SQL.
65
Features of Pig
Rich set of operators − It provides many operators to perform operations
like join, sort, filter, etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a
Pig script if you are good at SQL.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
UDFs − Pig provides the facility to create user-defined functions in other
programming languages such as Java, and to invoke or embed them in Pig
scripts.
66
Apache Pig vs MapReduce
67
Apache Pig MapReduce
Apache Pig is a data flow language. MapReduce is a data processing paradigm.
It is a high level language. MapReduce is low level and rigid.
Performing a Join operation in Apache Pig is
pretty simple.
It is quite difficult in MapReduce to perform a
Join operation between datasets.
Any programmer with a basic knowledge of SQL
can work conveniently with Apache Pig.
Exposure to Java is a must to work with
MapReduce.
Apache Pig uses a multi-query approach, thereby
greatly reducing the length of the code.
MapReduce requires almost 20 times the number
of lines to perform the same task.
There is no need for compilation. On execution,
every Apache Pig operator is converted internally
into a MapReduce job.
MapReduce jobs have a long compilation process.
Apache Pig vs Hive
68
Apache Pig Hive
Apache Pig uses a language called Pig Latin.
It was originally created at Yahoo.
Hive uses a language called HiveQL.
It was originally created at Facebook.
Pig Latin is a data flow language. HiveQL is a query processing language.
Pig Latin is a procedural language and fits the
pipeline paradigm.
HiveQL is a declarative language.
Apache Pig can handle structured, unstructured,
and semi-structured data.
Hive is mostly for structured data.
Apache Pig - Architecture
69
Apache Pig converts scripts into a series of MapReduce jobs,
making the programmer’s job easy.
 Parser:
Checks the syntax, does type checking, and performs other
miscellaneous checks.
The output of the parser is a DAG (directed acyclic graph),
which represents the Pig Latin statements and logical operators.
 Optimizer:
The logical plan (DAG) is passed to the logical optimizer,
which carries out logical optimizations such as
projection and pushdown.
Apache Pig - Architecture
70
 Compiler:
The compiler compiles the optimized logical plan into a
series of MapReduce jobs.
 Execution engine:
Finally, the MapReduce jobs are submitted to Hadoop in
sorted order and executed, producing the desired results.
Pig Latin Data Model
71
The data model of Pig Latin is fully nested, and it allows complex
non-atomic data types such as map and tuple.
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
Pig Latin statements
72
 Basic constructs:
These statements work with relations; they include expressions and schemas.
Every statement ends with a semicolon (;).
Pig Latin statements take a relation as input and produce another relation as output.
 Pig Latin example:
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Pig Latin Data types
73
DataType Description & Example
int Represents a signed 32-bit integer. Example : 8
long Represents a signed 64-bit integer. Example : 5L
float Represents a signed 32-bit floating point. Example : 5.5F
double Represents a 64-bit floating point. Example : 10.5
chararray Represents a character array (string) in Unicode UTF-8 format. Example :‘tutorials point’
bytearray Represents a byte array (blob).
boolean Represents a Boolean value. Example : true/false.
datetime Represents a date-time. Example : 1970-01-01T00:00:00.000+00:00
biginteger Represents a Java BigInteger. Example : 60708090709
bigdecimal Represents a Java BigDecimal. Example : 185.98376256272893883
Pig Latin Complex Types
74
DataType Description & Example
Tuple A tuple is an ordered set of fields. Example : (raja, 30)
Bag A bag is a collection of tuples. Example : {(raju,30),(Mohhammad,45)}
Map A Map is a set of key-value pairs. Example : [‘name’#’Raju’,‘age’#30]
Apache Pig Filter Operator
 The FILTER operator is used to select the required tuples from a relation based on a condition.
75
 The syntax of the FILTER operator:
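A sketch of the syntax, with illustrative relation names:
grunt> relation2 = FILTER relation1 BY (condition);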
 Example:
 Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
Filter Operator Example
 And we have loaded this file into Pig with the relation name student_details as shown below.
76
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
 Now use the FILTER operator to get the details of the students who belong to the city Chennai.
 Verify the relation filter_data using the DUMP operator as shown below.
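A plausible reconstruction of those statements, assuming the student_details relation loaded above:
grunt> filter_data = FILTER student_details BY city == 'Chennai';
grunt> DUMP filter_data;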
 The output displays the tuples of filter_data whose city field is Chennai.
Apache Pig Distinct Operator
 The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
77
 The syntax of the DISTINCT operator:
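A sketch of the syntax, with illustrative relation names:
grunt> relation2 = DISTINCT relation1;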
 Example:
 Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
Distinct Operator Example
78
 Now remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator.
 Verify the relation distinct_data using the DUMP operator as shown below.
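A plausible reconstruction, assuming student_details is loaded as in the FILTER example:
grunt> distinct_data = DISTINCT student_details;
grunt> DUMP distinct_data;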
 The output displays the contents of distinct_data with the duplicate tuples removed.
Apache Pig Group Operator
 The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
79
 The syntax of the GROUP operator:
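A sketch of the syntax, with illustrative names:
grunt> grouped = GROUP relation BY key;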
 Example:
 Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
Group Operator Example
80
 let us group the records/tuples in the relation by age as shown below.
 Verify the relation group_data using the DUMP operator as shown below.
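A plausible reconstruction, assuming the same student_details relation:
grunt> group_data = GROUP student_details BY age;
grunt> DUMP group_data;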
 The output displays the contents of group_data: one tuple per age, pairing the age with a bag of the matching student tuples.
Apache Pig Join Operator
 The JOIN operator is used to combine records from two or more relations.
81
 Joins can be of the following types (see the sketch after this list):
 Self-join: joins a table with itself.
 Inner join: returns rows when there is a match in both tables.
 Left outer join: returns all rows from the left relation, even if there are no matches in the right relation.
 Right outer join: returns all rows from the right relation, even if there are no matches in the left relation.
 Full outer join: returns rows when there is a match in either relation.
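A minimal Pig Latin sketch of an inner and a left outer join; the customers/orders relations, their fields, and their paths are illustrative:
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
         as (id:int, name:chararray);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
         as (oid:int, customer_id:int, amount:int);
grunt> inner_joined = JOIN customers BY id, orders BY customer_id;
grunt> left_outer = JOIN customers BY id LEFT OUTER, orders BY customer_id;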
Impala
82
What is Impala?
Cloudera Impala is a query engine that runs on Apache Hadoop.
Its query language is similar to HiveQL.
It does not use MapReduce.
Optimized for low-latency queries.
An open-source Apache project.
Developed by Cloudera.
Much faster than Hive or Pig.
83
Comparing Pig, Hive, and Impala
Description of Feature Pig Hive Impala
SQL-based query language No Yes Yes
Schema Optional Required Required
Process data with external scripts Yes Yes No
Extensible file format support Yes Yes No
Query speed Slow Slow Fast
Accessible via ODBC/JDBC No Yes Yes
84
Apache Sqoop
85
What is Sqoop?
 A command-line interface for transferring data between relational databases and Hadoop.
 Supports incremental imports.
 Imports are used to populate tables in Hadoop.
 Exports are used to put data from Hadoop into a relational database such as SQL Server.
[Diagram: Hadoop ⇄ Sqoop ⇄ RDBMS]
86
Sqoop Import
 sqoop import \
    --connect jdbc:postgresql://hdp-master/sqoop_db \
    --username sqoop_user \
    --password postgres \
    --table cities
87
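For the incremental imports mentioned earlier, Sqoop's --incremental, --check-column, and --last-value flags fetch only new rows; a sketch against the same cities table, where the id column and the last imported value are assumptions:
sqoop import \
  --connect jdbc:postgresql://hdp-master/sqoop_db \
  --username sqoop_user \
  --password postgres \
  --table cities \
  --incremental append \
  --check-column id \
  --last-value 100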
Sqoop Export
sqoop export \
  --connect jdbc:postgresql://hdp-master/sqoop_db \
  --username sqoop_user \
  --password postgres \
  --table cities \
  --export-dir cities
88
Sqoop – Example
 An example Sqoop command to load data from MySQL into Hive:
bin/sqoop-import \
  --connect jdbc:mysql://<mysql host>:<mysql port>/db3 \
  --username <username> \
  --password <password> \
  --table <tableName> \
  --hive-table <Hive tableName> \
  --create-hive-table \
  --hive-import \
  --hive-home <hive path>
89
How Sqoop works
The dataset being transferred is broken into small blocks.
A map-only job is launched.
Each individual mapper is responsible for transferring one block of the
dataset.
90
How Sqoop works
91
92

Contenu connexe

Tendances

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 

Tendances (20)

Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
Hive
HiveHive
Hive
 
Apache hive
Apache hiveApache hive
Apache hive
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache Kylin
 
Hadoop
HadoopHadoop
Hadoop
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 

Similaire à Hadoop ecosystem

Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
Cisco Canada
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
Pallav Jha
 

Similaire à Hadoop ecosystem (20)

Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Big Data and Hadoop with MapReduce Paradigms
Big Data and Hadoop with MapReduce ParadigmsBig Data and Hadoop with MapReduce Paradigms
Big Data and Hadoop with MapReduce Paradigms
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 

Dernier

Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 

Dernier (20)

Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 

Hadoop ecosystem

  • 3. Overview The Hadoop Ecosystem Hadoop core components  HDFS  Map Reduce Other Hadoop ecosystem components  Hive  Pig  Impala  Sqoop 3
  • 4. HDFS Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the GoogleFileSystem. Distribute large data file into blocks. Blocks are managed by different nodes in the cluster. Each block is replicated on multiple nodes. Name node stored metadata information about files and blocks. 4
  • 5. Hadoop Distributed File System (HDFS) 5 Centralized namenode - Maintains metadata info about files Many datanode (1000s) - Store the actual data - Files are divided into blocks - Each block is replicated N times (Default = 3) File F 1 2 3 4 5 Blocks (64 MB)
  • 6. HDFS Consists of a Name Node and Data Node. HDFS Architecture Name node Remembers where the data is stored in the cluster. Data node Stores the actual data in the cluster. Name node Master Node which clients must initiate read/write. 6
  • 7. Has meta data information about a file. File name ,permissions, directory. Which nodes contain which blocks. Name node Disk backup of meta-data very important if you lose the name node, you lose HDFS. 7
  • 9. HDFS ComparingVersions HDFS 1.0 HDFS 2.0 disaster failure Name node single point failure Name node high availability Resource manager Resource Manager with map reduce Resource Manager withYarn Scalability and Performance Scalability and performance Suffer with larger clusters Scalability and performance do will with larger clusters 9
  • 10. HDFS was built under the premise that hardware will fail. FaultTolerance Ensure that when hardware fails / users can still have their Data available. Achieved through storing multiple Copies throughout Cluster. 10
  • 13. Hadoop Architecture • Execution engine (Map Reduce) 13
  • 14. Programming model for expressing distributed computations at a massive scale. What’s MapReduce? A patented software framework introduced by Google. Processes 20 petabytes of data per day. Popularized by open-source Hadoop project. Used atYahoo!, Facebook,Amazon, … 14
  • 15. MapReduce: High Level MapReduce job submitted by client computer Job Tracker Task Tracker Task instance Task Tracker Task instance Task Tracker Task instance 15
  • 16.  Code usually written in Java - though it can be written in other languages with the Hadoop StreamingAPI. MapReduce core functionality (I)  Two fundamental components: • Map step:  Master node takes large problem and slices it into smaller sub problems; distributes these to worker nodes.  Worker node may do this again if necessary.  Worker processes smaller problem and hands back to master. • Reduce step:  Master node takes the answers to the sub problems and combines them in a predefined way to get the output/answer to original problem. 16
  • 18. Input reader reads a block and divides into splits. Input reader Each split would be sent to a map function. a line is an input of a map function. The key could be some internal number (filename - blockid – lineid ). The value is the content of the textual line. Apple Orange Mongo Orange Grapes Plum Apple Plum Mongo Apple Apple Plum Block 1 Block 2 Apple Orange Mongo Orange Grapes Plum Apple Plum Mongo Apple Apple Plum Input reader 18
  • 19. Mapper: map function Mapper takes the output generated by input reader. output a list of intermediate <key, value> pairs. Apple Orange Mongo Orange Grapes Plum Apple Plum Mongo Apple Apple Plum Apple, 1 Orange, 1 Mongo, 1 Orange, 1 Grapes, 1 Plum, 1 Apple, 1 Plum, 1 Mongo, 1 Apple, 1 Apple, 1 Plum, 1 mapper m1 m2 m3 m4 19
  • 20. Reducer: reduce function Reducer takes the output generated by the Mapper. aggregates the value for each key, and outputs the final result. Apple, 1 Orange, 1 Mongo, 1 Orange, 1 Grapes, 1 Plum, 1 Apple, 1 Plum, 1 Mongo, 1 Apple, 1 Apple, 1 Plum, 1 Apple, 1 Apple, 1 Apple, 1 Apple, 1 Orange, 1 Orange, 1 Grapes, 1 Mongo, 1 Mongo, 1 Plum, 1 Plum, 1 Plum, 1 Apple, 4 Orange, 2 Grapes, 1 Mongo, 2 Plum, 3 reducer shuffle/sort r1 r2 r3 r4 r5 There is shuffle/sort before reducing. 20
  • 21. Execute MapReduce on a cluster of machines with HDFS 21
  • 23. MapReduce: Execution Details Input reader Divide input into splits, assign each split to a Map task. Map task Apply the Map function to each record in the split. Each Map function returns a list of (key, value) pairs. Shuffle/Partition and Sort Shuffle distributes sorting & aggregation to many reducers. All records for key k are directed to the same reduce processor. Sort groups the same keys together, and prepares for aggregation. Reduce task Apply the Reduce function to each key. The result of the Reduce function is a list of (key, value) pairs. 23
  • 24. MapReduce Phases Deciding on what will be the key and what will be the value  developer’s responsibility 24
  • 25. REDUCE(k,list(vMAP(k,v) MapReduce – Group AVG Example NewYork, US, 10 LosAngeles, US,40 London, GB, 20 Berlin, DE, 60 Glasgow, GB, 10 Munich, DE, 30 … DE,45 GB,15 US,25 (US,10) (US,40) (GB,20 (GB,10 (DE,60 (DE,30 (US,10 (US,40 (GB,20 (GB,10 (DE,60 (DE,30 Input Data Intermediate (K,V)-Pairs Result 25
  • 26. Map-Reduce Execution Engine (Example: Color Count) Shuffle & Sorting based on k Reduce Reduce Reduce Map Map Map Map Input blocks on HDFS Produces (k, v) ( , 1) Parse-hash Parse-hash Parse-hash Parse-hash Consumes(k, [v]) ( , [1,1,1,1,1,1..]) Produces(k’, v’) ( , 100) Users only provide the “Map” and “Reduce” functions 26
  • 27. Properties of MapReduce Engine JobTracker is the master node (runs with the namenode) Receives the user’s job Decides on how many tasks will run (number of mappers) Decides on where to run each mapper (concept of locality) • This file has 5 Blocks  run 5 map tasks • Where to run the task reading block “1” • Try to run it on Node 1 or Node 3 Node 1 Node 2 Node 3 27
  • 28. Properties of MapReduce Engine (Cont’d)  TaskTracker is the slave node (runs on each datanode)  Receives the task from JobTracker  Runs the task until completion (either map or reduce task)  Always in communication with the JobTracker reporting progress Reduce Reduce Reduce Map Map Map Map Parse-hash Parse-hash Parse-hash Parse-hash In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks 28
  • 29. Example -Word count Hello Cloud TA cool Hello TA cool Input Mapper Mapper Mapper Hello [11] TA [11] Cloud [1] cool [11] Reducer Reducer Hello 2 TA 2 Cloud 1 cool 2 Hello 1 TA 1 Cloud 1 Hello1 cool 1 cool 1 TA 1 Hello1 Hello1 TA 1 TA 1 Cloud1 cool 1 cool 1 Sort/Copy Merge Output 29
  • 30. Example 1:Word Count Map Tasks Reduce Tasks  Job: Count the occurrences of each word in a data set 30
  • 31. Example 2: Color Count Shuffle & Sorting based on k Reduce Reduce Reduce Map Map Map Map Input blocks on HDFS Produces (k, v) ( , 1) Parse-hash Parse-hash Parse-hash Parse-hash Consumes(k, [v]) ( , [1,1,1,1,1,1..]) Produces(k’, v’) ( , 100) Job: Count the number of each color in a data set Part0003 Part0002 Part0001 That’s the output file, it has 3 parts on probably 3 different machines 31
  • 32. Example 3: Color Filter Job: Select only the blue and the green colors Input blocks on HDFS Map Map Map Map Produces (k, v) ( , 1) Write to HDFS Write to HDFS Write to HDFS Write to HDFS • Each map task will select only the blue or green colors • No need for reduce phase Part0001 Part0002 Part0003 Part0004 That’s the output file, it has 4 parts on probably 4 different machines 32
  • 33. Word Count Execution the quick brown fox the fox ate the mouse how now brown cow Map Map Map Reduce Reduce brown, 2 fox, 2 how, 1 now, 1 the, 3 ate, 1 cow, 1 mouse, 1 quick, 1 the, 1 brown, 1 fox, 1 quick, 1 the, 1 fox, 1 the, 1 how, 1 now, 1 brown, 1 ate, 1 mouse, 1 cow, 1 Input Map Shuffle & Sort Reduce Output 33
  • 34. Word Count with Combiner Input Map & Combine Shuffle & Sort Reduce Output the quick brown fox the fox ate the mouse how now brown cow Map Map Map Reduce Reduce brown, 2 fox, 2 how, 1 now, 1 the, 3 ate, 1 cow, 1 mouse, 1 quick, 1 the, 1 brown, 1 fox, 1 quick, 1 the, 2 fox, 1 how, 1 now, 1 brown, 1 ate, 1 mouse, 1 cow, 1 34
  • 36. Hive:A data warehouse on Hadoop 36
  • 37. Why Hive? Problem: Data, data and more data 200GB per day in March 2008 back to 1TB compressed per day today The Hadoop Experiment Problem: Map/Reduce (MR) is great but every one is not a Map/Reduce expert. I know SQL and I am a python and php expert. So what do we do: HIVE 37
  • 38. • A system for querying and managing structured data built on top of Map/Reduce and Hadoop. • MapReduce (MR) is very low level and requires customers to write custom programs. • HIVE supports queries expressed in SQL-like language called HiveQL which are compiled into MR jobs that are executed on Hadoop. What is HIVE? • Data model • Hive structures data into well-understood database concepts such as: tables, rows, columns. • It supports primitive types: integers, floats, doubles, and strings. 38
  • 39. Hive Components  Shell Interface: Like the MySQL shell  Driver:  Session handles, fetch, execution  Complier:  Prarse,plan,optimize.  Execution Engine:  DAG stage,Run map or reduce. 39
  • 41. HDFS Map Reduce Web UI + Hive CLI + JDBC/ODBC Browse, Query, DDL MetaStore Thrift API Hive QL Parser Planner Optimizer Execution SerDe CSV Thrift Regex UDF/UDAF substr sum average FileFormats TextFile SequenceFile RCFile User-defined Map-reduce Scripts Architecture 41
  • 42. Hive Metastore  Stores Hive metadata.  Default metastore database uses Apache Derby.  Various configurations:  Embedded(in-process metastore, in-process database)  Mainly for unit tests.  only one process can connect to the metastore at a time.  Local (in-process metastore, out-of-process database)  Each Hive client connects to the metastore directly  Remote (out-of-process metastore, out-of-process database)  Each Hive client connects to a metastore server, which connects to the metadata database itself.  Metastore server and Clint communicate usingThrift Protocol. 42
  • 43. HiveWarehouse  Hive tables are stored in the Hive “warehouse”. Default HDFS location: /user/hive/warehouse.  Tables are stored as sub-directories in the warehouse directory.  Partitions are subdirectories of tables.  External tables are supported in Hive.  The actual data is stored in flat files. 43
  • 44. Hive Schemas  Hive is schema-on-read oSchema is only enforced when the data is read (at query time) oAllows greater flexibility: same data can be read using multiple schemas  Contrast with an RDBMS, which is schema-on-write oSchema is enforced when the data is loaded. oSpeeds up queries at the expense of load times. 44
  • 45. Data Hierarchy  Hive is organised hierarchically into:  Databases: namespaces that separate tables and other objects.  Tables: homogeneous units of data with the same schema. Analogous to tables in an RDBMS.  Partitions: determine how the data is stored Allow efficient access to subsets of the data.  Buckets/clusters For subsampling within a partition. Join optimization. 45
  • 46. HiveQL  HiveQL / HQL provides the basic SQL-like operations:  Select columns using SELECT.  Filter rows usingWHERE.  JOIN between tables.  Evaluate aggregates using GROUP BY.  Store query results into another table.  Download results to a local directory (i.e., export from HDFS).  Manage tables and queries with CREATE, DROP, and ALTER. 46
  • 47. Primitive DataTypes Type Comments TINYINT, SMALLINT, INT, BIGINT 1, 2, 4 and 8-byte integers BOOLEAN TRUE/FALSE FLOAT, DOUBLE Single and double precision real numbers STRING Character string TIMESTAMP Unix-epoch offset or datetime string DECIMAL Arbitrary-precision decimal BINARY 47
  • 48. Complex DataTypes Type Comments STRUCT A collection of elements If S is of type STRUCT {a INT, b INT}: S.a returns element a MAP Key-value tuple If M is a map from 'group' to GID: M['group'] returns value of GID ARRAY Indexed list IfA is an array of elements ['a','b','c']: A[0] returns 'a' 48
  • 49. CreateTable  CreateTable is a statement used to create a table in Hive.  The syntax and example are as follows: 49
  • 50. CreateTable Example  Let us assume you need to create a table named employee using CREATE TABLE statement.  The following table lists the fields and their data types in employee table: 50 Sr.No Field Name Data Type 1 Eid int 2 Name String 3 Salary Float 4 Designation string
  • 51. CreateTable Example  The following query creates a table named employee. 51  If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already exists.  On successful creation of table, you get to see the following response:
  • 52. Load Data  we can insert data using the Insert statement. But in Hive, we can insert data using the LOAD DATA statement.  The syntax and example are as follows: 52  LOCAL is identifier to specify the local path. It is optional.  OVERWRITE is optional to overwrite the data in the table.  PARTITION is optional.
  • 53. Load Data Example  We will insert the following data into the table. It is a text file named sample.txt in /home/user directory. 53  The following query loads the given text into the table.  On successful download, you get to see the following response:
  • 54. HiveQL - Select-Where  Given below is the syntax of the SELECT query: 54
  • 55. Select-Where Example  Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who earn a salary of more than Rs 30000. 55  The following query retrieves the employee details using the above scenario:
  • 56. Select-Where Example  On successful execution of the query, you get to see the following response: 56
  • 57. HiveQL - Select-Where  Given below is the syntax of the SELECT query: 57
  • 58. HiveQL Limitations HQL only supports equi-joins, outer joins, left semi-joins. Because it is only a shell for mapreduce, complex queries can be hard to optimise. Missing large parts of full SQL specification:  Correlated sub-queries.  Sub-queries outside FROM clauses.  Updatable or materialized views.  Stored procedures. 58
  • 59. External Table CREATE EXTERNAL TABLE page_view_stg (viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User') ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE LOCATION '/user/staging/page_view'; 59
  • 60. Browsing Tables And Partitions 60
 SHOW TABLES;                                     Show all the tables in the database
 SHOW TABLES 'page.*';                            Show tables matching the specification (uses regex syntax)
 SHOW PARTITIONS page_view;                       Show the partitions of the page_view table
 DESCRIBE page_view;                              List columns of the table
 DESCRIBE EXTENDED page_view;                     More information on columns (useful only for debugging)
 DESCRIBE page_view PARTITION (ds='2008-10-31');  List information about a partition
  • 61. Loading Data Use LOAD DATA to load data from a file or directory.  Will read from HDFS unless the LOCAL keyword is specified.  Will append data unless OVERWRITE is specified.  PARTITION is required if the destination table is partitioned. 61
 LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
 OVERWRITE INTO TABLE page_view
 PARTITION (date='2008-06-08', country='US');
  • 62. Inserting Data Use INSERT to load data from a Hive query.  Will append data unless OVERWRITE is specified.  PARTITION is required if the destination table is partitioned. 62
 FROM page_view_stg pvs
 INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='US')
 SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
 WHERE pvs.country = 'US';
  • 64. What is Apache Pig? Pig is a high-level platform for creating MapReduce programs. Pig is a tool/platform used to analyze larger sets of data by representing them as data flows. Pig is made up of two components: Pig Latin.  Runtime environment. 64
  • 65. Why Apache Pig? Programmers who are not proficient in Java often struggled to work with Hadoop, especially when performing MapReduce tasks.  Apache Pig is a boon for all such programmers. Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex Java code. Pig Latin is an SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL. 65
  • 66. Features of Pig Rich set of operators − provides many operators to perform operations like join, sort, filter, etc. Ease of programming − Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL. Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS. UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java, and to invoke or embed them in Pig scripts. 66
  • 67. Apache Pig vs MapReduce 67
 Apache Pig is a data flow language; MapReduce is a data processing paradigm.
 Pig is a high-level language; MapReduce is low level and rigid.
 Performing a join in Apache Pig is pretty simple; joining datasets in MapReduce is quite difficult.
 Any programmer with a basic knowledge of SQL can work conveniently with Apache Pig; exposure to Java is a must for MapReduce.
 Apache Pig's multi-query approach greatly reduces code length; MapReduce requires almost 20 times more lines for the same task.
 Pig needs no compilation: on execution, every operator is converted internally into a MapReduce job; MapReduce jobs have a long compilation process.
  • 68. Apache Pig vs Hive 68
 Apache Pig uses a language called Pig Latin, originally created at Yahoo; Hive uses a language called HiveQL, originally created at Facebook.
 Pig Latin is a data flow language; HiveQL is a query processing language.
 Pig Latin is a procedural language and fits the pipeline paradigm; HiveQL is a declarative language.
 Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.
  • 69. Apache Pig - Architecture 69 Apache Pig converts scripts into a series of MapReduce jobs, making the programmer's job easy.  Parser: checks the syntax, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.  Optimizer: the logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
  • 70. Apache Pig - Architecture 70  Compiler: compiles the optimized logical plan into a series of MapReduce jobs.  Execution engine: the MapReduce jobs are submitted to Hadoop in sorted order and executed there, producing the desired results.
  • 71. Pig Latin Data Model 71 The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple. • A bag is a collection of tuples. • A tuple is an ordered set of fields. • A field is a piece of data.
  • 72. Pig Latin statements 72  Basic constructs: these statements work with relations; they include expressions and schemas. Every statement ends with a semicolon (;). Pig Latin statements take a relation as input and produce another relation as output.  Pig Latin example:
 grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
        as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
  • 73. Pig Latin Data types 73
 int         Signed 32-bit integer. Example: 8
 long        Signed 64-bit integer. Example: 5L
 float       Signed 32-bit floating point. Example: 5.5F
 double      64-bit floating point. Example: 10.5
 chararray   Character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
 bytearray   Byte array (blob).
 boolean     Boolean value. Example: true/false
 datetime    Date-time. Example: 1970-01-01T00:00:00.000+00:00
 biginteger  Java BigInteger. Example: 60708090709
 bigdecimal  Java BigDecimal. Example: 185.98376256272893883
  • 74. Pig Latin Complex Types 74
 tuple  An ordered set of fields. Example: (raja, 30)
 bag    A collection of tuples. Example: {(raju,30),(Mohhammad,45)}
 map    A set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
  • 75. Apache Pig Filter Operator  The FILTER operator is used to select the required tuples from a relation based on a condition. 75  The syntax of the FILTER operator is sketched below.  Example:  Assume that we have a file named student_details.txt in the HDFS directory /pig_data/.
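A minimal sketch of the FILTER syntax; the relation and condition names are placeholders:

  grunt> Relation2_name = FILTER Relation1_name BY (condition);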
  • 76. Filter Operator Example  We have loaded this file into Pig with the relation name student_details as shown below. 76
 grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
        as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
 Now use the FILTER operator to get the details of the students who belong to the city Chennai, and verify the relation filter_data using the DUMP operator, as sketched below.  This produces output displaying the contents of the relation filter_data.
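A sketch of those two statements, matching the column names declared above:

  grunt> filter_data = FILTER student_details BY city == 'Chennai';
  grunt> DUMP filter_data;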
  • 77. Apache Pig Distinct Operator  The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation. 77  The syntax of the DISTINCT operator is sketched below.  Example:  Assume that we have a file named student_details.txt in the HDFS directory /pig_data/.
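A minimal sketch of the DISTINCT syntax; the relation names are placeholders:

  grunt> Relation2_name = DISTINCT Relation1_name;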
  • 78. Distinct Operator Example 78  Remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, then verify the relation distinct_data using the DUMP operator, as sketched below.  This produces output displaying the contents of the relation distinct_data.
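A sketch of those statements, assuming student_details has been loaded as in the FILTER example:

  grunt> distinct_data = DISTINCT student_details;
  grunt> DUMP distinct_data;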
  • 79. Apache Pig Group Operator  The GROUP operator is used to group the data in one or more relations. It collects the data having the same key. 79  The syntax of the GROUP operator is sketched below.  Example:  Assume that we have a file named student_details.txt in the HDFS directory /pig_data/.
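A minimal sketch of the GROUP syntax; the names are placeholders:

  grunt> group_data = GROUP relation_name BY key_column;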
  • 80. Group Operator Example 80  Let us group the records/tuples in the relation by age, then verify the relation group_data using the DUMP operator, as sketched below.  This produces output displaying the contents of the relation group_data.
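A sketch of those statements; in the output, each tuple pairs an age value with a bag of the student tuples that share it:

  grunt> group_data = GROUP student_details BY age;
  grunt> DUMP group_data;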
  • 81. Apache Pig Join Operator  The JOIN operator is used to combine records from two or more relations. 81  Joins can be of the following types:  Self-join: joins a relation with itself.  Inner join: returns rows when there is a match in both relations.  Left outer join: returns all rows from the left relation, even if there are no matches in the right relation.  Right outer join: returns all rows from the right relation, even if there are no matches in the left relation.  Full outer join: returns rows when there is a match in either relation.  A sketch of the join syntax follows this list.
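A minimal sketch, assuming two hypothetical relations customers(id, name) and orders(oid, customer_id, amount):

  grunt> inner_join = JOIN customers BY id, orders BY customer_id;
  grunt> left_join  = JOIN customers BY id LEFT OUTER, orders BY customer_id;
  grunt> right_join = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
  grunt> full_join  = JOIN customers BY id FULL OUTER, orders BY customer_id;

For a self-join, load the same file twice under two different aliases and join the aliases.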
  • 83. What is Impala? Cloudera Impala is a query engine that runs on Apache Hadoop. Its SQL dialect is similar to HiveQL. Does not use MapReduce. Optimized for low-latency queries. Open-source Apache project. Developed by Cloudera. Much faster than Hive or Pig. 83
  • 84. Comparing Pig, Hive and Impala 84
 Feature                              Pig       Hive      Impala
 SQL-based query language             No        Yes       Yes
 Schema                               Optional  Required  Required
 Process data with external scripts   Yes       Yes       No
 Extensible file format support       Yes       Yes       No
 Query speed                          Slow      Slow      Fast
 Accessible via ODBC/JDBC             No        Yes       Yes
  • 86. What is Sqoop?  Command-line interface for transferring data between relational databases and Hadoop.  Supports incremental imports.  Imports are used to populate tables in Hadoop.  Exports are used to put data from Hadoop into a relational database such as SQL Server.  (Diagram: RDBMS <-> Sqoop <-> Hadoop) 86
  • 87. Sqoop Import 87
 sqoop import \
   --connect jdbc:postgresql://hdp-master/sqoop_db \
   --username sqoop_user \
   --password postgres \
   --table cities
  • 88. Sqoop Export 88
 sqoop export \
   --connect jdbc:postgresql://hdp-master/sqoop_db \
   --username sqoop_user \
   --password postgres \
   --table cities \
   --export-dir cities
  • 89. Sqoop – Example  An example Sqoop command to load data from MySQL into Hive: 89
 bin/sqoop-import \
   --connect jdbc:mysql://<mysql host>:<mysql port>/db3 \
   --username <username> --password <password> \
   --table <tableName> \
   --hive-table <Hive tableName> \
   --create-hive-table --hive-import \
   --hive-home <hive path>
  • 90. How Sqoop works The dataset being transferred is broken into small blocks. A map-only job is launched. Each individual mapper is responsible for transferring one block of the dataset. 90
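The degree of parallelism can be tuned on the command line; for example, reusing the import from slide 87 (the numeric id split column is an assumption about the cities table):

  sqoop import \
    --connect jdbc:postgresql://hdp-master/sqoop_db \
    --username sqoop_user --password postgres \
    --table cities \
    --split-by id \
    --num-mappers 8

Each of the 8 mappers then imports one range of id values in parallel.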