Practical Hadoop with Pig
Dave Wellman
#openwest @dwellman
How does it all work?
HDFS
Hadoop Shell
MR Data Structures
Pig Commands
Pig Example
HDFS
HDFS has 3 main actors
The Name Node
The Name Node is “The Conductor”.
It directs the performance of the cluster.
The Data Nodes:
A Data Node stores blocks of data.
Clusters can contain thousands of Data Nodes.
*Yahoo has a 40,000-node cluster.
The Client
The client is a window to the
cluster.
The Name Node
The heart of the System.
Maintains a virtual File Directory.
Tracks all the nodes.
Listens for “heartbeats” and “Block Reports”
(more on this later).
If the NameNode is down, the cluster is offline.
Storing Data
The Data Nodes
Add a Data Node:
The Data Node says “Hello” to the Name Node.
The Name Node offers the Data Node a handshake with version requirements.
The Data Node replies “Okay” or shuts down.
The Name Node hands the Data Node a NodeId that it remembers.
The Data Node is now part of the cluster and checks in with the Name Node every 3 seconds.
Data Node Heartbeat:
The “check-in” is a simple HTTP Request/Response.
This check-in is a critical communication protocol that guarantees the health of the cluster.
Block Reports: what data the node has and whether it is okay.
The Name Node controls the Data Nodes by issuing orders when they check in and report their status:
Replicate Data, Delete Data, Verify Data.
The same process applies to every node in the cluster.
Writing Data
The client tells the NameNode the virtual directory location for the file.
The client breaks the file into 64MB “blocks” (e.g. A64, B64, C28).
The client asks the NameNode where the blocks go.
The client streams the blocks, in parallel, to the DataNodes.
The DataNodes tell the NameNode they have the data via the block report.
The NameNode tells the DataNodes where to replicate each block.
Reading Data
The client tells the NameNode it would like to read a file.
The NameNode replies with the list of blocks and the nodes the blocks are on.
The client requests the first block from a DataNode.
The client compares the checksum of the block against the manifest from the NameNode.
The client moves on to the next block in the sequence until the file has been read.
Failure Recovery
A Data Node fails to “check-in”.
After 10 minutes the Name Node gives up on that Data Node.
When another node that holds blocks originally assigned to the lost node checks in, the Name Node sends it a block replication command.
That Data Node replicates the block of data (just like a write).
Interacting with Hadoop
HDFS Shell Commands
HDFS Shell Commands.
> hadoop fs -ls <args>
Same as the Unix or OS X ls command.
/user/hadoop/file1
/user/hadoop/file2
...
HDFS Shell Commands.
> hadoop fs -mkdir <path>
Creates directories in HDFS at the given path.
HDFS Shell Commands.
> hadoop fs -copyFromLocal <localsrc> URI
Copy a file from your client to HDFS.
Similar to the put command, except that the source is restricted to a local file reference.
HDFS Shell Commands.
> hadoop fs -cat <path>
Copies source paths to stdout.
HDFS Shell Commands.
> hadoop fs -copyToLocal URI <localdst>
Copy a file from HDFS to your client.
Similar to the get command, except that the destination is restricted to a local file reference.
HDFS Shell Commands.
cat
chgrp
chmod
chown
copyFromLocal
copyToLocal
cp
du
dus
expunge
get
getmerge
ls
lsr
mkdir
moveFromLocal
mv
put
rm
rmr
setrep
stat
tail
test
text
touchz
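As a quick sketch of a typical round trip (the /user/hadoop/logs directory and access.log file are made-up examples, not from the deck):
> hadoop fs -mkdir /user/hadoop/logs
> hadoop fs -copyFromLocal access.log /user/hadoop/logs/access.log
> hadoop fs -ls /user/hadoop/logs
> hadoop fs -cat /user/hadoop/logs/access.log
> hadoop fs -copyToLocal /user/hadoop/logs/access.log ./access.log.copy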
Map Reduce Data Structures
Basic, Tuples & Bags
Basic Data Types:
Strings, Integers, Doubles, Longs, Bytes, Booleans, etc.
Advanced Data Types:
Tuples and Bags
Tuples are JSON-like and simple.
raw_data: {
date_time: bytearray,
seconds: bytearray
}
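A schema like this usually comes straight from a LOAD ... AS clause. A minimal sketch (the file name minutes.tsv is hypothetical):
raw_data = LOAD 'minutes.tsv' USING PigStorage('\t')
  AS (date_time: bytearray, seconds: bytearray);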
Bags hold Tuples and Bags
element: {
date_time: bytearray,
seconds: bytearray,
group: chararray,
ordered_list: {
date: chararray,
hour: chararray,
score: long
}
}
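Nested bags like ordered_list usually show up as the output of a GROUP. A sketch, reusing the raw_data relation above:
grunt> grouped = GROUP raw_data BY date_time;
grunt> describe grouped;
grouped: {group: bytearray, raw_data: {(date_time: bytearray, seconds: bytearray)}}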
Expert Advice:
Always know your data structures.
They are the foundation for all Map Reduce operations.
Complex (deep) data structures will kill -9 performance.
Keep them simple!
Processing Data
Interacting with Pig using Grunt
GRUNT
Grunt is a command-line interface used to debug Pig jobs, similar to Ruby's IRB or the Groovy CLI.
Grunt is your best weapon against bad pigs.
pig -x local
grunt> |
GRUNT
grunt> describe Element
Describe will display the data structure of an Element.
grunt> dump Element
Dump will display the data represented by an Element.
GRUNT
grunt> describe raw_data
Produces the output:
raw_data: { date_time: bytearray, items: bytearray }
Or in a more human-readable form:
raw_data: {
date_time: bytearray,
items: bytearray
}
GRUNT
grunt> dump raw_data
You can dump terabytes of data to your screen,
so be careful.
(05/10/2011 20:30:00.0,0)
(05/10/2011 20:45:00.0,0)
(05/10/2011 21:00:00.0,0)
(05/10/2011 21:15:00.0,0)
...
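A safer pattern is to LIMIT a relation before dumping it. A minimal sketch (sample.tsv is a hypothetical input):
grunt> raw_data = LOAD 'sample.tsv' USING PigStorage('\t') AS (date_time, items);
grunt> preview = LIMIT raw_data 10;
grunt> dump preview;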
Pig Programs
Map Reduce Made Simple
Most Pig commands are assignments.
• The element names the collection of records that exist out in
the cluster.
• It’s not a traditional programming variable.
• It describes the data from the operation.
• It does not change.
Element = Operation;
The SET command
Used to set a Hadoop job property, like the name of your Pig job.
SET job.name 'Day over Day - [$input]';
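Other job-level properties are set the same way, for example (a sketch of two common ones):
SET default_parallel 20;
SET job.priority high;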
The REGISTER and DEFINE commands
-- Setup UDF jars
REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar;
DEFINE BUCKET_FORMAT_DATE
com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');
The LOAD USING command
-- load in the data from HDFS
raw_data = LOAD '$input' USING PigStorage('\t') AS (date_time, items);
The FILTER BY command
Selects tuples from a relation based on some condition.
-- filter to the week we want
broadcast_week = FILTER bucket_list BY (date >=
'03-Oct-2011') AND (date <= '10-Oct-2011');
The GROUP BY command
Groups the data in one or multiple relations.
daily_stats = GROUP broadcast_week BY (date,
hour);
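The result of a GROUP is a nested structure like the bags shown earlier: a group key plus a bag holding the original tuples. Roughly (a sketch of what describe would report, types omitted):
grunt> describe daily_stats;
daily_stats: {group: (date, hour), broadcast_week: {(date, hour, items)}}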
The FOREACH command
Generates data transformations based on columns of data.
bucket_list = FOREACH raw_data GENERATE
FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
MINUTE_BUCKET(date_time) AS hour,
MAX_ITEMS(items) AS items;
*DATE_FORMAT_DATE is a user defined function, an advanced topic we’ll come to in a minute.
The GENERATE command
Use the FOREACH GENERATE operation to work with columns
of data.
bucket_list = FOREACH raw_data GENERATE
FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
MINUTE_BUCKET(date_time) AS hour,
MAX_ITEMS(items) AS items;
The FLATTEN command
FLATTEN substitutes the fields of a tuple in place of the tuple.
traffic_stats = FOREACH daily_stats GENERATE
FLATTEN(GROUP),
COUNT(broadcast_week) AS cnt,
SUM(broadcast_week.items) AS total;
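Here FLATTEN(GROUP) turns the (date, hour) group key back into two top-level columns, so traffic_stats ends up with a flat schema roughly like this (a sketch, types of date/hour/total omitted):
traffic_stats: {group::date, group::hour, cnt: long, total}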
The STORE INTO USING command
A store function determines how data is stored after a Pig job.
-- All done, now store it
STORE final_results INTO '$output' USING
PigStorage();
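Putting the fragments from the preceding slides together, the whole job reads roughly as below. This is a sketch assembled from the deck's own snippets, not a tested script: DATE_FORMAT_DATE, MINUTE_BUCKET and MAX_ITEMS are UDFs whose DEFINE lines are not shown in the deck, and the deck's STORE writes a final_results relation that is never defined, so traffic_stats is stored here instead.
SET job.name 'Day over Day - [$input]';

-- Setup UDF jars
REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar;
DEFINE BUCKET_FORMAT_DATE com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');
-- (DATE_FORMAT_DATE, MINUTE_BUCKET and MAX_ITEMS would need similar DEFINEs)

-- load in the data from HDFS
raw_data = LOAD '$input' USING PigStorage('\t') AS (date_time, items);

-- bucket each record by date and hour
bucket_list = FOREACH raw_data GENERATE
  FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
  MINUTE_BUCKET(date_time) AS hour,
  MAX_ITEMS(items) AS items;

-- filter to the week we want
broadcast_week = FILTER bucket_list BY (date >= '03-Oct-2011') AND (date <= '10-Oct-2011');

-- roll up by (date, hour)
daily_stats = GROUP broadcast_week BY (date, hour);

traffic_stats = FOREACH daily_stats GENERATE
  FLATTEN(group),
  COUNT(broadcast_week) AS cnt,
  SUM(broadcast_week.items) AS total;

-- All done, now store it
STORE traffic_stats INTO '$output' USING PigStorage();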
Demo Time!
“Because it’s all a big lie
until someone demos the code.”
- Genghis Khan
Thank You.
- Genghis Khan