Pig Workshop
         Sudar Muthu
    http://sudarmuthu.com
http://twitter.com/sudarmuthu
   https://github.com/sudar
Who am I?


Research Engineer by profession
I mine useful information from data
You might recognize me from other HasGeek events
Blog at http://sudarmuthu.com
Builds robots as hobby ;)
Special Thanks


HasGeek
What I will not cover?


What is BigData, or why it is needed?
What is MapReduce?
What is Hadoop?
Internal architecture of Pig


    http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig
What we will see today?


What is Pig
How to use it
  Loading and storing data
  Pig Latin
  SQL vs Pig
  Writing UDF’s
Debugging Pig Scripts
Optimizing Pig Scripts
When to use Pig
So, all of you have Pig installed
             right? ;)
What is Pig?


“Platform for analyzing large
        sets of data”
Components of Pig


Pig Shell (Grunt)
Pig Language (Latin)
Libraries (Piggy Bank)
User Defined Functions (UDF)
Why Pig?


  It is a data flow language
  Provides standard data processing operations
  Insulates Hadoop complexity
  Abstracts Map Reduce
  Increases programmer productivity

… but there are cases where Pig is not suitable.
Pig Modes
For this workshop, we will be
 using Pig only in local mode
Getting to know your Pig shell
pig -x local


Similar to Python’s shell
Different ways of executing Pig
            Scripts


Inline in shell
From a file
Streaming through other executable
Embed script in other languages
Loading and Storing data


Pigs eat anything
Loading Data into Pig


file = LOAD 'data/dropbox-policy.txt' AS (line);

data = LOAD 'data/tweets.csv' USING PigStorage(',');

data = LOAD 'data/tweets.csv' USING PigStorage(',')
AS (field1, field2, field3);
Loading Data into Pig


PigStorage – for most cases
TextLoader – to load text files
JSONLoader – to load JSON files
Custom loaders – You can write your own custom
loaders as well
Viewing Data


DUMP data;



Very useful for debugging, but don’t use it on huge
datasets
Storing Data from Pig


STORE data INTO 'output_location';

STORE data INTO 'output_location' USING PigStorage();

STORE data INTO 'output_location' USING
PigStorage(',');

STORE data INTO 'output_location' USING BinStorage();
Storing Data


Similar to `LOAD`, many options are available
Can store locally or in HDFS
You can write your own custom Storage as well
Load and Store example


data = LOAD 'data/data-bag.txt' USING
PigStorage(',');

STORE data INTO 'data/output/load-store' USING
PigStorage('|');



https://github.com/sudar/pig-samples/load-store.pig
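To make the load and store example concrete, here is a rough plain-Python model of what the script does (this is not Pig). The helper names `load` and `store` and the sample rows are made up for illustration; the real contents of data/data-bag.txt are assumed to look similar.

```python
import io

def load(lines, delimiter=","):
    """Split each line into a tuple of fields, like LOAD ... USING PigStorage(delimiter)."""
    return [tuple(line.rstrip("\n").split(delimiter)) for line in lines]

def store(relation, out, delimiter="|"):
    """Write one line per tuple, like STORE ... USING PigStorage(delimiter)."""
    for row in relation:
        out.write(delimiter.join(row) + "\n")

# Invented sample rows standing in for data/data-bag.txt
source = io.StringIO("1,2,3\n1,2,4\n2,3,4\n")
data = load(source)

sink = io.StringIO()
store(data, sink)
result = sink.getvalue()
```

Without a schema the fields stay untyped strings here, which loosely mirrors Pig treating them as bytearray.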
Pig Latin
Data Types


Scalar Types
Complex Types
Scalar Types


  int, long – (32, 64 bit) integer
  float, double – (32, 64 bit) floating point
  boolean (true/false)
  chararray (String in UTF-8)
  bytearray (blob) (DataByteArray in Java)

If you don’t specify a type, bytearray is used by
default
Complex Types


tuple – ordered set of fields
(data) bag – collection of tuples
map – set of key value pairs
Tuple


 Row with one or more fields
 Fields can be of any data type
 Ordering is important
 Enclosed inside parentheses ()

Eg:
(Sudar, Muthu, Haris, Dinesh)
(Sudar, 176, 80.2F)
Bag


Set of tuples
SQL equivalent is Table
Each tuple can have different set of fields
Can have duplicates
Inner bag uses curly braces {}
Outer bag doesn’t use anything
Bag - Example


Outer bag

(1,2,3)
(1,2,4)
(2,3,4)
(3,4,5)
(4,5,6)

https://github.com/sudar/pig-samples/data-bag.pig
Bag - Example


Inner bag

(1,{(1,2,3),(1,2,4)})
(2,{(2,3,4)})
(3,{(3,4,5)})
(4,{(4,5,6)})

https://github.com/sudar/pig-samples/data-bag.pig
Map


Set of key value pairs
Similar to HashMap in Java
Key must be unique
Key must be of chararray data type
Values can be any type
Key/value is separated by #
Map is enclosed by []
Map - Example


[name#sudar, height#176, weight#80.5F]

[name#(sudar, muthu), height#176, weight#80.5F]

[name#(sudar, muthu), languages#(Java, Pig, Python)]
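For readers who think in a general-purpose language, the three complex types can be loosely compared to Python structures. This is only an analogy with invented values, not how Pig stores data internally.

```python
# Tuple: an ordered row of fields, each of any type
person = ("Sudar", 176, 80.2)

# Bag: a collection of tuples; duplicates are allowed
outer_bag = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (1, 2, 3)]

# Map: string keys pointing to values of any type
profile = {"name": ("sudar", "muthu"), "height": 176, "weight": 80.5}

# Inner bag: a bag stored as one field of a tuple
grouped_row = (1, [(1, 2, 3), (1, 2, 4)])
```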
Null


Similar to SQL
Denotes that value of data element is unknown
Any data type can be null
Schemas in Load statement


We can specify a schema (field names and data types) in a `LOAD`
statement

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS
(f1:int, f2:int, f3:int);

data = LOAD 'data/nested-schema.txt' AS
(f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
Expressions


Fields can be looked up by

  Position
  Name
  Map Lookup
Expressions - Example


data = LOAD 'data/nested-schema.txt' AS
(f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);

by_pos = FOREACH data GENERATE $0;
DUMP by_pos;

by_field = FOREACH data GENERATE f2;
DUMP by_field;

by_map = FOREACH data GENERATE f3#'name';
DUMP by_map;

https://github.com/sudar/pig-samples/lookup.pig
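The three lookup styles can be sketched in plain Python as follows. The rows are invented and only follow the shape of the schema above; `FIELDS` is a helper introduced for this sketch.

```python
# Rows shaped like (f1:int, f2:bag{tuple(n1, n2)}, f3:map[]); values are invented
FIELDS = ("f1", "f2", "f3")
data = [
    (1, [(1, 2), (3, 4)], {"name": "sudar"}),
    (2, [(5, 6)], {"name": "muthu"}),
]

by_pos = [row[0] for row in data]                     # like $0
by_field = [row[FIELDS.index("f2")] for row in data]  # like f2
by_map = [row[2].get("name") for row in data]         # like f3#'name'
```

`dict.get` returns None for a missing key, which is close in spirit to a map lookup yielding null.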
Operators
Arithmetic Operators


All usual arithmetic operators are supported

  Addition (+)
  Subtraction (-)
  Multiplication (*)
  Division (/)
  Modulo (%)
Boolean Operators


All usual boolean operators are supported

  AND
  OR
  NOT
Comparison Operators


All usual comparison operators are supported

  ==
  !=
  <
  >
  <=
  >=
Relational Operators


FOREACH
FLATTEN
GROUP
FILTER
COUNT
ORDER BY
DISTINCT
LIMIT
JOIN
FOREACH


Generates data transformations based on columns of data

x = FOREACH data GENERATE *;

x = FOREACH data GENERATE $0, $1;

x = FOREACH data GENERATE $0 AS first, $1 AS second;
FLATTEN


Un-nests tuples and bags. Flattening a bag results in a
cross product with the other fields of the row

(a, (b, c)) => (a,b,c)

({(a,b),(d,e)}) => (a,b) and (d,e)

(a, {(b,c), (d,e)}) => (a, b, c) and (a, d, e)
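The three cases above can be reproduced with a small plain-Python sketch. The function names are made up for this example; bags are modelled as lists and tuples as Python tuples.

```python
def flatten_tuple(row, index):
    """Splice the nested tuple at row[index] into the row."""
    return row[:index] + row[index] + row[index + 1:]

def flatten_bag(row, index):
    """Emit one output row per tuple in the bag at row[index]."""
    before, bag, after = row[:index], row[index], row[index + 1:]
    return [before + inner + after for inner in bag]

case1 = flatten_tuple(("a", ("b", "c")), 1)
case2 = flatten_bag(([("a", "b"), ("d", "e")],), 0)
case3 = flatten_bag(("a", [("b", "c"), ("d", "e")]), 1)
```

The third case shows the cross product: the field `a` is repeated once for every tuple in the bag.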
GROUP


   Groups data in one or more relations
   Groups tuples that have the same group key
   Similar to SQL group by operator

outerbag = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP outerbag;

innerbag = GROUP outerbag BY f1;
DUMP innerbag;

https://github.com/sudar/pig-samples/group-by.pig
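As a mental model, GROUP turns an outer bag into rows of (group key, inner bag). The plain-Python sketch below uses the tuples from the earlier bag example; `group_by` is a name invented here, and the sorted output order is a convenience of the sketch rather than something Pig promises.

```python
from collections import defaultdict

def group_by(relation, key_index):
    """Collect rows sharing the same key into one (key, bag) row."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return [(key, bag) for key, bag in sorted(groups.items())]

outerbag = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
innerbag = group_by(outerbag, 0)  # like GROUP outerbag BY f1
```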
FILTER


Selects tuples from a relation based on some condition

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS
(f1:int, f2:int, f3:int);
DUMP data;

filtered = FILTER data BY f1 == 1;
DUMP filtered;


https://github.com/sudar/pig-samples/filter-by.pig
COUNT


Counts the number of tuples in a bag, usually the bag produced by GROUP

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
grouped = GROUP data BY f2;

counted = FOREACH grouped GENERATE group, COUNT(data);
DUMP counted;


https://github.com/sudar/pig-samples/count.pig
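In plain Python, the group-then-count pattern above comes down to counting rows per key. The sketch assumes the file holds the same five tuples as the bag example.

```python
from collections import Counter

# Assumed contents of data/data-bag.txt as (f1, f2, f3)
data = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]

# GROUP data BY f2, then GENERATE group, COUNT(data)
counted = sorted(Counter(row[1] for row in data).items())
```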
ORDER BY


Sort a relation based on one or more fields. Similar to SQL order by

data = LOAD 'data/nested-sample.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

ordera = ORDER data BY f1 ASC;
DUMP ordera;

orderd = ORDER data BY f1 DESC;
DUMP orderd;


https://github.com/sudar/pig-samples/order-by.pig
DISTINCT


Removes duplicates from a relation

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

unique = DISTINCT data;
DUMP unique;

https://github.com/sudar/pig-samples/distinct.pig
LIMIT


Limits the number of tuples in the output.

data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP data;

limited = LIMIT data 3;
DUMP limited;


https://github.com/sudar/pig-samples/limit.pig
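FILTER, ORDER BY, DISTINCT and LIMIT each have a close everyday equivalent. The plain-Python sketch below is only an analogy on invented rows (one duplicate added so DISTINCT has something to remove).

```python
data = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6), (1, 2, 3)]

# FILTER data BY f1 == 1
filtered = [row for row in data if row[0] == 1]

# ORDER data BY f1 ASC / DESC
ordera = sorted(data, key=lambda row: row[0])
orderd = sorted(data, key=lambda row: row[0], reverse=True)

# DISTINCT data (sorted here only to make the result predictable)
unique = sorted(set(data))

# LIMIT data 3 (without a preceding ORDER BY, Pig may return any 3 tuples)
limited = data[:3]
```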
JOIN


Joins two or more relations on a common field. Both inner
and outer joins are supported

a = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP a;

b = LOAD 'data/simple-tuples.txt' USING PigStorage(',') AS (t1:int, t2:int);
DUMP b;

joined = JOIN a BY f1, b BY t1;
DUMP joined;
https://github.com/sudar/pig-samples/join.pig
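An inner join can be pictured as a lookup from one relation into the other. This is a plain-Python sketch; `inner_join` is a made-up helper and the rows in `b` are invented, since the contents of data/simple-tuples.txt are not shown.

```python
from collections import defaultdict

def inner_join(left, left_key, right, right_key):
    """Pair every left row with every right row that has the same key."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]

a = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
b = [(1, 10), (2, 20), (5, 50)]  # invented (t1, t2) rows

joined = inner_join(a, 0, b, 0)  # like JOIN a BY f1, b BY t1
```

Rows with no match on the other side (f1 of 3 or 4, t1 of 5) are dropped, which is what distinguishes an inner join from an outer one.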
SQL vs Pig


From Table – Load file(s)
Select – FOREACH GENERATE
Where – FILTER BY
Group By – GROUP BY + FOREACH GENERATE
Having – FILTER BY
Order By – ORDER BY
Distinct - DISTINCT
Let’s see a complete example


Count the number of words in a
           text file

   https://github.com/sudar/pig-samples/count-words.pig
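The linked script is not reproduced here. As a hedged outline of the usual approach (split lines into words, flatten, group by word, count), here is the same flow in plain Python with invented input; `count_words` is a name chosen for this sketch.

```python
from collections import Counter

def count_words(lines):
    """Split each line into words, then count occurrences of each word."""
    words = (word for line in lines for word in line.split())  # tokenize + flatten
    return Counter(words)                                      # group + count

text = ["pigs eat anything", "pigs fly"]
counts = count_words(text)
```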
Extending Pig - UDF
Why UDF?


  Do operations on more than one field
  Do more than grouping and filtering
  Programmer is comfortable
  Want to reuse existing logic

Traditionally, UDFs could be written only in Java. Now other
languages like Python are also supported
Different types of UDF’s


Eval Functions
Filter functions
Load functions
Store functions
Eval Functions


  Can be used in FOREACH statement
  Most common type of UDF
  Can return simple types or Tuples

b = FOREACH a generate udf.Function($0);

b = FOREACH a generate udf.Function($0, $1);
Eval Functions


Extend the EvalFunc<T> abstract class
The generic <T> should be the return type
Input comes as a Tuple
Should check for empty and null input
Override exec(); it should return the value
Override getArgToFuncMapping() to declare which implementation
handles which argument types
Override outputSchema() to declare the schema of the
output
Using Java UDF in Pig Scripts


Create a jar file which contains your UDF classes
Register the jar at the top of Pig script
Register other jars if needed
Define the UDF function
Use your UDF function
Let’s see an example which
       returns a string
  https://github.com/sudar/pig-samples/strip-quote.pig
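The linked sample is a Java UDF and is not reproduced here. The sketch below only shows the kind of logic such an eval function might contain, in plain Python. The name `strip_quote` and the exact behaviour (removing one pair of surrounding double quotes) are assumptions.

```python
def strip_quote(value):
    """Return value without one pair of surrounding double quotes.

    None is passed through unchanged, mirroring the advice to check
    for empty and null input.
    """
    if value is None:
        return None
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value
```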
Let’s see an example which
       returns a Tuple

  https://github.com/sudar/pig-samples/get-twitter-names.pig
Filter Functions


  Can be used in the Filter statements
  Returns a boolean value



Eg:
vim_tweets = FILTER data BY FromVim(StripQuote($6));
Filter Functions


Extend FilterFunc, which is an EvalFunc<Boolean>
Should return a boolean
Input is the same as for EvalFunc<T>
Should check for empty and null input
Override getArgToFuncMapping() to declare which
implementation handles which argument types
Let’s see an example which
     returns a Boolean
  https://github.com/sudar/pig-samples/from-vim.pig
Error Handling in UDF


If the error affects only a particular row, return
null.
If the error affects other rows but the UDF can recover,
throw an IOException.
If the error affects other rows and the UDF can’t
recover, also throw an IOException. Pig and
Hadoop will quit if there are many IOExceptions.
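The first and last rules can be illustrated in plain Python, with None standing in for null and IOError standing in for Java's IOException. Both function names and the lookup-table scenario are hypothetical.

```python
def parse_height(value):
    """Row-level problem: return None so only this row is affected."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def lookup_city(code, table):
    """Problem that affects every row: raise so the failure is visible."""
    if table is None:
        raise IOError("lookup table is not available")
    return table.get(code)
```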
Can we try to write some more
            UDF’s?
Writing UDF in other languages
Streaming


The entire data set is passed through an external task
The external task can be written in any language
Even a shell script works
Uses the `STREAM` operator
Stream through shell script


data = LOAD 'data/tweets.csv' USING PigStorage(',');

filtered = STREAM data THROUGH `cut -f6,8`;

DUMP filtered;



https://github.com/sudar/pig-samples/stream-shell-script.pig
Stream through Python


data = LOAD 'data/tweets.csv' USING PigStorage(',');

filtered = STREAM data THROUGH `strip.py`;

DUMP filtered;


https://github.com/sudar/pig-samples/stream-python.pig
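The contents of strip.py are not shown in the slides. A streaming script of this kind typically reads rows from standard input and writes transformed rows to standard output. The sketch below is a hypothetical version that assumes tab-separated fields and trims whitespace and surrounding double quotes; it is run here on an in-memory sample instead of real input.

```python
import io
import sys

def strip_fields(line):
    """Trim whitespace and surrounding double quotes from each tab-separated field."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(field.strip().strip('"') for field in fields)

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Transform every input row and write it out, one row per line."""
    for line in stdin:
        stdout.write(strip_fields(line) + "\n")

# Demonstration on invented rows
sample = io.StringIO('"sudar"\t "vim" \n"muthu"\t"emacs"\n')
output = io.StringIO()
main(sample, output)
streamed = output.getvalue()
```

Saved as a real script, the demonstration lines would be replaced by a plain call to `main()` so that it reads from standard input.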
Debugging Pig Scripts


DUMP is your friend, but use it with LIMIT
DESCRIBE – prints the schema of a relation
ILLUSTRATE – shows how sample rows flow through each step of the script
In UDFs, we can use the warn() function. It supports
up to 15 different debug levels
Use Penny -
https://cwiki.apache.org/PIG/pennytoollibrary.html
Optimizing Pig Scripts


Project early and often
Filter early and often
Drop nulls before a join
Prefer DISTINCT over GROUP BY
Use the right data structure
Using Param substitution


 -p key=value – substitutes a single key/value pair
 -m file.ini – reads substitutions from a parameter file
 %default – provides a default value inside the script

http://sudarmuthu.com/blog/passing-command-line-arguments-to-pig-scripts
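Parameter substitution is a textual replacement of `$name` placeholders before the script runs. Python's `string.Template` happens to use the same `$name` syntax, so it can serve as a rough stand-in. The parameter names `input` and `delim` are invented for this sketch.

```python
from string import Template

script = Template("data = LOAD '$input' USING PigStorage('$delim');")

defaults = {"delim": ","}              # plays the role of a default value
params = {"input": "data/tweets.csv"}  # plays the role of -p input=data/tweets.csv

# Command-line values override defaults
resolved = script.substitute({**defaults, **params})
```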
Problems that can be solved using Pig


Anything data related
When not to use Pig?


A lot of custom logic needs to be implemented
A lot of cross lookups are needed
Data is mostly binary (e.g. processing image files)
Real-time processing of data is needed
External Libraries


PiggyBank -
https://cwiki.apache.org/PIG/piggybank.html
DataFu – LinkedIn Pig Library -
https://github.com/linkedin/datafu
Elephant Bird – Twitter Pig Library -
https://github.com/kevinweil/elephant-bird
Useful Links


  Pig homepage - http://pig.apache.org/
  My blog about Pig -
http://sudarmuthu.com/blog/category/hadoop-pig
  Sample code – https://github.com/sudar/pig-samples
  Slides – http://slideshare.net/sudar
Thank you

DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Dernier (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Pig workshop

  • 1. Pig Workshop Sudar Muthu http://sudarmuthu.com http://twitter.com/sudarmuthu https://github.com/sudar
  • 2. Who am I? Research Engineer by profession I mine useful information from data You might recognize me from other HasGeek events Blog at http://sudarmuthu.com Builds robots as a hobby ;)
  • 4. What I will not cover?
  • 5. What I will not cover? What is BigData, or why it is needed? What is MapReduce? What is Hadoop? Internal architecture of Pig http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig
  • 6. What we will see today?
  • 7. What we will see today? What is Pig How to use it Loading and storing data Pig Latin SQL vs Pig Writing UDFs Debugging Pig Scripts Optimizing Pig Scripts When to use Pig
  • 8. So, all of you have Pig installed right? ;)
  • 9. What is Pig? “Platform for analyzing large sets of data”
  • 10. Components of Pig Pig Shell (Grunt) Pig Language (Latin) Libraries (Piggy Bank) User Defined Functions (UDF)
  • 11. Why Pig? It is a data flow language Provides standard data processing operations Insulates Hadoop complexity Abstracts Map Reduce Increases programmer productivity … but there are cases where Pig is not suitable.
  • 13. For this workshop, we will be using Pig only in local mode
  • 14. Getting to know your Pig shell
  • 15. pig -x local Similar to Python’s shell
  • 16. Different ways of executing Pig Scripts Inline in shell From a file Streaming through other executable Embed script in other languages
  • 17. Loading and Storing data Pigs eat anything
  • 18. Loading Data into Pig file = LOAD 'data/dropbox-policy.txt' AS (line); data = LOAD 'data/tweets.csv' USING PigStorage(','); data = LOAD 'data/tweets.csv' USING PigStorage(',') AS (list, of, fields);
  • 19. Loading Data into Pig PigStorage – for most cases TextLoader – to load text files JsonLoader – to load JSON files Custom loaders – You can write your own custom loaders as well
  • 20. Viewing Data DUMP data; Very useful for debugging, but don’t use it on huge datasets
  • 21. Storing Data from Pig STORE data INTO 'output_location'; STORE data INTO 'output_location' USING PigStorage(); STORE data INTO 'output_location' USING PigStorage(','); STORE data INTO 'output_location' USING BinStorage();
  • 22. Storing Data Similar to `LOAD`, a lot of options are available Can store locally or in HDFS You can write your own custom Storage as well
  • 23. Load and Store example data = LOAD 'data/data-bag.txt' USING PigStorage(','); STORE data INTO 'data/output/load-store' USING PigStorage('|'); https://github.com/sudar/pig-samples/load-store.pig
  • 26. Scalar Types int, long – (32, 64 bit) integer float, double – (32, 64 bit) floating point boolean (true/false) chararray (String in UTF-8) bytearray (blob) (DataByteArray in Java) If you don’t specify a type, bytearray is used by default
  • 27. Complex Types tuple – ordered set of fields (data) bag – collection of tuples map – set of key value pairs
  • 28. Tuple Row with one or more fields Fields can be of any data type Ordering is important Enclosed inside parentheses () Eg: (Sudar, Muthu, Haris, Dinesh) (Sudar, 176, 80.2F)
  • 29. Bag Set of tuples SQL equivalent is Table Each tuple can have a different set of fields Can have duplicates Inner bag uses curly braces {} Outer bag doesn’t use anything
  • 30. Bag - Example Outer bag (1,2,3) (1,2,4) (2,3,4) (3,4,5) (4,5,6) https://github.com/sudar/pig-samples/data-bag.pig
  • 31. Bag - Example Inner bag (1,{(1,2,3),(1,2,4)}) (2,{(2,3,4)}) (3,{(3,4,5)}) (4,{(4,5,6)}) https://github.com/sudar/pig-samples/data-bag.pig
  • 32. Map Set of key value pairs Similar to HashMap in Java Key must be unique Key must be of chararray data type Values can be any type Key/value is separated by # Map is enclosed by []
  • 33. Map - Example [name#sudar, height#176, weight#80.5F] [name#(sudar, muthu), height#176, weight#80.5F] [name#(sudar, muthu), languages#(Java, Pig, Python)]
  • 34. Null Similar to SQL Denotes that value of data element is unknown Any data type can be null
  • 35. Schemas in Load statement We can specify a schema (collection of datatypes) to `LOAD` statements data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
  • 36. Expressions Fields can be looked up by Position Name Map Lookup
  • 37. Expressions - Example data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]); by_pos = FOREACH data GENERATE $0; DUMP by_pos; by_field = FOREACH data GENERATE f2; DUMP by_field; by_map = FOREACH data GENERATE f3#'name'; DUMP by_map; https://github.com/sudar/pig-samples/lookup.pig
  • 39. Arithmetic Operators All usual arithmetic operators are supported Addition (+) Subtraction (-) Multiplication (*) Division (/) Modulo (%)
  • 40. Boolean Operators All usual boolean operators are supported AND OR NOT
  • 41. Comparison Operators All usual comparison operators are supported == != < > <= >=
  • 43. FOREACH Generates data transformations based on columns of data x = FOREACH data GENERATE *; x = FOREACH data GENERATE $0, $1; x = FOREACH data GENERATE $0 AS first, $1 AS second;
  • 44. FLATTEN Un-nests tuples and bags. Most of the time results in cross product (a, (b, c)) => (a,b,c) ({(a,b),(d,e)}) => (a,b) and (d,e) (a, {(b,c), (d,e)}) => (a, b, c) and (a, d, e)
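The three FLATTEN cases on the slide can be modelled in plain Python. This is only an illustrative sketch, not Pig code: it assumes tuples are represented as Python tuples and bags as Python lists of tuples, and the helper name `flatten_row` is hypothetical. Unlike Pig, where FLATTEN is applied to a chosen field inside GENERATE, this helper un-nests every field of the row.

```python
from itertools import product


def flatten_row(row):
    """Un-nest every tuple and bag in one row.

    A nested tuple is spliced into the row. A nested bag produces one
    output row per inner tuple, so two bags in the same row yield a
    cross product. An empty bag therefore produces no rows at all.
    """
    choices = []
    for field in row:
        if isinstance(field, list):       # bag: one alternative per inner tuple
            choices.append([tuple(t) for t in field])
        elif isinstance(field, tuple):    # tuple: a single alternative
            choices.append([field])
        else:                             # scalar: wrap as a one-field tuple
            choices.append([(field,)])
    # Concatenate the chosen pieces of each combination into one flat row.
    return [sum(combo, ()) for combo in product(*choices)]


tuple_case = flatten_row(("a", ("b", "c")))
bag_case = flatten_row(([("a", "b"), ("d", "e")],))
mixed_case = flatten_row(("a", [("b", "c"), ("d", "e")]))
```

The three calls correspond, in order, to the three arrows on the slide.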
  • 45. GROUP Groups data in one or more relations Groups tuples that have the same group key Similar to SQL group by operator outerbag = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP outerbag; innerbag = GROUP outerbag BY f1; DUMP innerbag; https://github.com/sudar/pig-samples/group-by.pig
  • 46. FILTER Selects tuples from a relation based on some condition data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP data; filtered = FILTER data BY f1 == 1; DUMP filtered; https://github.com/sudar/pig-samples/filter-by.pig
  • 47. COUNT Counts the number of tuples in a bag, such as each group of a grouped relation data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); grouped = GROUP data BY f2; counted = FOREACH grouped GENERATE group, COUNT(data); DUMP counted; https://github.com/sudar/pig-samples/count.pig
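To make the GROUP and COUNT slides concrete, here is a minimal Python sketch of what the two steps compute. It is not Pig code; it assumes `data/data-bag.txt` holds the five rows shown on the outer bag slide, and the helper name `group_by` is hypothetical.

```python
from collections import defaultdict

# Assumed contents of data/data-bag.txt, taken from the outer bag slide.
rows = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]


def group_by(relation, key_index):
    """Return (key, bag) pairs, where bag lists the rows sharing that key."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    # Sorted only to make the result deterministic; Pig gives no such guarantee.
    return sorted(groups.items())


by_f1 = group_by(rows, 0)                            # GROUP outerbag BY f1
by_f2 = group_by(rows, 1)                            # GROUP data BY f2
counted = [(key, len(bag)) for key, bag in by_f2]    # GENERATE group, COUNT(data)
```

`by_f1` has the same shape as the inner bag example: each key paired with the bag of rows that share it.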
  • 48. ORDER By Sort a relation based on one or more fields. Similar to SQL order by data = LOAD 'data/nested-sample.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP data; ordera = ORDER data BY f1 ASC; DUMP ordera; orderd = ORDER data BY f1 DESC; DUMP orderd; https://github.com/sudar/pig-samples/order-by.pig
  • 49. DISTINCT Removes duplicates from a relation data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP data; unique = DISTINCT data; DUMP unique; https://github.com/sudar/pig-samples/distinct.pig
  • 50. LIMIT Limits the number of tuples in the output. data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP data; limited = LIMIT data 3; DUMP limited; https://github.com/sudar/pig-samples/limit.pig
  • 51. JOIN Joins relations based on a field. Both outer and inner joins are supported a = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP a; b = LOAD 'data/simple-tuples.txt' USING PigStorage(',') AS (t1:int, t2:int); DUMP b; joined = JOIN a BY f1, b BY t1; DUMP joined; https://github.com/sudar/pig-samples/join.pig
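The inner join above can be sketched in plain Python as a hash join. This is an illustration only: the rows in `b` are made up, because the contents of `data/simple-tuples.txt` are not shown in the slides, and `inner_join` is a hypothetical helper.

```python
from collections import defaultdict


def inner_join(left, right, left_key, right_key):
    """Inner-join two lists of tuples on the given field positions."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    joined = []
    for row in left:
        key = row[left_key]
        if key is None:
            continue  # null keys never match in an inner join
        for match in index.get(key, []):
            joined.append(row + match)  # fields of both sides, side by side
    return joined


a = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
b = [(1, 10), (2, 20), (5, 50)]  # hypothetical stand-in for simple-tuples.txt
joined = inner_join(a, b, 0, 0)  # JOIN a BY f1, b BY t1
```

Rows of `a` with no partner in `b` (keys 3 and 4 here) are dropped, which is the difference from an outer join.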
  • 52. SQL vs Pig From Table – Load file(s) Select – FOREACH GENERATE Where – FILTER BY Group By – GROUP BY + FOREACH GENERATE Having – FILTER BY Order By – ORDER BY Distinct - DISTINCT
  • 53. Let’s see a complete example Count the number of words in a text file https://github.com/sudar/pig-samples/count-words.pig
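For comparison, the same word count can be written in a few lines of Python. This is a sketch of the computation, not a copy of `count-words.pig`: it assumes words are separated by whitespace only, and the function name `count_words` is hypothetical. In Pig terms the steps would roughly be load lines, split each line into words, group by word, and count each group.

```python
from collections import Counter


def count_words(lines):
    """Count how often each whitespace-separated word occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # split the line, then tally every word
    # Most frequent first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


result = count_words(["pigs eat anything", "pigs eat"])
```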
  • 55. Why UDF? Do operations on more than one field Do more than grouping and filtering Programmer is more comfortable writing code in a familiar language Want to reuse existing logic Traditionally, UDFs could be written only in Java. Now other languages such as Python are also supported
  • 56. Different types of UDFs Eval functions Filter functions Load functions Store functions
  • 57. Eval Functions Can be used in FOREACH statement Most common type of UDF Can return simple types or Tuples b = FOREACH a generate udf.Function($0); b = FOREACH a generate udf.Function($0, $1);
  • 58. Eval Functions Extend the EvalFunc<T> abstract class The generic <T> should be the return type Input comes as a Tuple Should check for empty and null input Override exec() and return the value from it Override getArgToFuncMapping() to let Pig know about argument mapping Override outputSchema() to let Pig know about the output schema
  • 59. Using Java UDF in Pig Scripts Create a jar file which contains your UDF classes Register the jar at the top of Pig script Register other jars if needed Define the UDF function Use your UDF function
  • 60. Let’s see an example which returns a string https://github.com/sudar/pig-samples/strip-quote.pig
  • 61. Let’s see an example which returns a Tuple https://github.com/sudar/pig-samples/get-twitter-names.pig
  • 62. Filter Functions Can be used in the Filter statements Returns a boolean value Eg: vim_tweets = FILTER data By FromVim(StripQuote($6));
  • 63. Filter Functions Extend FilterFunc, which is an EvalFunc<Boolean> Should return a boolean Input is the same as for EvalFunc<T> Should check for empty and null input Override getArgToFuncMapping() to let Pig know about argument mapping
  • 64. Let’s see an example which returns a Boolean https://github.com/sudar/pig-samples/from-vim.pig
  • 65. Error Handling in UDF If the error affects only a particular row, then return null. If the error affects other rows, but can recover, then throw an IOException. If the error affects other rows, and can’t recover, then also throw an IOException. Pig and Hadoop will quit if there are many IOExceptions.
  • 66. Can we try to write some more UDFs?
  • 67. Writing UDF in other languages
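As a rough idea of what a UDF in another language can look like, here is a small Python sketch. The function name `strip_quote` and the schema string are hypothetical and only loosely mirror the StripQuote Java UDF used earlier. When a script is registered with Pig's Jython support, Pig supplies the `outputSchema` decorator; the fallback below is an assumption added so the file also runs as ordinary Python.

```python
try:
    outputSchema  # supplied by Pig when the script is registered as a Jython UDF
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the file also works outside Pig."""
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("stripped:chararray")
def strip_quote(value):
    """Remove surrounding double quotes; pass nulls through unchanged."""
    if value is None:
        return None  # a bad or missing value affects only this row
    return value.strip('"')
```

Assuming the file is saved as `udfs.py`, it would be registered with something like `REGISTER 'udfs.py' USING jython AS udf;` and then called as `udf.strip_quote($6)` inside a FOREACH.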
  • 69. Streaming Entire data set is passed through an external task The external task can be in any language Even a shell script works Uses the `STREAM` operator
  • 70. Stream through shell script data = LOAD 'data/tweets.csv' USING PigStorage(','); filtered = STREAM data THROUGH `cut -f6,8`; DUMP filtered; https://github.com/sudar/pig-samples/stream-shell-script.pig
  • 71. Stream through Python data = LOAD 'data/tweets.csv' USING PigStorage(','); filtered = STREAM data THROUGH `strip.py`; DUMP filtered; https://github.com/sudar/pig-samples/stream-python.pig
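The contents of `strip.py` are not shown in the slides, so the following is only a guess at what such a streaming script might do. It assumes Pig's default streaming format, in which each tuple arrives as one tab-separated line on standard input and results are written the same way to standard output. The names `strip_fields` and `run` are hypothetical.

```python
import sys


def strip_fields(line, delimiter="\t"):
    """Trim whitespace and surrounding double quotes from every field."""
    fields = line.rstrip("\n").split(delimiter)
    return delimiter.join(field.strip().strip('"') for field in fields)


def run(stdin, stdout):
    """Read tuples line by line and write the cleaned tuples back out."""
    for line in stdin:
        stdout.write(strip_fields(line) + "\n")
```

Saved as an executable script, the file would end with a call such as `run(sys.stdin, sys.stdout)` so that it processes whatever Pig pipes into it.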
  • 72. Debugging Pig Scripts DUMP is your friend, but use it with LIMIT DESCRIBE – prints the schema of a relation ILLUSTRATE – shows a step-by-step run of the script on a small sample of the data In UDFs, we can use the warn() function. It supports up to 15 different debug levels Use Penny - https://cwiki.apache.org/PIG/pennytoollibrary.html
  • 73. Optimizing Pig Scripts Project early and often Filter early and often Drop nulls before a join Prefer DISTINCT over GROUP BY Use the right data structure
  • 74. Using Param substitution -p key=value – substitutes a single key, value -m file.ini – substitutes using an ini file %default – provides default values http://sudarmuthu.com/blog/passing-command-line-arguments-to-pig-scripts
  • 75. Problems that can be solved using Pig Anything data related
  • 76. When not to use Pig? A lot of custom logic needs to be implemented Need to do a lot of cross lookups Data is mostly binary (processing image files) Real-time processing of data is needed
  • 77. External Libraries PiggyBank - https://cwiki.apache.org/PIG/piggybank.html DataFu – LinkedIn Pig Library - https://github.com/linkedin/datafu Elephant Bird – Twitter Pig Library - https://github.com/kevinweil/elephant-bird
  • 78. Useful Links Pig homepage - http://pig.apache.org/ My blog about Pig - http://sudarmuthu.com/blog/category/hadoop-pig Sample code – https://github.com/sudar/pig-samples Slides – http://slideshare.net/sudar