This document provides an overview of Pig, an analytics platform for Hadoop. It discusses what Pig is, how it works, its data types and operations like LOAD, FILTER, GROUP, FOREACH, and ORDER. The hands-on section demonstrates counting arrests by team using the Pig Latin scripting language to LOAD data, GROUP by team, then FOREACH to COUNT arrests and return the results.
Pig Hands On November
1. Hands-on Pig with the NFL Play by Play Dataset
Ryan Bosshart | Systems Engineer
Nov 2013 v1
2. Outline
• What is Pig
• Pig Latin by Example
• Data Model/Architecture
• Hands-on with Pig
3. What is Pig?
Give me every run in the 2010 season:
SELECT * FROM playbyplay WHERE playtype = "RUN" AND year = 2010;
playbyplay = LOAD 'playbyplay' ….;
run_plays = FILTER playbyplay BY (playtype=='RUN') AND (year==2010);
DUMP run_plays;
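To make the FILTER step concrete, here is a minimal Python sketch that models a relation as a list of tuples. The sample rows and the two-field layout (playtype, year) are hypothetical; the real playbyplay schema is not shown in these slides.

```python
# Hypothetical sample rows: (playtype, year). The real dataset has more fields.
playbyplay = [
    ("RUN", 2010),
    ("PASS", 2010),
    ("RUN", 2011),
    ("RUN", 2010),
]

# Rough equivalent of:
#   run_plays = FILTER playbyplay BY (playtype=='RUN') AND (year==2010);
run_plays = [row for row in playbyplay if row[0] == "RUN" and row[1] == 2010]

# Rough equivalent of: DUMP run_plays;
for row in run_plays:
    print(row)
```

Like the SQL WHERE clause, FILTER keeps only the tuples for which the condition is true and leaves the fields of each tuple unchanged.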
7. How Pig Works
Pig Latin: Count Job
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
• Parses
• Checks
• Optimizes
• Plans execution
• Submits jar to Hadoop
• Monitors job progress
Execution Plan
Map: Filter
Reduce: Counter
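The execution plan above (filter in the map phase, count in the reduce phase) can be sketched in plain Python. This is only an illustration of the idea under stated assumptions: the input rows and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and the plan Pig actually generates may differ (for example, it can add a combiner).

```python
from collections import defaultdict

# Hypothetical input rows (x, y, z) standing in for 'myfile'.
rows = [(1, "a", "b"), (0, "c", "d"), (2, "e", "f"), (1, "g", "h"), (-3, "i", "j")]

def map_phase(records):
    """Map: apply the filter (x > 0) and emit (key, record) pairs keyed by x."""
    for rec in records:
        if rec[0] > 0:
            yield rec[0], rec

def shuffle(pairs):
    """Shuffle: collect all records that share a key between the two phases."""
    grouped = defaultdict(list)
    for key, rec in pairs:
        grouped[key].append(rec)
    return grouped

def reduce_phase(grouped):
    """Reduce: count the records in each group."""
    return sorted((key, len(recs)) for key, recs in grouped.items())

output = reduce_phase(shuffle(map_phase(rows)))
print(output)
```

The point of the sketch is that the FILTER needs no data movement, so it can run inside the mappers, while the COUNT needs all rows for one key in one place, which is what the reduce phase provides.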
14. Grouping & Types
• GROUP BY makes an output bag of tuples, each containing a nested bag
Gprd = GROUP arrests BY team;
• In: BagOf(year, team, player)
• Out: BagOf(group, BagOf(year, team, player) named arrests)
• The grouping item is always named "group"
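The shape of a grouped relation is easier to see with data. The sketch below models it in Python, grouping by the team field; the helper `group_by` is a hypothetical name, and the four input tuples are a small subset of the sample rows shown elsewhere in this deck.

```python
from collections import defaultdict

# A few (year, team, player) tuples; a small subset chosen for illustration.
arrests = [
    (2010, "TEN", "Derrick Morgan"),
    (2010, "WAS", "Fred Davis"),
    (2010, "TEN", "Vince Young"),
    (2010, "WAS", "Albert Haynesworth"),
]

def group_by(relation, key_index):
    """Return a list of (group, bag) tuples, modeling Pig's GROUP ... BY."""
    bags = defaultdict(list)
    for tup in relation:
        bags[tup[key_index]].append(tup)
    return sorted(bags.items())

grouped = group_by(arrests, 1)
for group, bag in grouped:
    print(group, bag)
```

Each output tuple has two fields: the key (Pig names it "group") and a bag holding the complete original tuples, including the key field itself.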
16. Counting Arrests by Team
num_arrests = FOREACH arrests_by_team GENERATE group AS team, COUNT(arrests) AS total;
(TEN, {(2010,TEN,Derrick Morgan), (2010,TEN,Vince Young), (2010,TEN,Kenny Bric)})
(WAS, {(2010,WAS,Fred Davis), (2010,WAS,Albert Haynesworth), (2010,WAS,Fred Davis), (2010,WAS,Fred Davis), (2010,WAS,Joe Joseph)})
Results:
(SEA,20)
(STL,9)
(TEN,31)
(WAS,16)
17. Using Types
• By default, Pig treats data as untyped
• Users can declare data types at load time
arrests = LOAD 'arrests.csv' USING PigStorage(',')
AS(year:int, team:chararray, player:chararray);
• If a data type is not declared but the script treats a value as a certain type, Pig will assume it is of that type and cast it
arrests = LOAD 'arrests.csv' USING PigStorage(',')
AS(year, team, player);
Two_digit_year = FOREACH arrests
GENERATE year - 2000; -- cast to int
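The implicit cast can be modeled in a few lines of Python. This is a sketch of the idea only, not of Pig's internals: undeclared fields are represented here as Python bytes (Pig's default type for undeclared fields is bytearray), and the helper `as_int` is a hypothetical name.

```python
# Undeclared fields arrive untyped; bytes play that role in this sketch.
arrests = [
    (b"2010", b"TEN", b"Vince Young"),
    (b"2010", b"WAS", b"Fred Davis"),
]

def as_int(value):
    """Model the implicit cast: an untyped value becomes an int when used in arithmetic."""
    return int(value) if isinstance(value, bytes) else value

# Rough equivalent of: Two_digit_year = FOREACH arrests GENERATE year - 2000;
two_digit_year = [(as_int(year) - 2000,) for year, _team, _player in arrests]
print(two_digit_year)
```

Declaring `year:int` at load time makes the intent explicit in the schema instead of leaving the type to be inferred from how the field is used.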
21. Sharing Metadata
• Use HCatalog
$ pig -useHCatalog
grunt> playbyplay = LOAD 'playbyplay' USING
org.apache.hcatalog.pig.HCatLoader();
grunt> STORE newdata INTO 'newtable' USING
org.apache.hcatalog.pig.HCatStorer();
• Need to upload some jars to enable this:
http://blog.cloudera.com/blog/2013/08/demo-using-hue-toaccess-hive-data-through-pig/
23. Try it!
$ pig
grunt> arrests = LOAD 'arrests.csv' USING PigStorage(',') AS
(year,team,player);
grunt> grouped_arrests = GROUP arrests BY team;
grunt> num_arrests = FOREACH grouped_arrests GENERATE group AS
team, COUNT(arrests) AS total;
grunt> ordered_arrests = ORDER num_arrests BY total;
grunt> bad_boys = FILTER ordered_arrests BY (total>20);
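For readers without a cluster at hand, the same pipeline can be traced in Python. This is a sketch under stated assumptions: the inline CSV text is a hypothetical stand-in for arrests.csv, and the filter threshold is lowered from 20 to 2 so that the tiny sample still produces a result.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for arrests.csv; the real file is not included here.
ARRESTS_CSV = """2010,TEN,Derrick Morgan
2010,TEN,Vince Young
2010,WAS,Fred Davis
2010,WAS,Albert Haynesworth
2010,WAS,Fred Davis
"""

# LOAD 'arrests.csv' USING PigStorage(',') AS (year, team, player)
arrests = [tuple(row) for row in csv.reader(io.StringIO(ARRESTS_CSV))]

# GROUP arrests BY team; FOREACH ... GENERATE group AS team, COUNT(arrests) AS total
num_arrests = list(Counter(team for _year, team, _player in arrests).items())

# ORDER num_arrests BY total (ascending order)
ordered_arrests = sorted(num_arrests, key=lambda pair: pair[1])

# FILTER ordered_arrests BY (total > 20); threshold lowered to 2 for this sample
bad_boys = [(team, total) for team, total in ordered_arrests if total > 2]

print(ordered_arrests)
print(bad_boys)
```

At the grunt prompt, nothing is printed until a statement such as `DUMP bad_boys;` asks for the output, in the same way `DUMP run_plays;` is used earlier in the deck.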