2.
Why Hadoop
Hadoop in the Cloud Industry
Querying Large Data... Pig to the Rescue
Pig: Why? What? How?
Pig Basics: Install, Configure, Try
Delving Deeper into Pig and Pig Latin
Q&A
3. You have 10x more DATA
than you did 3 years ago!
BUT do you know 10x
MORE about your
BUSINESS?
NO!
4. A lot of data, BIG data!
Information
(The Big Picture)
We are not able to effectively
store and analyze all the data we have, so we are not able to see the big picture!
5.
Big Data / Web Scale: datasets that grow so large that they
become awkward to work with using traditional database
management tools
Handling Big Data with the traditional approach is costly and
rigid (difficulties include capture, storage, search, sharing,
analytics, and visualization)
Google, Yahoo, Facebook, and LinkedIn handle petabytes
of data every day.
They all use HADOOP to solve their BIG DATA problem
7. So Mr. HADOOP says he has a
solution to our BIG problem !
8.
Hadoop is open-source software for RELIABLE
and SCALABLE distributed computing
Hadoop provides a comprehensive solution to handle
Big Data
Hadoop is
HDFS : High Availability Data Storage subsystem
(http://labs.google.com/papers/gfs.html: 2003)
+
MapReduce: Parallel Processing system
(http://labs.google.com/papers/mapreduce.html: 2004)
9.
2008: Yahoo! Launched Hadoop
2009: Hadoop source code was made available to the
free world
2010: Facebook claimed that they have the largest
Hadoop cluster in the world with 21 PB of storage
2011: Facebook announced the data has grown to 30
PB
10.
Stats: Facebook
▪ Started in 2004: 1 million users
▪ August 2008: Facebook reaches over 100 million active users
▪ Now: 750+ million active users
“Bottom line: more users, more DATA”
The BIG challenge at Facebook!!
Using historical data is a big part of improving the user experience on
Facebook, so storing and processing all these bytes is of immense importance.
Facebook turned to Hadoop for this.
11. Hadoop turned out to be a great
solution, but there was one little
problem!
12.
MapReduce requires skilled Java programmers
to write standard MapReduce programs
Developers are more fluent at querying data using
SQL
“Pig says, No Problemo!”
13.
Input: User profiles, Page visits
Find the top 5 most visited pages by users aged 18-25
15.
1. Users = LOAD 'users' AS (name, age);
2. Filtered = FILTER Users BY age >= 18 AND age <= 25;
3. Pages = LOAD 'pages' AS (user, url);
4. Joined = JOIN Filtered BY name, Pages BY user;
5. Grouped = GROUP Joined BY url;
6. Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
7. Sorted = ORDER Summed BY clicks DESC;
8. Top5 = LIMIT Sorted 5;
9. STORE Top5 INTO 'top5sites';
16.
Pig is a dataflow language
• The language is called Pig Latin
• Pretty simple syntax
• Under the covers, Pig Latin scripts are turned into MapReduce
jobs and executed on the cluster
Pig Latin: high-level procedural language
Pig Engine: parser, optimizer, and
distributed query execution
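A quick way to see this in practice (a hedged sketch, assuming the Top5 statements from the earlier example have been entered at the Grunt prompt): asking Pig to EXPLAIN an alias prints the logical, physical, and MapReduce plans it would execute.
grunt> EXPLAIN Top5;   -- shows the MapReduce jobs the Top5 query compiles into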
17. PIG vs SQL
PIG: Pig is procedural
SQL: SQL is declarative
PIG: Nested relational data model (no constraints on data types)
SQL: Flat relational data model (data is tied to a specific data type)
PIG: Schema is optional
SQL: Schema is required
PIG: Scan-centric analytic workloads (no random reads or writes)
SQL: OLTP + OLAP workloads
PIG: Limited query optimization
SQL: Significant opportunity for query optimization
18. PIG
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
SQL
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
  select name, ipaddr
  from users join clicks on (users.name = clicks.user)
  where value > 0
) using ipaddr
group by dma;
19.
Joining datasets
Grouping data
Referring to elements by position rather than name ($0, $1,
etc)
Loading non-delimited data using a custom SerDe (Writing
a custom Reader and Writer)
Creation of user-defined functions (UDF), written in Java
And more..
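For instance, a minimal hedged sketch of position-based references (the file name visits.log and its field layout are hypothetical):
-- no schema declared, so fields are addressed by position
raw   = LOAD 'visits.log';
byUrl = GROUP raw BY $1;                  -- $1 is the second field (say, the URL)
hits  = FOREACH byUrl GENERATE group, COUNT(raw);
DUMP hits;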
21. Pig runs as a client-side application. Even if you want to run Pig on a Hadoop
cluster, there is nothing extra to install on the cluster: Pig launches jobs and
interacts with HDFS (or other Hadoop file systems) from your workstation.
Download a stable release from http://hadoop.apache.org/pig/releases.html
and unpack the tarball in a suitable place on your workstation:
% tar xzf pig-x.y.z.tar.gz
It’s convenient to add Pig’s binary directory to your command-line path.
For example:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
You also need to set the JAVA_HOME environment variable to point to a
suitable Java installation.
22.
Execution Types
Local mode (pig -x local)
Hadoop mode
Pig must be configured with the cluster’s namenode
and jobtracker:
1. Put the Hadoop config directory on the Pig classpath
% export PIG_CLASSPATH=$HADOOP_INSTALL/conf/
2. Or create a pig.properties containing
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
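As a quick sanity check (a hedged sketch, not from the original slides): once PIG_CLASSPATH or pig.properties points at the cluster, filesystem commands issued from Grunt hit HDFS rather than the local disk.
grunt> fs -ls /        -- in Hadoop mode this lists the root of the cluster's HDFS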
23.
Script: Pig can run a script file that contains Pig commands.
For example,
% pig script.pig
Runs the commands in the local file "script.pig".
Alternatively, for very short scripts, you can use the -e option to run a script
specified as a string on the command line.
Grunt: Grunt is an interactive shell for running Pig commands.
Grunt is started when no file is specified for Pig to run, and the -e option is not used.
Note: It is also possible to run Pig scripts from within Grunt using run and exec.
Embedded: You can run Pig programs from Java, much like you can use JDBC to run SQL
programs from Java.
There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig
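For example, a hedged sketch of running a saved script from inside Grunt (the script path is hypothetical):
grunt> exec /home/tom/script.pig   -- runs in a separate context; aliases defined in the script are not visible afterwards
grunt> run /home/tom/script.pig    -- runs as if typed at the prompt; aliases defined in the script remain accessible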
24.
PigLatin: A Pig Latin program consists of a collection of statements. A
statement can be thought of as an operation, or a command
For example,
1. A GROUP operation is a type of statement:
grunt> grouped_records = GROUP records BY year;
2. The command to list the files in a Hadoop filesystem is another example of a statement:
ls /
3. A LOAD operation to load data from a tab-separated file into a Pig relation:
grunt> records = LOAD 'sample.txt'
AS (year:chararray, temperature:int, quality:int);
Data: In Pig, a single element of data is an atom
A collection of atoms – such as a row, or a partial row – is a tuple
Tuples are collected together into bags
Atom –> Tuple (row / partial row) –> Bag
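To see how tuples nest into bags, a minimal sketch continuing the records example above (the DESCRIBE output shown is approximate):
grunt> grouped_records = GROUP records BY year;
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, records: {(year: chararray, temperature: int, quality: int)}}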
25. Example contents of 'employee.txt', a tab-delimited text file:
1       Krishna       234000000   none
2       Krishna_01    234000000   none
124163  Shashi        10000       cloud
124164  Gopal         1000000     setlabs
124165  Govind        1000000     setlabs
124166  Ram           450000      es
124167  Madhusudhan   450000      e&r
124168  Hari          6500000     e&r
124169  Sachith       50000       cloud
26. -- Loading data from employee.txt into the empls relation with a schema
empls = LOAD 'employee.txt'
        AS (id:int, name:chararray, salary:double, dept:chararray);
-- Filtering the data as required
rich = FILTER empls BY $2 > 100000;
-- Sorting
sortd = ORDER rich BY salary DESC;
-- Storing the final results
STORE sortd INTO 'rich_employees.txt';
-- Or alternatively we can dump the records on the screen
DUMP sortd;
-- Group by salary
grp = GROUP empls BY salary;
-- Get the count of employees in each salary group
cnt = FOREACH grp GENERATE group, COUNT(empls.id) AS emp_cnt;
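As a usage note (a hedged sketch; ordering and number formatting of DUMP output are approximate), dumping cnt for the employee.txt sample above yields one tuple per distinct salary:
DUMP cnt;
-- e.g.
-- (10000.0,1)
-- (50000.0,1)
-- (450000.0,2)
-- (1000000.0,2)
-- (6500000.0,1)
-- (2.34E8,2)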
27.
To view the schema of a relation
DESCRIBE empls;
To view step-by-step execution of a series of statements
ILLUSTRATE empls;
To view the execution plan of a relation
EXPLAIN empls;
Join two data sets
data1 = LOAD 'data1' AS (col1, col2, col3, col4);
data2 = LOAD 'data2' AS (colA, colB, colC);
jnd = JOIN data1 BY col3, data2 BY colA PARALLEL 50;
STORE jnd INTO 'outfile';
28.
Load using PigStorage
empls = LOAD 'employee.txt' USING PigStorage('\t')
AS (id:int, name:chararray, salary:double,
dept:chararray);
Store using PigStorage
STORE sortd INTO 'rich_employees.txt' USING
PigStorage('\t');
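The delimiter argument is arbitrary; for example, a hedged sketch loading a comma-separated file (the file name employee.csv is hypothetical):
csv = LOAD 'employee.csv' USING PigStorage(',')
      AS (id:int, name:chararray, salary:double, dept:chararray);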
30.
Pig provides extensive support for user-defined
functions (UDFs) as a way to specify custom
processing. Functions can be a part of almost
every operator in Pig
All UDF names are case-sensitive
31.
Eval Functions (EvalFunc)
Ex: StringConcat (built-in) : Generates the concatenation of the first two
fields of a tuple.
Aggregate Functions (EvalFunc & Algebraic)
Ex: COUNT, AVG ( both built-in)
Filter Functions (FilterFunc)
Ex: IsEmpty (built-in)
Load/Store Functions (LoadFunc/ StoreFunc)
Ex: PigStorage (built-in)
Note: URL for built in functions:
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
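A minimal hedged sketch touching one built-in of each kind, reusing the employee.txt layout from earlier (relation and output names are only illustrative):
empls    = LOAD 'employee.txt' USING PigStorage('\t')          -- load/store function (PigStorage)
           AS (id:int, name:chararray, salary:double, dept:chararray);
byDept   = GROUP empls BY dept;
nonEmpty = FILTER byDept BY NOT IsEmpty(empls);                -- filter function (IsEmpty)
stats    = FOREACH nonEmpty GENERATE group, COUNT(empls),      -- aggregate functions (COUNT, AVG)
                                     AVG(empls.salary);
STORE stats INTO 'dept_stats' USING PigStorage('\t');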
32.
Piggy Bank
Piggy Bank is a place for Pig users to share their functions
DataFu (LinkedIn’s collection of UDFs)
A Hadoop library for large-scale data processing
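A hedged sketch of using a Piggy Bank function (the jar path depends on where Piggy Bank was built or installed, and the Reverse class name is assumed from the Piggy Bank string package):
REGISTER /path/to/piggybank.jar;
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();
empls = LOAD 'employee.txt' AS (id:int, name:chararray, salary:double, dept:chararray);
rev   = FOREACH empls GENERATE name, Reverse(name);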
34. -- myscript.pig
REGISTER myudfs.jar;
Note: myudfs.jar should not be surrounded with quotes
A = LOAD 'employee_data' AS (id: int,name: chararray,
salary: double, dept: chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
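Optionally (a hedged sketch), DEFINE can give the registered UDF a shorter alias so the package prefix is not repeated:
DEFINE UPPER myudfs.UPPER();
B = FOREACH A GENERATE UPPER(name);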
35.
java -cp pig.jar org.apache.pig.Main -x local
myscript.pig
or
pig -x local myscript.pig
Note: myudfs.jar should be in class path!
Locating a UDF jar file
Pig first checks the classpath.
Pig assumes that the location is either an absolute path or a
path relative to the location from which Pig was invoked.
BIG Data - datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualization. We need a data management system which is Highly Available, Reliable, Transparent, High Performance, Scalable, Accessible, Secure, Usable, and Inexpensive.
Img Source: Yahoo Hadoop website. “Pig Makes Hadoop Easy to Drive”
Pig vs Hive: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/
Pig vs SQL: http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/
Input: User profiles, page visits. Find the top 5 most visited pages by users aged 18-25.
http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
  select name, ipaddr
  from users join clicks on (users.name = clicks.user)
  where value > 0
) using ipaddr
group by dma;
The Pig Latin for this will look like:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
SerDe - Serializer/Deserializer
Execution Types: Pig has two execution types or modes: local mode and Hadoop mode.
Local mode (pig -x local): In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets and for trying out Pig. Local mode does not use Hadoop. In particular, it does not use Hadoop's local job runner; instead, Pig translates queries into a physical plan that it executes itself.
Hadoop mode: In Hadoop mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. Hadoop mode (with a fully distributed cluster) is what you use when you want to run Pig on large datasets.
Embedded Pig example (a sketch; the ExecType constructor argument is an assumption, the original note omitted it):
import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class WordCount {
    public static void main(String[] args) {
        try {
            // Local execution; use ExecType.MAPREDUCE to run on a cluster
            PigServer pigServer = new PigServer(ExecType.LOCAL);
            pigServer.registerJar("/mylocation/tokenize.jar");
            runMyQuery(pigServer, "myinput.txt");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void runMyQuery(PigServer pigServer, String inputFile) throws IOException {
        // tokenize is assumed to be a UDF provided by the registered jar
        pigServer.registerQuery("A = load '" + inputFile + "' using TextLoader();");
        pigServer.registerQuery("B = foreach A generate flatten(tokenize($0));");
        pigServer.registerQuery("C = group B by $0;");
        pigServer.registerQuery("D = foreach C generate flatten(group), COUNT(B.$0);");
        pigServer.store("D", "myoutput");
    }
}
PIG ~ RDBMS: Atom ~ Cell, Tuple ~ Row, Bag ~ Table
Example contents of 'people.txt', a tab-delimited text file with the same layout as employee.txt above.
-- Loading data from people.txt into the emps relation with a schema
emps = LOAD 'people.txt' AS (id:int, name:chararray, salary:double, dept:chararray);
-- Filtering the data as required
rich = FILTER emps BY $2 > 100000;
-- Sorting
srtd = ORDER rich BY salary DESC;
-- Storing the final results
STORE srtd INTO 'rich_people.txt';
-- Or alternatively we can dump the records on the screen
DUMP srtd;
Import data using SQOOP
1. Import movie
sqoop import \
  --connect jdbc:mysql://localhost/movielens \
  --table movie --fields-terminated-by '\t' \
  --username training --password training
2. Import movierating
sqoop import \
  --connect jdbc:mysql://localhost/movielens \
  --table movierating --fields-terminated-by '\t' \
  --username training --password training
The PARALLEL keyword only affects the number of reduce tasks; map parallelism is determined by the input file, one map for each HDFS block. At most 2 map or reduce tasks can run on a machine simultaneously.
grunt> personal = load 'personal.txt' as (empid, name, phonenumber);
grunt> official = load 'official.txt' as (empid, dept, dc);
grunt> joined = join personal by empid, official by empid;
grunt> dump joined;
http://pig.apache.org/docs/r0.7.0/udf.html
Eval Functions
Eval is the most common type of function.
How to write? UPPER extends EvalFunc<String>
Code snippet:
-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'employee_data' AS (id: int, name: chararray, salary: double, dept: chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
Sample UDF:
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;
public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
How to execute the above script?
java -cp pig.jar org.apache.pig.Main -x local myscript.pig
or
pig -x local myscript.pig
"Note: myudfs.jar should be in the class path!"
Aggregate Functions
An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. Aggregate functions are usually applied to grouped data.
How to write? COUNT extends EvalFunc<Long> implements Algebraic
Ex: COUNT, AVG (built-in)
Filter Functions
Filter functions are eval functions that return a boolean value. Filter functions can be used anywhere a Boolean expression is appropriate, including the FILTER operator.
Ex: IsEmpty (built-in)
How to write? IsEmpty extends FilterFunc
How to use it? D = FILTER C BY not IsEmpty(A);
Load/Store Functions
The load/store user-defined functions control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output, but that does not have to be the case.
Ex: PigStorage (built-in)
How to write? LOAD: SimpleTextLoader extends LoadFunc; STORE: SimpleTextStorer extends StoreFunc
Tuple: An ordered list of Data. A tuple has fields, numbered 0 through (number of fields - 1). The entry in the field can be any datatype, or it can be null. Tuples are constructed only by a TupleFactory. A DefaultTupleFactory is provided by the system. If a user wishes to use their own type of Tuple, they should also provide an implementation of TupleFactory to construct their types of Tuples. Fields are numbered from 0.