Pig programming is fun

Pig programming is more fun: New features in Pig

Daniel Dai (@daijy)
Thejas Nair (@thejasn)

© Hortonworks Inc. 2011

What is Apache Pig?
• Pig Latin, a high-level data processing language.
• An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Architecting the Future of Big Data

Pig Latin example
• Query: Get the list of pages visited by users whose age is
between 20 and 25 years.

users = load 'users' as (name, age);

users_20_to_25 = filter users by age >= 20 and age <= 25;

page_views = load 'pages' as (user, url);

page_views_u20_to_25 = join users_20_to_25 by name,
    page_views by user;


Why Pig?
• Faster development
–  Fewer lines of code
–  Don’t re-invent the wheel

• Flexible
–  Metadata is optional
–  Extensible
–  Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/


Before Pig 0.9
p1.pig p2.pig p3.pig

With Pig macros
p1.pig p2.pig p3.pig

macro1.pig macro2.pig

With Pig macros
p1.pig p1.pig rm_bots.pig

get_top.pig

Pig macro example
• Page_views data : (user_name, url, timestamp, …)
• Find top 5 users by page views
• Find top 10 most visited pages.


Pig macro example
Without macros:

page_views = LOAD ..
/* get top 5 users by page view */
u_grp = GROUP .. by uname;
u_count = FOREACH .. COUNT ..
ord_u_count = ORDER u_count ..
top_5_users = LIMIT ordered.. 5;
DUMP top_5_users;

/* get top 10 urls by page view */
url_grp = GROUP .. by url;
url_count = FOREACH .. COUNT .
ord_url_count = ORDER url_count..
top_10_urls = LIMIT ord_url.. 10;
DUMP top_10_urls;

With a macro:

/* top x macro */
DEFINE topCount (rel, col, topNum)
RETURNS top_num_recs {
    grped = GROUP $rel by $col;
    cnt_grp = FOREACH ..COUNT($rel)..
    ord_cnt = ORDER .. by cnt;
    $top_num_recs = LIMIT.. $topNum;
}
-----------------------------------------
page_views = LOAD ..
/* get top 5 users by page view */
top_5_users = topCount(page_views, uname, 5);
DUMP top_5_users;
…
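
The macro above factors the repeated group / count / order / limit pattern into one reusable definition. As a rough analogy only, the same refactoring can be sketched in plain Python. This is not Pig, and the function name `top_count` and the sample records are made up for illustration.

```python
from collections import Counter


def top_count(records, col, top_num):
    """Return the top_num most frequent values of field `col` with their counts."""
    counts = Counter(rec[col] for rec in records)
    # most_common sorts by count in descending order and truncates,
    # which plays the role of ORDER followed by LIMIT in a top-N query.
    return counts.most_common(top_num)


# Hypothetical page_views records (not from the talk).
page_views = [
    {"uname": "alice", "url": "/home"},
    {"uname": "bob", "url": "/home"},
    {"uname": "alice", "url": "/about"},
    {"uname": "alice", "url": "/home"},
]

top_users = top_count(page_views, "uname", 5)
top_urls = top_count(page_views, "url", 10)
print(top_users)
print(top_urls)
```

One definition serves both queries, which is the point of the macro: the "top users" and "top URLs" pipelines differ only in the column and the limit.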


Pig macro
• Coming soon – piggybank with pig macros


Writing a data flow program
• Writing a complex data pipeline is an iterative process

[Diagram: pipeline of Load, Load, Transform, Join, Group, Transform, and Filter steps]

• No output!

• Debug!
– Was the join on the wrong attributes?
– Bug in transform?
– Did the filter drop everything?


Common approaches to debug
• Running on real (large) data
– Inefficient, takes longer
• Running on (small) samples
– Empty results on join, selective filters


Pig illustrate command
• Objective: show example input and output for each statement; the examples should be
– Realistic
– Complete
– Concise
– Generated fast
• Steps
– Downstream – sample and process
– Prune
– Upstream – generate realistic missing classes of examples
– Prune


Illustrate command demo


Pig relation-as-scalar
• In Pig, each statement alias is a relation
– A relation is a set of records
• Task: Get the list of pages whose load time was more
than average.
• Steps
1.  Compute average load time
2.  Get list of pages whose load time is > average


• Step 1 looks like
.. = load ..
.. = group ..
al_rel = foreach .. AVG(ltime) as avg_ltime;

• Step 2 looks like
page_views = load 'pviews.txt' as
    (url, ltime, ..);

slow_views = filter page_views by
    ltime > avg_ltime;


• Getting the result of step 1 (avg_ltime)
– Join the result of step 1 with the page_views relation, or
– Write the result into a file, then use a UDF to read from the file
• The Pig scalar feature now simplifies this:
slow_views = filter page_views by
    ltime > al_rel.avg_ltime;

– Runtime exception if al_rel has more than one record.
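
To make the single-record rule concrete, here is a small plain-Python sketch of the idea. It only imitates the behavior described above; the helper `as_scalar`, the sample data, and the choice to return `None` for an empty relation are assumptions of this sketch, not Pig's implementation.

```python
def as_scalar(relation, field):
    """Treat a relation (list of records) as a scalar value.

    Mirrors the rule above: using a relation with more than one
    record as a scalar is a runtime error.
    """
    if len(relation) > 1:
        raise RuntimeError("relation used as scalar has more than one record")
    # Assumption of this sketch: an empty relation yields None.
    return relation[0][field] if relation else None


# Hypothetical page view records: url and load time.
page_views = [
    {"url": "/a", "ltime": 100},
    {"url": "/b", "ltime": 300},
    {"url": "/c", "ltime": 500},
]

# Step 1: a one-record relation holding the average load time.
avg = sum(pv["ltime"] for pv in page_views) / float(len(page_views))
al_rel = [{"avg_ltime": avg}]

# Step 2: filter against the scalar, like ltime > al_rel.avg_ltime.
threshold = as_scalar(al_rel, "avg_ltime")
slow_views = [pv for pv in page_views if pv["ltime"] > threshold]
print(slow_views)
```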


UDF in Scripting Language
• Benefit
– Use legacy code
– Use library in scripting language
– Leverage Hadoop for non-Java programmers
• Currently supported languages
– Python
– JavaScript
– Ruby
• Extensible Interface
– Minimum effort to support another language


Writing a Jython UDF
• Write a Jython UDF
@outputSchema("word:chararray")
def concat(word):
    return word + word

@outputSchemaFunction("squareSchema")
def square(num):
    if num is None:
        return None
    return ((num)*(num))

def squareSchema(input):
    return input

• Invoke Jython UDF when needed
register 'util.py' using jython as util;
B = foreach A generate util.square(i);

• Type conversion
– Simple type
– Python Array <-> Pig Bag
– Python Dict <-> Pig Map
– Python Tuple <-> Pig Tuple
• Convey schema to Pig
– outputSchema
– outputSchemaFunction
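
Inside Pig, the outputSchema and outputSchemaFunction decorators are available once the script is registered. To try the UDFs in a plain Python interpreter first, one option is to define stand-in decorators. The two stubs below are an assumption of this sketch (they only attach the schema information as function attributes) and are not Pig's own code.

```python
def outputSchema(schema):
    """Stand-in decorator: record the schema string on the function."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


def outputSchemaFunction(name):
    """Stand-in decorator: record the name of the schema function."""
    def wrap(func):
        func.outputSchemaFunction = name
        return func
    return wrap


@outputSchema("word:chararray")
def concat(word):
    return word + word


@outputSchemaFunction("squareSchema")
def square(num):
    # Propagate nulls instead of failing on them.
    if num is None:
        return None
    return num * num


def squareSchema(input_schema):
    # The output has the same schema as the input.
    return input_schema


print(concat("pig"))
print(square(4))
print(square(None))
```

Testing the functions this way keeps the UDF logic checkable without starting a Pig job; when the file is registered in Pig, the stub definitions would need to be left out.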


Use NLTK in Pig
• Example
register 'nltk_util.py' using jython as nltk;
……
B = foreach A generate nltk.tokenize(sentence);

nltk_util.py
import nltk
porter = nltk.PorterStemmer()
@outputSchema("words:{(word:chararray)}")
def tokenize(sentence):
    tokens = nltk.word_tokenize(sentence)
    words = [porter.stem(t) for t in tokens]
    return words


Writing a Script Engine
Writing a bridge UDF
class JythonFunction extends EvalFunc<Object> {
    public Object exec(Tuple tuple) {
        // Convert the Pig tuple to Python arguments, call the function, convert the result back
        PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
        PyObject result = function.__call__(params);
        return JythonUtils.pythonToPig(result);
    }
    public Schema outputSchema(Schema input) {
        // Read the schema string attached to the Python function by @outputSchema
        PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
        return Utils.getSchemaFromString(outputSchemaDef.toString());
    }
}


Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig
class JythonScriptEngine extends ScriptEngine {
    public void registerFunctions(String path, String namespace, PigContext pigContext) {
        // Simplified: run the script, then register each name it defines as a Pig UDF
        PythonInterpreter pi = Interpreter.interpreter;
        pi.execfile(path);
        for (PyTuple item : pi.getLocals().items()) {
            String key = item.get(0).toString();
            FuncSpec funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "('"
                + path + "','" + key + "')");
            pigContext.registerFunction(namespace + key, funcspec);
        }
    }
}


Algebraic UDF in JRuby
class Count < AlgebraicPigUdf
  output_schema Schema.long

  def initial t
    t.nil? ? 0 : 1
  end

  def intermed t
    return 0 if t.nil?
    t.flatten.inject(:+)
  end

  def final t
    intermed(t)
  end

end


Pig Embedding
• Embed Pig inside scripting language
– Python
– JavaScript
• Algorithms that cannot be completed in a single Pig script
– Iterative algorithms
  PageRank, K-means, neural networks, Apriori, etc.
– Parallel execution
  Random forest
– Divide and conquer
– Branching
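
The iterative case is the easiest to picture: the host script runs a job, inspects the result, and decides whether to run again. A minimal sketch of that control loop in plain Python follows. The names `iterate_until_converged` and `fake_step` are hypothetical; `fake_step` stands in for one Pig job so that the loop can run without a cluster, and in an embedded script each step would instead launch a Pig script and read back its result.

```python
def iterate_until_converged(run_step, state, tolerance=1e-6, max_iterations=100):
    """Call run_step repeatedly until the state changes by less than tolerance.

    Returns the final state and the number of iterations used.
    """
    for iteration in range(1, max_iterations + 1):
        new_state = run_step(state)
        if abs(new_state - state) < tolerance:
            return new_state, iteration
        state = new_state
    raise RuntimeError("did not converge within %d iterations" % max_iterations)


def fake_step(x):
    """Hypothetical stand-in for one Pig job: a Newton step towards sqrt(2)."""
    return 0.5 * (x + 2.0 / x)


value, iterations = iterate_until_converged(fake_step, 1.0)
print(value)
print(iterations)
```

The loop, the convergence test, and the iteration cap all live in the host language, which is what a single Pig Latin script cannot express on its own.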


Pig Embedding
from org.apache.pig.scripting import Pig

input = ":INPATH:/singlefile/studenttab10k"

# Compile Pig script
P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")

# Bind variables
Q = P.bind({'in':input})

# Launch Pig script
result = Q.runSingle()

if result.isSuccessful():
    print "Pig job PASSED"
else:
    raise RuntimeError("Pig job FAILED")


Pig Embedding
• Running an embedded Pig script
pig sample.py
• What happens within Pig?
[Diagram: Pig passes the Python script sample.py to Jython, which runs it and hands the embedded Pig script back to Pig]


Nested Operator
• Nested operator: an operator inside foreach
B = group A by name;
C = foreach B {
    C0 = limit A 10;
    generate C0;
}

• Prior to Pig 0.10, the supported nested operators were
– DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
– CROSS, FOREACH


Nested Cross/Foreach
A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
B = LOAD 'votertab10k' as (name:chararray, age:int, registration,
    contributions:double);
C = cogroup A by name, B by name;
D = foreach C {
    C1 = filter A by gpa > 4;
    C2 = filter B by contributions > 500;
    C3 = cross C1, C2;
    C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'), (chararray)contributions);
    generate flatten(C4);
}
store D into 'output';
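
To see what the nested block computes, it can help to spell out the same steps over in-memory data. The sketch below is plain Python, not Pig: the function name `nested_cross` and the sample rows are made up, and the string formatting of the numbers is Python's, which is not claimed to match Pig's chararray cast.

```python
from itertools import product

# Hypothetical rows: (name, age, gpa) and (name, age, registration, contributions).
students = [("alice", 20, 4.5), ("alice", 21, 3.0), ("bob", 22, 4.2)]
voters = [
    ("alice", 20, "democrat", 600.0),
    ("alice", 21, "green", 100.0),
    ("bob", 22, "independent", 700.0),
    ("bob", 22, "republican", 900.0),
]


def nested_cross(students, voters):
    """Per name: filter both sides, cross them, and build 'gpa_contributions' strings."""
    names = sorted({s[0] for s in students} | {v[0] for v in voters})
    out = []
    for name in names:  # one iteration per cogroup key
        c1 = [s for s in students if s[0] == name and s[2] > 4]    # filter A by gpa > 4
        c2 = [v for v in voters if v[0] == name and v[3] > 500]    # filter B by contributions > 500
        for s, v in product(c1, c2):                               # cross C1, C2
            out.append("%s_%s" % (s[2], v[3]))                     # CONCAT(gpa, '_', contributions)
    return out


result = nested_cross(students, voters)
print(result)
```

The cross happens inside each group, so a student is only ever paired with voter rows that share the same name.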


Misc Loaders
• HBaseStorage
• CassandraStorage
• AvroStorage
• JsonLoader/JsonStorage


New operators to come
• Will be available in Pig 0.11
– RANK
– A distributed RANK implementation for Pig

– CUBE
