1. Pig programming is more fun: New features in Pig
Daniel Dai (@daijy)
Thejas Nair (@thejasn)
© Hortonworks Inc. 2011 Page 1
2. What is Apache Pig?
• Pig Latin, a high-level data processing language.
• An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Architecting the Future of Big Data
3. Pig Latin example
• Query: Get the list of pages visited by users whose age is
between 20 and 25 years.
users = load 'users' as (name, age);
users_20_to_25 = filter users by age > 20 and age <= 25;
page_views = load 'pages' as (user, url);
page_views_u20_to_25 = join users_20_to_25 by name,
    page_views by user;
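The same data flow can be sketched in plain Python to make the filter-then-join semantics concrete. The sample records and the helper name `pages_for_age_range` are illustrative assumptions, not part of the slides, and the result keeps only the page-view fields rather than the full joined record Pig would produce.

```python
# Made-up in-memory stand-ins for the 'users' and 'pages' inputs.
users = [("alice", 22), ("bob", 31), ("carol", 25), ("dave", 20)]
page_views = [("alice", "/home"), ("bob", "/cart"), ("carol", "/faq"), ("alice", "/faq")]


def pages_for_age_range(users, page_views, low, high):
    """Keep users with low < age <= high, then inner-join page views on the user name."""
    selected = {name for name, age in users if low < age <= high}
    return [(user, url) for user, url in page_views if user in selected]


result = pages_for_age_range(users, page_views, 20, 25)
print(result)
```

Note that the filter mirrors the script's predicate (age > 20 and age <= 25), so a user aged exactly 20 is excluded.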
4. Why Pig?
• Faster development
– Fewer lines of code
– Don’t re-invent the wheel
• Flexible
– Metadata is optional
– Extensible
– Procedural programming
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
5. Before Pig 0.9
[Diagram: scripts p1.pig, p2.pig, p3.pig]
6. With Pig macros
[Diagram: scripts p1.pig, p2.pig, p3.pig; macro files macro1.pig, macro2.pig]
7. With Pig macros
[Diagram: p1.pig, p1.pig, rm_bots.pig, get_top.pig]
8. Pig macro example
• Page_views data : (user_name, url, timestamp, …)
• Find top 5 users by page views
• Find top 10 most visited pages.
9. Pig macro example
Without a macro:
page_views = LOAD ..
/* get top 5 users by page view */
u_grp = GROUP .. by uname;
u_count = FOREACH .. COUNT ..
ord_u_count = ORDER u_count ..
top_5_users = LIMIT ordered.. 5;
DUMP top_5_users;

/* get top 10 urls by page view */
url_grp = GROUP .. by url;
url_count = FOREACH .. COUNT .
ord_url_count = ORDER url_count..
top_10_urls = LIMIT ord_url.. 10;
DUMP top_10_urls;

With a macro:
/* top x macro */
DEFINE topCount (rel, col, topNum)
RETURNS top_num_recs {
    grped = GROUP $rel by $col;
    cnt_grp = FOREACH ..COUNT($rel)..
    ord_cnt = ORDER .. by cnt;
    $top_num_recs = LIMIT.. $topNum;
}
-----------------------------------------
page_views = LOAD ..
/* get top 5 users by page view */
top_5_users = topCount(page_views, uname, 5);
DUMP top_5_users;
…
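To make the logic the macro factors out easier to follow, here is a minimal Python sketch of the same group, count, order, limit pipeline. The function name `top_count` and the sample rows are hypothetical, and the sketch assumes a descending sort, since the ORDER direction is elided on the slide.

```python
from collections import Counter


def top_count(records, col, top_num):
    """Group records by one field, count each group, and keep the top_num largest groups."""
    counts = Counter(rec[col] for rec in records)
    # most_common sorts by descending count (assumed order), then truncates like LIMIT.
    return counts.most_common(top_num)


# Made-up page_views rows.
page_views = [
    {"uname": "ann", "url": "/a"},
    {"uname": "bob", "url": "/a"},
    {"uname": "ann", "url": "/b"},
    {"uname": "ann", "url": "/a"},
]

# One parameterized helper serves both queries, as the macro does in Pig.
top_users = top_count(page_views, "uname", 5)
top_urls = top_count(page_views, "url", 10)
print(top_users)
print(top_urls)
```

As with the macro, the relation, the grouping column, and the limit are the only things that change between the two queries.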
10. Pig macro
• Coming soon – piggybank with pig macros
11. Writing data flow program
• Writing a complex data pipeline is an iterative process
Load Load
Transform Join
Group Transform Filter
12. Writing data flow program
Load Load
Transform Join
Group Transform Filter
No output! :(
13. Writing data flow program
• Debug!
Load Load
Transform Join
Group Transform Filter
– Was join on wrong attributes?
– Bug in transform?
– Did filter drop everything?
14. Common approaches to debug
• Running on real (large) data
– Inefficient, takes longer
• Running on (small) samples
– Empty results on join, selective filters
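A small Python sketch of why independently chosen samples tend to produce empty join results. The data is made up, and the two slices stand in for any sampling scheme that picks rows from each input without regard to the other.

```python
# Made-up data: 1000 users and one page view per user.
users = [("user%d" % i, 20 + i % 30) for i in range(1000)]
page_views = [("user%d" % i, "/page%d" % i) for i in range(1000)]


def join_on_user(users, page_views):
    """Inner join of users and page views on the user name."""
    names = {name for name, _ in users}
    return [(user, url) for user, url in page_views if user in names]


# Naive samples: each input is cut down independently of the other.
users_sample = users[:10]        # first 10 users
views_sample = page_views[-10:]  # last 10 page views

full_join = join_on_user(users, page_views)
sample_join = join_on_user(users_sample, views_sample)
print(len(full_join), len(sample_join))
```

The full inputs join on every row, while the two small samples share no keys, so every statement after the join sees no data.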
15. Pig illustrate command
• Objective: show examples of the input/output of each statement that
are
– Realistic
– Complete
– Concise
– Generated fast
• Steps
– Downstream – sample and process
– Prune
– Upstream – generate realistic missing classes of examples
– Prune
17. Pig relation-as-scalar
• In Pig, each statement alias is a relation
– A relation is a set of records
• Task: Get the list of pages whose load time was more
than average.
• Steps
1. Compute average load time
2. Get list of pages whose load time is > average
18. Pig relation-as-scalar
• Step 1 is like
.. = load ..
.. = group ..
al_rel = foreach .. AVG(ltime) as avg_ltime;
• Step 2 looks like
page_views = load 'pviews.txt' as
    (url, ltime, ..);

slow_views = filter page_views by
    ltime > avg_ltime
19. Pig relation-as-scalar
• Getting results of step 1 (avg_ltime)
– Join result of step 1 with page_views relation, or
– Write result into file, then use UDF to read from file
• Pig scalar feature now simplifies this:
slow_views = filter page_views by
    ltime > al_rel.avg_ltime;
– Runtime exception if al_rel has more than one record.
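The scalar behaviour can be sketched in Python as follows. The helper `as_scalar`, its error message, and the sample rows are assumptions made for illustration. Returning None for an empty relation is also an assumption of this sketch.

```python
def as_scalar(relation, field):
    """Read one field from a relation that is expected to hold a single record."""
    if len(relation) > 1:
        # Mirrors the runtime exception described above (message is hypothetical).
        raise RuntimeError("relation used as scalar has more than one record")
    return relation[0][field] if relation else None


# Made-up page view records.
page_views = [
    {"url": "/a", "ltime": 100},
    {"url": "/b", "ltime": 300},
    {"url": "/c", "ltime": 500},
]

# Step 1: a one-record relation holding the average load time.
avg = sum(r["ltime"] for r in page_views) / float(len(page_views))
al_rel = [{"avg_ltime": avg}]

# Step 2: use the relation as a scalar inside the filter.
threshold = as_scalar(al_rel, "avg_ltime")
slow_views = [r for r in page_views if r["ltime"] > threshold]
print(slow_views)
```

The point of the feature is that step 2 needs neither a join nor a side file, only a reference to the single-record relation.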
20. UDF in Scripting Language
• Benefit
– Use legacy code
– Use library in scripting language
– Leverage Hadoop for non-Java programmers
• Currently supported languages
– Python
– JavaScript
– Ruby
• Extensible Interface
– Minimum effort to support another language
21. Writing a Jython UDF
• Write a Jython UDF
@outputSchema("word:chararray")
def concat(word):
    return word + word

@outputSchemaFunction("squareSchema")
def square(num):
    if num is None:
        return None
    return ((num)*(num))

def squareSchema(input):
    return input
• Invoke Jython UDF when needed
register 'util.py' using jython as util;
B = foreach A generate util.square(i);
• Type conversion
– Simple type
– Python Array <-> Pig Bag
– Python Dict <-> Pig Map
– Python Tuple <-> Pig Tuple
• Convey schema to Pig
– outputSchema
– outputSchemaFunction
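To try such a UDF outside Pig, one option is a stand-in for the decorator. The sketch below assumes the decorator only needs to attach the schema string to the function as an `outputSchema` attribute. This stand-in is a hypothetical local definition for testing, not Pig's own implementation.

```python
def outputSchema(schema):
    """Stand-in decorator: record the schema string on the function object."""
    def decorate(func):
        func.outputSchema = schema
        return func
    return decorate


@outputSchema("word:chararray")
def concat(word):
    return word + word


def square(num):
    # None passes through unchanged, as in the UDF above.
    if num is None:
        return None
    return num * num


print(concat("pig"))
print(concat.outputSchema)
print(square(3))
```

With the stand-in in place, the UDF bodies can be unit-tested as ordinary Python functions before the script is registered in Pig.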
22. Use NLTK in Pig
• Example
register 'nltk_util.py' using jython as nltk;
……
B = foreach A generate nltk.tokenize(sentence);
nltk_util.py:
import nltk
porter = nltk.PorterStemmer()
@outputSchema("words:{(word:chararray)}")
def tokenize(sentence):
    tokens = nltk.word_tokenize(sentence)
    words = [porter.stem(t) for t in tokens]
    return words
23. Writing a Script Engine
Writing a bridge UDF
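class JythonFunction extends EvalFunc<Object> {
    // 'function' is the PyObject of the registered Python function
    public Object exec(Tuple tuple) {
        // Convert the Pig tuple to Python arguments, call the function, convert back
        PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
        PyObject result = function.__call__(params);
        return JythonUtils.pythonToPig(result);
    }
    public Schema outputSchema(Schema input) {
        // Read the schema string attached by the outputSchema decorator
        PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
        return Utils.getSchemaFromString(outputSchemaDef.toString());
    }
}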
24. Writing a Script Engine
Register scripting UDF
register 'util.py' using jython as util;
What happens in Pig
class JythonScriptEngine extends ScriptEngine {
    public void registerFunctions(String path, String namespace, PigContext pigContext) {
        // Execute the script so that its functions are defined in the interpreter
        PythonInterpreter pi = Interpreter.interpreter;
        pi.execfile(path);
        // Register each name defined by the script as a Pig function
        for (PyTuple item : pi.getLocals().items()) {
            String key = item.get(0).toString();
            FuncSpec funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "('"
                + path + "','" + key + "')");
            pigContext.registerFunction(namespace + key, funcspec);
        }
    }
}
25. Algebraic UDF in JRuby
class Count < AlgebraicPigUdf
  output_schema Schema.long
  def initial t
    t.nil? ? 0 : 1
  end
  def intermed t
    return 0 if t.nil?
    t.flatten.inject(:+)
  end
  def final t
    intermed(t)
  end
end
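The three-phase structure is what lets partial counts be combined before the final result. The Python sketch below only illustrates that idea. The chunking into three lists is a made-up stand-in for how records might be split across tasks.

```python
def initial(value):
    """Map one input value to a partial count: 0 for a null, 1 otherwise."""
    return 0 if value is None else 1


def intermed(partials):
    """Combine partial counts into a larger partial count."""
    return sum(partials)


def final(partials):
    """Produce the final count from the remaining partial counts."""
    return intermed(partials)


# Made-up input, split into chunks as if handled by separate tasks.
values = ["a", None, "b", "c", None, "d", "e"]
chunks = [values[0:3], values[3:5], values[5:7]]

per_chunk = [intermed([initial(v) for v in chunk]) for chunk in chunks]
total = final(per_chunk)
print(per_chunk, total)
```

Because combining partial counts is just addition, the intermediate step can run any number of times without changing the total.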
26. Pig Embedding
• Embed Pig inside scripting language
– Python
– JavaScript
• Algorithms that cannot be expressed in a single Pig script
– Iterative algorithms
PageRank, K-means, Neural Network, Apriori, etc.
– Parallel execution
Random forest
– Divide and Conquer
– Branching
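For the iterative case, the host script typically owns a convergence loop and reruns a job on each pass. The following Python sketch shows only that control flow. The `step` callable is a hypothetical stand-in for binding and running a Pig script once, and the numeric example is made up.

```python
def iterate_until_converged(step, state, tolerance, max_iterations):
    """Apply step repeatedly until the state changes by less than tolerance."""
    for iteration in range(1, max_iterations + 1):
        new_state = step(state)  # stand-in for one run of a Pig job
        if abs(new_state - state) < tolerance:
            return new_state, iteration
        state = new_state
    return state, max_iterations


def halve_distance_to_ten(x):
    """Toy update rule that moves x halfway toward 10 on each call."""
    return x + (10.0 - x) / 2.0


value, iterations = iterate_until_converged(halve_distance_to_ten, 0.0, 0.01, 100)
print(value, iterations)
```

A loop like this cannot be written in Pig Latin alone, which has no control flow, so it lives in the embedding language.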
27. Pig Embedding
from org.apache.pig.scripting import Pig

input = ":INPATH:/singlefile/studenttab10k"

# Compile Pig script
P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")

# Bind variables
Q = P.bind({'in':input})

# Launch Pig script
result = Q.runSingle()

if result.isSuccessful():
    print "Pig job PASSED"
else:
    raise Exception("Pig job FAILED")
28. Pig Embedding
• Running an embedded Pig script
pig sample.py
• What happens within Pig?
[Diagram: sample.py (Python script) is passed to Pig, which runs it in Jython; Jython in turn hands the embedded Pig script to Pig]
29. Nested Operator
• Nested operator: an operator inside foreach
B = group A by name;
C = foreach B {
    C0 = limit A 10;
    generate C0;
}
• Prior to Pig 0.10, supported nested operators
– DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
– CROSS, FOREACH
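The nested LIMIT example applies the limit to each group's bag rather than to the whole relation. A rough Python equivalent is sketched below. The helper name and the sample rows are made up, and the sketch keeps the first records in input order, which is not a guarantee made for Pig's LIMIT.

```python
from collections import defaultdict


def limit_per_group(records, key_index, n):
    """Group records by one field and keep at most n records in each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec)
    # The limit is applied inside each group, not across the whole input.
    return {key: bag[:n] for key, bag in groups.items()}


# Made-up relation A: (name, value)
A = [("ann", 1), ("bob", 2), ("ann", 3), ("ann", 4)]
C = limit_per_group(A, 0, 2)
print(C)
```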
30. Nested Cross/Foreach
A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
B = LOAD 'votertab10k' as (name:chararray, age:int, registration,
    contributions:double);
C = cogroup A by name, B by name;
D = foreach C {
    C1 = filter A by gpa > 4;
    C2 = filter B by contributions > 500;
    C3 = cross C1, C2;
    C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'),
        (chararray)contributions);
    generate flatten(C4);
}
store D into 'output';
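As a rough illustration of what this script computes, the Python sketch below performs the same cogroup, per-group filters, cross, and concatenation on a few made-up rows. The function name is hypothetical, and Python's number formatting may differ from Pig's chararray cast.

```python
from itertools import product

# Made-up rows: (name, age, gpa) and (name, age, registration, contributions)
students = [("ann", 20, 4.5), ("ann", 21, 3.0), ("bob", 22, 4.2)]
voters = [
    ("ann", 20, "democrat", 600.0),
    ("ann", 20, "green", 100.0),
    ("bob", 22, "independent", 900.0),
]


def nested_cross(students, voters):
    """For each name, cross the filtered students with the filtered voters."""
    out = []
    names = sorted({s[0] for s in students} | {v[0] for v in voters})
    for name in names:  # one iteration per cogroup key
        c1 = [s for s in students if s[0] == name and s[2] > 4]
        c2 = [v for v in voters if v[0] == name and v[3] > 500]
        for s, v in product(c1, c2):  # nested CROSS within the group
            out.append("%s_%s" % (s[2], v[3]))  # nested FOREACH with CONCAT
    return out


D = nested_cross(students, voters)
print(D)
```

The cross product is taken only within each name's group, which is what distinguishes the nested CROSS from a top-level CROSS of the two relations.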
32. New operators to come
• Will be available in Pig 0.11
– RANK
– A distributed RANK implementation for Pig
– CUBE