Apache PIG - User Defined Functions

Apache Pig UDFs
Extending Pig to solve complex tasks

UDF = User Defined Functions

Your speaker today:
Christoph Bauer

java developer 10+ years

one of the founders

Helping our clients to use and
understand their (big) data

working in "BigData" since 2010

Why use PIG
● ad-hoc way for creating and executing
map/reduce jobs
● simple, high-level language
● more natural for analysts than map/reduce

Done.

http://leesfishandphotos.blogspot.de

UDFs to the rescue
Writing user defined functions (UDF)
+ easy to use
+ easy to code
+ keep the power of PIG
+ you can write them in java, python, ...

Do whatever you want
● image feature extraction
● geo computations
● data cleaning
● retrieve web pages
● natural language processing
● and much more

User Defined Functions
● EvalFunc<T>
public T exec(Tuple input)
● FilterFunc
public Boolean exec(Tuple input)
● Aggregate Functions
public interface Algebraic{
public String getInitial();
public String getIntermed();
public String getFinal();
}
● Load/Store Functions
public Tuple getNext()
public void putNext(Tuple input);

What? Why?
[Figure: table with the columns companyName, companyAddress and Net Worth; each cell holds several versions, most of them for Net Worth]

2010 | companyName | current Address | historical Net Worth



Example
r1, { q1:[(t1, "v1") , (t4, "v2")],
q2:[(t2, "v3"),(t7, "v4")] }
...apply UDF
r1, t1, q1:"v1", q2:"v4"
r1, t3, q1:"v1", q2:"v4"
r1, t5, q1:"v2", q2:"v4"

SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST (q2)
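
The example above can be reproduced with a minimal plain-Java sketch. It is not the UDF code from the following slides: the class VersionedColumn and its methods are hypothetical stand-ins, a TreeMap replaces the Pig bag of (timestamp, value) tuples, and timestamps t1..t7 are modelled as the numbers 1..7. The interval is half-open (start <= t < end), as in the notation above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for one versioned column (e.g. q1): timestamp -> value.
final class VersionedColumn {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    VersionedColumn put(long ts, String value) {
        versions.put(ts, value);
        return this;
    }

    // LATEST: the value with the highest timestamp, or null if the column is empty.
    String latest() {
        Map.Entry<Long, String> e = versions.lastEntry();
        return e == null ? null : e.getValue();
    }

    // Value visible at ts: the newest version written at or before ts.
    String valueAt(long ts) {
        Map.Entry<Long, String> e = versions.floorEntry(ts);
        return e == null ? null : e.getValue();
    }

    // SNAPSHOTS: one "t=value" entry per step for start <= t < end.
    List<String> snapshots(long start, long end, long step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        List<String> out = new ArrayList<>();
        for (long t = start; t < end; t += step) {
            out.add(t + "=" + valueAt(t));
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedColumn q1 = new VersionedColumn().put(1, "v1").put(4, "v2");
        VersionedColumn q2 = new VersionedColumn().put(2, "v3").put(7, "v4");
        System.out.println(q1.snapshots(1, 6, 2)); // [1=v1, 3=v1, 5=v2]
        System.out.println(q2.latest());           // v4
    }
}
```

floorEntry picks the newest version at or before the snapshot time, which is why t3 still shows "v1" and t5 shows "v2".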

LATEST
public class LATEST extends EvalFunc<Tuple> {

public Tuple exec(Tuple input) throws IOException {

}
}

LATEST (contd.)
public Tuple exec(Tuple input) throws IOException {
int numTuples = input.size();
Tuple result = tupleFactory.newTuple(numTuples);
for (int i = 0; i < numTuples; i++) {
switch (input.getType(i)) {
case DataType.BAG:
DataBag bag = (DataBag) input.get(i);
Object val = extractLatestValueFromBag(bag);
if (val != null) {
result.set(i, val);
}
break;
case DataType.MAP:
// ... MAPs need different handling
default:
// warn ...
}
}
return result;
}

SNAPSHOT
public class SNAPSHOTS extends EvalFunc<DataBag> {
@Override
public DataBag exec(Tuple input) throws IOException {
List<Tuple> listOfTuples = new ArrayList<Tuple>();

DateTime dtCur = new DateTime(start);
DateTime dtEnd = new DateTime(end).plus(1L);
while (dtCur.isBefore(dtEnd)) {
listOfTuples.add(snapshot(input, dtCur.getMillis()));

dtCur = dtCur.plus(period);
}
DataBag bag = factory.newDefaultBag(listOfTuples);
return bag;
}

SNAPSHOT (contd.)
protected Tuple snapshot(Tuple input, long ts) throws... {
int numTuples = input.size();
Tuple result = tupleFactory.newTuple(numTuples + 1);
result.set(0, ts);

for (int i = 0; i < numTuples; i++) {
switch (input.getType(i)) {
case DataType.BAG:
DataBag bag = (DataBag) input.get(i);
Object val = extractTSValueFromBag(bag, ts);
result.set(i + 1, val);
break;
case DataType.MAP:
// handle MAPs
default:
}
}
return result;
}

PigLatin
r1, { q1:[(t1, "v1") , (t4, "v2")],
q2:[(t2, "v3"),(t7, "v4")] }

REGISTER 'my-udf.jar';
DEFINE LATEST myudf.LATEST();
DEFINE SNAPSHOT myudf.SNAPSHOTS
('2000-01-01', '2013-01-01', '1y');
A = LOAD 'inputTable' AS (id, q1, q2);
B = FOREACH A GENERATE id,
SNAPSHOT(q1) AS SN, LATEST(q2) as CUR;
C = FOREACH B GENERATE id,
FLATTEN(SN), FLATTEN(CUR);
STORE C INTO 'output.csv';

Passing parameters to UDFs
DEFINE SNAPSHOT myudf.SNAPSHOTS
('2000-01-01', '2013-01-01', '1y');
...
public SNAPSHOTS(String start, String end, String increment) {
super();
this.start = Long.parseLong(start);
this.end = Long.parseLong(end);
this.increment = parseLong(increment);
}
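
Pig hands every DEFINE argument to the constructor as a String. Long.parseLong would reject a value such as '2000-01-01', so date and period strings need their own parsing. How the original UDF does this is not shown; the following is a minimal sketch under stated assumptions: the class SnapshotParams and its helpers are hypothetical, dates are read as ISO dates at midnight UTC, and java.time is used instead of the Joda-Time classes from the slides.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.Locale;

// Hypothetical holder for the three constructor arguments of the snapshot UDF.
final class SnapshotParams {
    final long startMillis;
    final long endMillis;
    final Period increment;

    SnapshotParams(String start, String end, String increment) {
        this.startMillis = toEpochMillis(start);
        this.endMillis = toEpochMillis(end);
        this.increment = parseIncrement(increment);
    }

    // "2000-01-01" -> epoch milliseconds at midnight UTC (the time zone is an assumption).
    static long toEpochMillis(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    // "1y" -> a period of one year, by reusing the ISO-8601 period syntax ("P1Y").
    static Period parseIncrement(String text) {
        return Period.parse("P" + text.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        SnapshotParams p = new SnapshotParams("2000-01-01", "2013-01-01", "1y");
        System.out.println(p.startMillis); // 946684800000
        System.out.println(p.increment);   // P1Y
    }
}
```

Failing fast in the constructor means a malformed parameter is reported when the script is parsed rather than inside a running map/reduce task.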

I didn't talk about
● Every mapper, reducer, ... runs its own single
UDF instance: use instance variables
for locally shared objects
● Watch your heap when using Lucene
Indexes, or implementing the Algebraic
interface
● do implement
public Schema outputSchema(Schema input)
● report progress when doing time consuming
stuff
● Performance?
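
The first point, instance variables for locally shared objects, can be sketched in plain Java without Pig on the classpath. The class CountryNameLookup, its lookup table and the load counter are hypothetical; in a real UDF the same pattern would sit inside exec(Tuple input).

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for a UDF class; all names here are illustrative only.
final class CountryNameLookup {
    private Map<String, String> table; // shared by all exec() calls on this instance
    private int loadCount = 0;         // counts how often the expensive setup ran

    // Stands in for expensive setup such as opening an index or reading a file.
    private Map<String, String> loadTable() {
        loadCount++;
        Map<String, String> m = new HashMap<>();
        m.put("DE", "Germany");
        m.put("FR", "France");
        return m;
    }

    // Called once per input record; the table is built on the first call only.
    String exec(String countryCode) {
        if (countryCode == null) {
            return null;
        }
        if (table == null) {
            table = loadTable();
        }
        return table.get(countryCode);
    }

    int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        CountryNameLookup udf = new CountryNameLookup();
        System.out.println(udf.exec("DE"));  // Germany
        System.out.println(udf.exec("FR"));  // France
        System.out.println(udf.loadCount()); // 1
    }
}
```

Whatever the instance keeps in such a field stays on the heap for the lifetime of the task, which ties in with the heap warning above.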

SNAPSHOT (contd.)
@Override
public Schema outputSchema(Schema input) {
List<FieldSchema> out = new ArrayList<FieldSchema>();
out.add(new FieldSchema("snapshot", DataType.LONG));

for (FieldSchema fieldSchema : input.getFields()) {
String alias = fieldSchema.alias;
byte type = fieldSchema.type;
out.add(new FieldSchema(alias, type));
}
Schema bagSchema = new Schema(out);
try {
return new Schema(new FieldSchema(
getSchemaName("snapshots", input), bagSchema, DataType.BAG));
} catch (FrontendException e) {
// the bag schema could not be built: fall through and return null
}
return null;
}

Reality check
● These UDFs are in production,
● Producing reports with up to 60GB
● Data is stored in HBase

Wrapping it up
We at Oberbaum Concept developed a bunch
of PIG Functions handling versioned data in
HBase.
● Rewrote HBaseStorage
● UDFs for Snapshots, Latest

Right now we are trying to push our changes
back into PIG.

Thank you!
Christoph Bauer

christoph.bauer@oberbaum-concept.com
https://www.xing.com/profile/Christoph_Bauer62
