2. Your speaker today:
Christoph Bauer
Java developer for 10+ years
one of the founders
Helping our clients to use and
understand their (big) data
working in "BigData" since 2010
3. Why use PIG
● ad-hoc way of creating and executing
map/reduce jobs
● simple, high-level language
● more natural for analysts than map/reduce
6. UDFs to the rescue
Writing user defined functions (UDF)
+ easy to use
+ easy to code
+ keep the power of PIG
+ you can write them in Java, Python, ...
7. Do whatever you want
● image feature extraction
● geo computations
● data cleaning
● retrieve web pages
● natural language processing
● and much more...
8. User Defined Functions
● EvalFunc<T>
public T exec(Tuple input)
● FilterFunc
public Boolean exec(Tuple input)
● Aggregate Functions
public interface Algebraic{
public String getInitial();
public String getIntermed();
public String getFinal();
}
● Load/Store Functions
public Tuple getNext()
public void putNext(Tuple input);
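The three names in the Algebraic interface correspond to three stages: an initial step on the map side, an intermediate step in the combiner, and a final step in the reducer. The following is a minimal plain-Java sketch of that idea for an average. It deliberately uses no PIG classes; the class AlgebraicAverageSketch, the Partial type and the method names are hypothetical stand-ins, not the PIG API.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for the three stages an Algebraic UDF names (no PIG classes). */
final class AlgebraicAverageSketch {

    /** Partial result passed between stages: a running sum and a count. */
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** Initial stage (map side): turn one input value into a partial result. */
    static Partial initial(double value) {
        return new Partial(value, 1);
    }

    /** Intermediate stage (combiner): merge partial results into one partial result. */
    static Partial intermed(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new Partial(sum, count);
    }

    /** Final stage (reducer): merge the remaining partials and emit the average. */
    static double finalStage(List<Partial> partials) {
        Partial total = intermed(partials);
        return total.count == 0 ? Double.NaN : total.sum / total.count;
    }

    public static void main(String[] args) {
        // Two hypothetical "mappers" pre-aggregate their own values.
        Partial a = intermed(Arrays.asList(initial(10), initial(20)));
        Partial b = intermed(Arrays.asList(initial(30)));
        System.out.println(finalStage(Arrays.asList(a, b))); // prints 20.0
    }
}
```

The point of the split is that only small partial results (here a sum and a count) travel between stages, instead of the full bag of values.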
9. What? Why?
[Figure: stored versions per column: companyName (1), companyAddress (3), Net Worth (9)]
2010 | companyName | current Address | historical Net Worth
2011 | companyName | current Address | historical Net Worth
2012 | companyName | current Address | historical Net Worth
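One way to read the rows above: for each snapshot year, take the current address and the net worth that was valid at that time. The sketch below shows that lookup in plain Java, assuming each column keeps its versions in a map sorted by timestamp. The class SnapshotSketch, its methods, the use of years as timestamps and the sample values are all hypothetical; this is not the UDF from the talk.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of picking values from versioned columns (no PIG or HBase classes). */
final class SnapshotSketch {

    /** Value valid at the snapshot time: the newest version written at or before it, else null. */
    static <V> V valueAt(TreeMap<Long, V> versions, long snapshot) {
        Map.Entry<Long, V> entry = versions.floorEntry(snapshot);
        return entry == null ? null : entry.getValue();
    }

    /** Current value: the newest version regardless of snapshot time, else null. */
    static <V> V latest(TreeMap<Long, V> versions) {
        Map.Entry<Long, V> entry = versions.lastEntry();
        return entry == null ? null : entry.getValue();
    }

    /** One output row: snapshot | current address | net worth as of the snapshot. */
    static String row(long snapshot, TreeMap<Long, String> address, TreeMap<Long, Long> netWorth) {
        return snapshot + " | " + latest(address) + " | " + valueAt(netWorth, snapshot);
    }

    public static void main(String[] args) {
        // Hypothetical sample data; keys are years used as simplified timestamps.
        TreeMap<Long, String> address = new TreeMap<>();
        address.put(2009L, "Old Street 1");
        address.put(2011L, "New Street 2");

        TreeMap<Long, Long> netWorth = new TreeMap<>();
        netWorth.put(2010L, 100L);
        netWorth.put(2011L, 150L);
        netWorth.put(2012L, 175L);

        for (long year = 2010; year <= 2012; year++) {
            System.out.println(row(year, address, netWorth));
        }
    }
}
```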
17. I didn't talk about
● A UDF runs as a single instance in every
mapper, reducer, ...; use instance variables
for locally shared objects
● Watch your heap when using Lucene
indexes or implementing the Algebraic
interface
● Do implement
public Schema outputSchema(Schema input)
● Report progress when doing time-consuming
stuff
● Performance?
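The first and fourth points can be sketched together: keep an expensive object in an instance variable so it is built once per UDF instance, and signal progress on every call. The class below is a plain-Java stand-in, not a PIG UDF; CachedLookupSketch, the lookup table and the Runnable progress callback are hypothetical and only mimic the pattern.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in for a UDF that reuses one expensive object across calls. */
final class CachedLookupSketch {

    // Instance variable: built on first use, then shared by all later exec calls.
    private Map<String, String> lookup;
    private int loads = 0;

    // Hypothetical stand-in for a progress callback supplied by the framework.
    private final Runnable progress;

    CachedLookupSketch(Runnable progress) {
        this.progress = progress;
    }

    /** Resolves a key, loading the lookup table lazily and reporting progress. */
    String exec(String key) {
        if (lookup == null) {
            lookup = loadLookup();
        }
        progress.run();
        return lookup.get(key);
    }

    /** Simulates an expensive setup step such as opening an index. */
    private Map<String, String> loadLookup() {
        loads++;
        Map<String, String> table = new HashMap<>();
        table.put("DE", "Germany");
        table.put("FR", "France");
        return table;
    }

    /** Number of times the expensive setup ran. */
    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        CachedLookupSketch udf = new CachedLookupSketch(() -> { });
        System.out.println(udf.exec("DE"));
        System.out.println(udf.exec("FR"));
        System.out.println("loads: " + udf.loadCount());
    }
}
```

Anything held this way stays on the heap for the lifetime of the task, which is where the heap warning above comes in.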
18. SNAPSHOT (contd.)
@Override
public Schema outputSchema(Schema input) {
    // Each tuple in the output bag: the snapshot timestamp, then every input field.
    List<FieldSchema> out = new ArrayList<FieldSchema>();
    out.add(new FieldSchema("snapshot", DataType.LONG));
    for (FieldSchema fieldSchema : input.getFields()) {
        String alias = fieldSchema.alias;
        byte type = fieldSchema.type;
        out.add(new FieldSchema(alias, type));
    }
    Schema bagSchema = new Schema(out);
    try {
        // Wrap the tuple schema in a bag named after the UDF and its input.
        return new Schema(new FieldSchema(
                getSchemaName("snapshots", input), bagSchema, DataType.BAG));
    } catch (FrontendException e) {
        // Schema could not be built: fall through and declare no schema.
    }
    return null;
}
19. Reality check
● These UDFs are in production
● Producing reports of up to 60 GB
● Data is stored in HBase
20. Wrapping it up
We at Oberbaum Concept developed a set
of PIG functions for handling versioned data in
HBase.
● Rewrote HBaseStorage
● UDFs for Snapshots, Latest
Right now we are trying to push our changes
back into PIG.