Apache PIG - User Defined Functions

Extending Pig to solve complex tasks

Published in: Education

  1. Apache Pig UDFs: Extending Pig to solve complex tasks. UDF = User Defined Function.
  2. Your speaker today: Christoph Bauer. Java developer for 10+ years; one of the founders; helping our clients to use and understand their (big) data; working in "BigData" since 2010.
  3. Why use PIG
     ● ad-hoc way of creating and executing map/reduce jobs
     ● simple, high-level language
     ● more natural for analysts than map/reduce
  4. Done. http://leesfishandphotos.blogspot.de
  5. Oh, wait...
  6. UDFs to the rescue: writing user defined functions (UDFs)
     + easy to use
     + easy to code
     + keep the power of PIG
     + you can write them in Java, Python, ...
  7. Do whatever you want
     ● image feature extraction
     ● geo computations
     ● data cleaning
     ● retrieve web pages
     ● natural language processing ...
     ● much more...
  8. User Defined Functions
     ● EvalFunc<T>
         public T exec(Tuple input)
     ● FilterFunc
         public Boolean exec(Tuple input)
     ● Aggregate functions
         public interface Algebraic {
             public String getInitial();
             public String getIntermed();
             public String getFinal();
         }
     ● Load/Store functions
         public Tuple getNext()
         public void putNext(Tuple input)
  9. What? Why?
     [Diagram: one stored row with a single companyName, several versions of companyAddress, and many versions of Net Worth]
     Desired report, one row per year:
     2010 | companyName | current Address | historical Net Worth
     2011 | companyName | current Address | historical Net Worth
     2012 | companyName | current Address | historical Net Worth
  10. Example
      r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
      ... apply UDF
      r1, t1, q1:"v1", q2:"v4"
      r1, t3, q1:"v1", q2:"v4"
      r1, t5, q1:"v2", q2:"v4"
      SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST(q2)
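The example in slide 10 can be read as two lookups over a versioned column: LATEST picks the value with the highest timestamp, and each snapshot picks the value that was current at a given time. The following is a minimal plain-Java sketch of those two lookups. It assumes numeric timestamps and uses a hypothetical `Cell` type instead of Pig's `DataBag` and `Tuple`; it is not the speaker's `extractLatestValueFromBag` or `extractTSValueFromBag`, whose code does not appear in the slides.

```java
import java.util.List;

/** Plain-Java stand-in for a versioned column; class and method names are illustrative. */
final class VersionedCellsSketch {

    /** One (timestamp, value) version of a column; hypothetical, not a Pig type. */
    static final class Cell {
        final long timestamp;
        final String value;

        Cell(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Returns the value with the highest timestamp, or null if there are no cells. */
    static String latest(List<Cell> cells) {
        Cell best = null;
        for (Cell c : cells) {
            if (best == null || c.timestamp > best.timestamp) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }

    /** Returns the newest value written at or before ts, or null if none exists yet. */
    static String valueAt(List<Cell> cells, long ts) {
        Cell best = null;
        for (Cell c : cells) {
            if (c.timestamp <= ts && (best == null || c.timestamp > best.timestamp)) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }
}
```

With q1 = [(1, "v1"), (4, "v2")] and q2 = [(2, "v3"), (7, "v4")], `valueAt(q1, ...)` at times 1, 3 and 5 and `latest(q2)` reproduce the three output rows of the slide.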
  11. LATEST
      public class LATEST extends EvalFunc<Tuple> {
          public Tuple exec(Tuple input) throws IOException {
          }
      }
  12. LATEST (contd.)
      public Tuple exec(Tuple input) throws IOException {
          int numTuples = input.size();
          Tuple result = tupleFactory.newTuple(numTuples);
          for (int i = 0; i < numTuples; i++) {
              switch (input.getType(i)) {
              case DataType.BAG:
                  DataBag bag = (DataBag) input.get(i);
                  Object val = extractLatestValueFromBag(bag);
                  if (val != null) {
                      result.set(i, val);
                  }
                  break;
              case DataType.MAP:
                  // ... MAPs need different handling
              default:
                  // warn ...
              }
          }
          return result;
      }
      Input shown next to the code: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
  13. SNAPSHOT
      public class SNAPSHOTS extends EvalFunc<DataBag> {
          @Override
          public DataBag exec(Tuple input) throws IOException {
              List<Tuple> listOfTuples = new ArrayList<Tuple>();
              DateTime dtCur = new DateTime(start);
              // add 1 ms so that the end timestamp itself is included
              DateTime dtEnd = new DateTime(end).plus(1L);
              while (dtCur.isBefore(dtEnd)) {
                  // snapshot() takes the timestamp as a long
                  listOfTuples.add(snapshot(input, dtCur.getMillis()));
                  dtCur = dtCur.plus(period);
              }
              DataBag bag = factory.newDefaultBag(listOfTuples);
              return bag;
          }
  14. SNAPSHOT (contd.)
      protected Tuple snapshot(Tuple input, long ts) throws... {
          int numTuples = input.size();
          Tuple result = tupleFactory.newTuple(numTuples + 1);
          result.set(0, ts);
          for (int i = 0; i < numTuples; i++) {
              switch (input.getType(i)) {
              case DataType.BAG:
                  DataBag bag = (DataBag) input.get(i);
                  Object val = extractTSValueFromBag(bag, ts);
                  result.set(i + 1, val);
                  break;
              case DataType.MAP:
                  // handle MAPs
              default:
              }
          }
          return result;
      }
      Input shown next to the code: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
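The while loop in slide 13 walks from the start time to the end time (inclusive) in steps of one period and takes one snapshot per step. A rough plain-Java sketch of just that stepping logic follows. It assumes long timestamps and a fixed numeric period in place of the Joda-Time `DateTime` and period objects used on the slide, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stepping logic only; not the speaker's SNAPSHOTS class. */
final class SnapshotTimesSketch {

    /** Returns start, start + period, ... for every step that is still <= end. */
    static List<Long> snapshotTimes(long start, long end, long period) {
        if (period <= 0) {
            // a non-positive step would never reach the end
            throw new IllegalArgumentException("period must be positive");
        }
        List<Long> times = new ArrayList<>();
        for (long t = start; t <= end; t += period) {
            times.add(t);
        }
        return times;
    }
}
```

Each returned timestamp would then be handed to a per-timestamp lookup such as `snapshot(input, ts)` from slide 14.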
  15. PigLatin
      Input: r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }
      REGISTER my-udf.jar;
      DEFINE LATEST myudf.Latest();
      DEFINE SNAPSHOT myudf.Snapshot('2000-01-01', '2013-01-01', '1y');
      A = LOAD 'inputTable' AS (id, q1, q2);
      B = FOREACH A GENERATE id, SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
      C = FOREACH B GENERATE id, FLATTEN(SN), FLATTEN(CUR);
      STORE C INTO 'output.csv';
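  16. Passing parameters to UDFs
      DEFINE SNAPSHOT cool.udf.Snapshot('2000-01-01', '2013-01-01', '1y');
      ...
      public SNAPSHOTS(String start, String end, String increment) {
          super();
          // DEFINE arguments always arrive as strings.
          // Long.parseLong accepts numeric strings only (e.g. epoch milliseconds),
          // not date literals such as '2000-01-01' or '1y'.
          this.start = Long.parseLong(start);
          this.end = Long.parseLong(end);
          this.increment = Long.parseLong(increment);
      }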
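The DEFINE statement on slide 16 passes date-like strings, while the constructor shown parses plain numbers. One possible way to accept the date form is sketched below. It uses `java.time` rather than the Joda-Time classes seen on the earlier slides, assumes ISO dates interpreted as midnight UTC, and assumes increments written like `1y`, `6m` or `30d`; the class name `SnapshotArgsSketch` and its fields are hypothetical.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.Locale;

/** Illustrative argument parsing for a UDF constructor; not the speaker's code. */
final class SnapshotArgsSketch {
    final long startMillis;
    final long endMillis;
    final Period increment;

    SnapshotArgsSketch(String start, String end, String increment) {
        this.startMillis = toEpochMillis(start);
        this.endMillis = toEpochMillis(end);
        // "1y" becomes the ISO-8601 period "P1Y"
        this.increment = Period.parse("P" + increment.toUpperCase(Locale.ROOT));
    }

    /** Parses an ISO date such as 2000-01-01 as midnight UTC (an assumption). */
    static long toEpochMillis(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }
}
```

Parsing in the constructor means a malformed argument fails once, when the UDF is instantiated, instead of on every call to `exec`.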
  17. I didn't talk about
      ● UDFs run as a single instance in every mapper, reducer, ...: use instance variables for locally shared objects
      ● Watch your heap when using Lucene indexes or implementing the Algebraic interface
      ● Do implement public Schema outputSchema(Schema input)
      ● Report progress when doing time-consuming stuff
      ● Performance?
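The first bullet of slide 17 (one UDF instance per task, so instance variables can hold shared objects) might look roughly like the sketch below. It is plain Java with no Pig dependency: the `exec(String)` method only mimics the shape of a UDF call, and the class name, the lookup table and its contents are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative lazy initialisation of per-instance state; not a real Pig UDF. */
final class LookupUdfSketch {
    private Map<String, String> lookup; // built on first use, then reused
    int loads = 0;                      // counts how often the table was built

    /** Mimics a UDF call: every call reuses the same lookup table. */
    String exec(String countryCode) {
        if (lookup == null) {
            lookup = loadLookup();
        }
        return lookup.get(countryCode);
    }

    /** Stand-in for an expensive load, for example reading an index from disk. */
    private Map<String, String> loadLookup() {
        loads++;
        Map<String, String> m = new HashMap<>();
        m.put("DE", "Germany");
        m.put("FR", "France");
        return m;
    }
}
```

Because the table lives as long as the instance, its size counts against the task's heap, which is the concern raised in the second bullet.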
  18. SNAPSHOT (contd.)
      @Override
      public Schema outputSchema(Schema input) {
          List<Schema.FieldSchema> out = new ArrayList<Schema.FieldSchema>();
          // first field of every snapshot tuple is the snapshot timestamp
          out.add(new FieldSchema("snapshot", DataType.LONG));
          // followed by the input fields, with alias and type unchanged
          for (FieldSchema fieldSchema : input.getFields()) {
              String alias = fieldSchema.alias;
              byte type = fieldSchema.type;
              out.add(new FieldSchema(alias, type));
          }
          Schema bagSchema = new Schema(out);
          try {
              return new Schema(new FieldSchema(
                  getSchemaName("snapshots", input), bagSchema, DataType.BAG));
          } catch (FrontendException e) {
          }
          return null;
      }
  19. Reality check
      ● These UDFs are in production
      ● Producing reports with up to 60GB
      ● Data is stored in HBase
  20. Wrapping it up
      We at Oberbaum Concept developed a bunch of PIG functions handling versioned data in HBase.
      ● Rewrote HBaseStorage
      ● UDFs for Snapshots, Latest
      Right now we are trying to push our changes back into PIG.
  21. Questions?
  22. Thank you! Christoph Bauer, christoph.bauer@oberbaum-concept.com, https://www.xing.com/profile/Christoph_Bauer62