Apache Pig UDFs
Extending Pig to solve complex tasks

   UDF = User Defined Functions
Your speaker today:
          Christoph Bauer

          Java developer for 10+ years

          one of the founders

          Helping our clients to use and
          understand their (big) data

          working in "BigData" since 2010
Why use PIG
● ad-hoc way for creating and executing
  map/reduce jobs
● simple, high-level language
● more natural for analysts than map/reduce
Done.




        http://leesfishandphotos.blogspot.de
Oh, wait...
UDFs to the rescue
Writing user defined functions (UDF)
+ easy to use
+ easy to code
+ keep the power of PIG
+ you can write them in Java, Python, ...
Do whatever you want
● image feature extraction
● geo computations
● data cleaning
● retrieve web pages
● natural language processing
  ...
● much more...
User Defined Functions
● EvalFunc<T>
  public T exec(Tuple input) throws IOException
● FilterFunc
  public Boolean exec(Tuple input)
● Aggregate Functions
  public interface Algebraic{
      public String getInitial();
      public String getIntermed();
      public String getFinal();
  }
● Load/Store Functions
  public Tuple getNext()
  public void putNext(Tuple input);
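The three methods of Algebraic each return the name of a class that implements one phase of the aggregation. As a rough illustration of why the work is split this way, here is a minimal sketch of a count computed in three phases. It uses plain Java collections only, not Pig classes, and the class and method names (AlgebraicCountSketch, initial, intermed, finalCount) are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the three Algebraic phases; no Pig classes are used.
class AlgebraicCountSketch {

    // Initial phase: turns a chunk of raw records into a partial count.
    static long initial(List<String> chunk) {
        return chunk.size();
    }

    // Intermediate phase: merges partial counts; it may be applied several times.
    static long intermed(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final phase: merges the remaining partial counts into the result.
    static long finalCount(List<Long> partials) {
        return intermed(partials);
    }

    public static void main(String[] args) {
        long a = initial(Arrays.asList("r1", "r2", "r3"));
        long b = initial(Arrays.asList("r4"));
        long c = initial(Arrays.asList("r5", "r6"));
        // Two partial counts are merged early, the third one only at the end.
        long combined = intermed(Arrays.asList(a, b));
        System.out.println(finalCount(Arrays.asList(combined, c)));
    }
}
```

Because partial results can be merged in any grouping, only small intermediate values need to be shipped between the phases instead of all raw records.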
What? Why?
[Figure: one companyName with several versions of companyAddress and many versions of Net Worth, turned into one row per year:]




2010 | companyName | current Address | historical Net Worth


2011 | companyName | current Address | historical Net Worth


2012 | companyName | current Address | historical Net Worth
Example
r1, { q1:[(t1, "v1") , (t4, "v2")],
      q2:[(t2, "v3"),(t7, "v4")] }
...apply UDF
r1, t1, q1:"v1", q2:"v4"
r1, t3, q1:"v1", q2:"v4"
r1, t5, q1:"v2", q2:"v4"


SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST (q2)
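The example above can be replayed without Pig. The following is a minimal sketch, assuming each versioned column is held in a TreeMap from timestamp to value (a stand-in for the bags of (timestamp, value) pairs) and that t1, t4, ... are simply the numbers 1, 4, ... All names (VersionedCellSketch, valueAt, latest, snapshots) are hypothetical and not taken from the real UDFs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the SNAPSHOTS / LATEST semantics from the example slide.
class VersionedCellSketch {

    // Value that was current at time ts: the version with the greatest
    // timestamp <= ts, or null if no version existed yet.
    static String valueAt(TreeMap<Long, String> versions, long ts) {
        Map.Entry<Long, String> entry = versions.floorEntry(ts);
        return entry == null ? null : entry.getValue();
    }

    // Most recent value regardless of time, or null for an empty column.
    static String latest(TreeMap<Long, String> versions) {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    // One output row per snapshot time t with start <= t < end.
    static List<String> snapshots(String rowKey,
                                  TreeMap<Long, String> q1,
                                  TreeMap<Long, String> q2,
                                  long start, long end, long step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        List<String> rows = new ArrayList<>();
        for (long t = start; t < end; t += step) {
            rows.add(rowKey + ", t" + t + ", q1:" + valueAt(q1, t) + ", q2:" + latest(q2));
        }
        return rows;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> q1 = new TreeMap<>();
        q1.put(1L, "v1");
        q1.put(4L, "v2");
        TreeMap<Long, String> q2 = new TreeMap<>();
        q2.put(2L, "v3");
        q2.put(7L, "v4");
        // SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST(q2)
        for (String row : snapshots("r1", q1, q2, 1, 6, 2)) {
            System.out.println(row);
        }
    }
}
```

With the sample data this yields snapshots at t1, t3 and t5: q1 is still "v1" at t3 and only becomes "v2" from t4 on, while q2 always reports its newest value.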
LATEST
public class LATEST extends EvalFunc<Tuple> {

    public Tuple exec(Tuple input) throws IOException {

    }
}
LATEST (contd.)
public Tuple exec(Tuple input) throws IOException {
    int numTuples = input.size();
    // Fields that yield no value stay null in the result.
    Tuple result = tupleFactory.newTuple(numTuples);
    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            DataBag bag = (DataBag) input.get(i);
            // extractLatestValueFromBag is a helper not shown on the slide
            Object val = extractLatestValueFromBag(bag);
            if (val != null) {
                result.set(i, val);
            }
            break;
        case DataType.MAP:
            // ... MAPs need different handling
            break;
        default:
            // warn ...
            break;
        }
    }
    return result;
}
SNAPSHOT
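public class SNAPSHOTS extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        List<Tuple> listOfTuples = new ArrayList<Tuple>();

        DateTime dtCur = new DateTime(start);
        // Add one millisecond so that 'end' itself is still included.
        DateTime dtEnd = new DateTime(end).plus(1L);
        while (dtCur.isBefore(dtEnd)) {
            // snapshot() takes the timestamp in milliseconds
            listOfTuples.add(snapshot(input, dtCur.getMillis()));

            // 'increment' is the field set in the constructor
            dtCur = dtCur.plus(increment);
        }
        DataBag bag = factory.newDefaultBag(listOfTuples);
        return bag;
    }

    // ... snapshot() and outputSchema() are shown separately
}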
SNAPSHOT (contd.)
protected Tuple snapshot(Tuple input, long ts) throws IOException {
    int numTuples = input.size();
    // One extra leading field holds the snapshot timestamp.
    Tuple result = tupleFactory.newTuple(numTuples + 1);
    result.set(0, ts);

    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            DataBag bag = (DataBag) input.get(i);
            // extractTSValueFromBag is a helper not shown on the slide
            Object val = extractTSValueFromBag(bag, ts);
            result.set(i + 1, val);
            break;
        case DataType.MAP:
            // handle MAPs
            break;
        default:
            break;
        }
    }
    return result;
}
PigLatin
-- sample input row:
--   r1, { q1:[(t1, "v1") , (t4, "v2")],
--         q2:[(t2, "v3"),(t7, "v4")] }

REGISTER 'my-udf.jar';
DEFINE LATEST myudf.Latest();
DEFINE SNAPSHOT myudf.Snapshot
              ('2000-01-01', '2013-01-01', '1y');
A = LOAD 'inputTable' AS (id, q1, q2);
B = FOREACH A GENERATE id,
    SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
C = FOREACH B GENERATE id,
    FLATTEN(SN), FLATTEN(CUR);
STORE C INTO 'output.csv';
Passing parameters to UDFs
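DEFINE SNAPSHOT cool.udf.Snapshot
                 ('2000-01-01', '2013-01-01', '1y');
...
// Pig passes each DEFINE argument to the constructor as a String.
public SNAPSHOTS
(String start, String end, String increment)
{
    super();
    // Long.parseLong would reject date strings such as '2000-01-01',
    // so let Joda-Time parse the ISO date and keep the milliseconds.
    this.start = new DateTime(start).getMillis();
    this.end = new DateTime(end).getMillis();
    // parseLong here is an unqualified helper not shown on the slide;
    // it has to turn '1y' into a step width.
    this.increment = parseLong(increment);
}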
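How the three string arguments could be turned into a list of snapshot dates can be sketched with the JDK alone. This is only an illustration under stated assumptions: it uses java.time instead of the Joda-Time classes seen on the slides, it assumes the shorthand '1y' maps to the ISO-8601 period "P1Y", and the class name SnapshotGridSketch is hypothetical.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of parsing the DEFINE arguments; java.time replaces Joda-Time here.
class SnapshotGridSketch {
    final LocalDate start;
    final LocalDate end;
    final Period increment;

    // Mirrors a UDF constructor: every argument arrives as a String.
    SnapshotGridSketch(String start, String end, String increment) {
        this.start = LocalDate.parse(start);   // ISO date, e.g. 2000-01-01
        this.end = LocalDate.parse(end);
        // Assumption: "1y" is shorthand for the ISO-8601 period "P1Y".
        this.increment = Period.parse("P" + increment.toUpperCase(Locale.ROOT));
        if (this.increment.isZero() || this.increment.isNegative()) {
            throw new IllegalArgumentException("increment must be positive");
        }
    }

    // All snapshot dates from start to end, with end itself included.
    List<LocalDate> snapshotDates() {
        List<LocalDate> dates = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plus(increment)) {
            dates.add(d);
        }
        return dates;
    }

    public static void main(String[] args) {
        SnapshotGridSketch grid = new SnapshotGridSketch("2000-01-01", "2013-01-01", "1y");
        System.out.println(grid.snapshotDates());
    }
}
```

Validating the arguments in the constructor means a bad DEFINE line fails as soon as the UDF is instantiated rather than somewhere in the middle of a job.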
I didn't talk about
● UDFs run as a single instance in every
  mapper, reducer, ... use instance variables
  for locally shared objects
● Watch your heap when using Lucene
  Indexes, or implementing the Algebraic
  interface
● do implement
  public Schema outputSchema(Schema input)
● report progress when doing time consuming
  stuff
● Performance?
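The first point, keeping expensive objects in instance variables, can be sketched roughly as follows. This is plain Java without Pig classes; the class LazyResourceSketch, its exec method and the tiny lookup table are made up for illustration and stand in for something costly such as an index or a dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: build an expensive resource once per instance and reuse it per call.
class LazyResourceSketch {
    // Counts how often the resource was built (for demonstration only).
    int loads = 0;

    // Created on first use, then shared by all later calls on this instance.
    private Map<String, String> lookup;

    // Stand-in for a UDF's exec(): corrects known misspellings.
    String exec(String input) {
        if (lookup == null) {
            lookup = loadLookup();
        }
        String corrected = lookup.get(input);
        return corrected == null ? input : corrected;
    }

    // Stand-in for an expensive initialization step.
    private Map<String, String> loadLookup() {
        loads++;
        Map<String, String> table = new HashMap<>();
        table.put("companyAdress", "companyAddress");
        return table;
    }

    public static void main(String[] args) {
        LazyResourceSketch udf = new LazyResourceSketch();
        System.out.println(udf.exec("companyAdress"));
        System.out.println(udf.exec("companyName"));
        System.out.println("resource built " + udf.loads + " time(s)");
    }
}
```

Since one instance handles many records, the cost of building the resource is paid once per instance instead of once per record. Everything held this way stays on the heap for the lifetime of the task.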
SNAPSHOT (contd.)
@Override
public Schema outputSchema(Schema input) {
    List<FieldSchema> out = new ArrayList<FieldSchema>();
    // The first field of every snapshot tuple is its timestamp.
    out.add(new FieldSchema("snapshot", DataType.LONG));

    for (FieldSchema fieldSchema : input.getFields()) {
        String alias = fieldSchema.alias;
        byte type = fieldSchema.type;
        out.add(new FieldSchema(alias, type));
    }
    Schema bagSchema = new Schema(out);
    try {
        return new Schema(new FieldSchema(
            getSchemaName("snapshots", input), bagSchema, DataType.BAG));
    } catch (FrontendException e) {
        // fall through and report no schema
    }
    return null;
}
Reality check
● These UDFs are in production
● They produce reports of up to 60 GB
● Data is stored in HBase
Wrapping it up
We at Oberbaum Concept developed a bunch
of PIG Functions handling versioned data in
HBase.
● Rewrote HBaseStorage
● UDFs for Snapshots, Latest

Right now we are trying to push our changes
back into PIG.
Questions?
Thank you!
                Christoph Bauer




christoph.bauer@oberbaum-concept.com
https://www.xing.com/profile/Christoph_Bauer62
