Pig programming is more fun: New features in Pig



Daniel Dai (@daijy)
Thejas Nair (@thejasn)




© Hortonworks Inc. 2011                        Page 1
What is Apache Pig?
  • Pig Latin, a high level data processing language.
  • An engine that executes Pig Latin locally or on a Hadoop cluster.




Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

                  Architecting the Future of Big Data
                                                                                              Page 2
                  © Hortonworks Inc. 2011
Pig-latin example
• Query : Get the list of pages visited by users whose age is
  between 20 and 25 years.

users = load 'users' as (name, age);

users_20_to_25 = filter users by age >= 20 and age <= 25;

page_views = load 'pages' as (user, url);

page_views_u20_to_25 = join users_20_to_25 by name,
                            page_views by user;
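
To make the query's semantics concrete, here is a minimal plain-Python sketch of the same filter-then-join flow on in-memory lists. The sample rows, the function names, and the inclusive age bounds are assumptions made for illustration; this is not how Pig executes the script.

```python
# Toy in-memory stand-ins for the 'users' and 'pages' inputs (made-up sample rows).
users = [("alice", 22), ("bob", 31), ("carol", 25)]
page_views = [("alice", "/home"), ("bob", "/cart"), ("carol", "/help"), ("alice", "/faq")]


def filter_users(rows, low=20, high=25):
    """Keep users whose age lies in [low, high]; inclusive bounds are an assumption."""
    return [(name, age) for name, age in rows if low <= age <= high]


def join_by_name(user_rows, view_rows):
    """Inner join on user name, emitting (name, age, user, url) tuples."""
    return [
        (name, age, user, url)
        for name, age in user_rows
        for user, url in view_rows
        if name == user
    ]


young = filter_users(users)
joined = join_by_name(young, page_views)
for row in joined:
    print(row)
```

The nested loop is quadratic and only suitable for a toy input; it is meant to show which rows survive, not how a join is executed at scale.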

Why Pig?
• Faster development
  –  Fewer lines of code
  –  Don’t re-invent the wheel

• Flexible
  –  Metadata is optional
  –  Extensible
  –  Procedural programming



         Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Before Pig 0.9
   [Diagram: scripts p1.pig, p2.pig, p3.pig]




With Pig macros
   [Diagram: scripts p1.pig, p2.pig, p3.pig, with macro files macro1.pig and macro2.pig]




With Pig macros
   [Diagram: p1.pig shown next to p1.pig, rm_bots.pig, and get_top.pig]




Pig macro example
• Page_views data : (user_name, url, timestamp, …)
• Find top 5 users by page views
• Find top 10 most visited pages.




Pig Macro example
Without macros:

page_views = LOAD ..
/* get top 5 users by page view */
u_grp = GROUP .. by uname;
u_count = FOREACH .. COUNT ..
ord_u_count = ORDER u_count ..
top_5_users = LIMIT ordered.. 5;
DUMP top_5_users;

/* get top 10 urls by page view */
url_grp = GROUP .. by url;
url_count = FOREACH .. COUNT ..
ord_url_count = ORDER url_count..
top_10_urls = LIMIT ord_url.. 10;
DUMP top_10_urls;

With a macro:

/* top x macro */
DEFINE topCount (rel, col, topNum)
RETURNS top_num_recs {
 grped = GROUP $rel by $col;
 cnt_grp = FOREACH ..COUNT($rel)..
 ord_cnt = ORDER .. by cnt;
 $top_num_recs = LIMIT.. $topNum;
}
-----------------------------------------
page_views = LOAD ..
/* get top 5 users by page view */
top_5_users = topCount(page_views, uname, 5);
DUMP top_5_users;
…
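
The macro plays the role a function plays in a general-purpose language: one parameterized group, count, order, limit pipeline that is reused for users and URLs. A rough plain-Python analogue is sketched below; `top_count`, the sample rows, and the use of field indexes instead of column names are assumptions for illustration only.

```python
from collections import Counter


def top_count(records, col, top_num):
    """Group records by field index `col`, count each group, return the `top_num` largest."""
    counts = Counter(rec[col] for rec in records)
    return counts.most_common(top_num)


# Made-up (user_name, url) rows standing in for page_views.
page_views = [
    ("ann", "/a"), ("ann", "/b"), ("ann", "/a"),
    ("bo", "/a"), ("bo", "/c"),
    ("cy", "/a"),
]

top_users = top_count(page_views, 0, 2)  # same helper, grouped by user
top_urls = top_count(page_views, 1, 1)   # same helper, grouped by url
print(top_users, top_urls)
```

As with the macro, the two queries differ only in their arguments, so the pipeline is written once.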


Pig macro
• Coming soon – piggybank with pig macros




Writing data flow program
• Writing a complex data pipeline is an iterative process

  [Diagram: data flow with stages Load, Transform, Load, Join, Group, Transform, Filter]




Writing data flow program


  [Diagram: data flow with stages Load, Transform, Load, Join, Group, Transform, Filter, ending in "No output!"]




Writing data flow program
• Debug!

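  [Diagram: data flow with stages Load, Transform, Load, Join, Group, Transform, Filter, annotated with questions]
  – At Join: Was join on wrong attributes?
  – At Transform: Bug in transform?
  – At Filter: Did filter drop everything?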



Common approaches to debug
• Running on real (large) data
  – Inefficient, takes longer
• Running on (small) samples
  – Empty results on join, selective filters




Pig illustrate command
• Objective: show example inputs and outputs for each statement that
  are
  – Realistic
  – Complete
  – Concise
  – Generated fast
• Steps
  – Downstream – sample and process
  – Prune
  – Upstream – generate realistic missing classes of examples
  – Prune
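
The steps above can be sketched for a single FILTER statement as follows. This is a toy illustration of the idea only, not Pig's actual ILLUSTRATE implementation: the function name, the "first rows as sample" shortcut, and the caller-supplied `synthesize` helper are all assumptions.

```python
def illustrate_filter(rows, predicate, synthesize, sample_size=3):
    """Return (input_examples, output_examples) for one filter statement."""
    # Downstream: take a small sample and run it through the statement.
    sample = rows[:sample_size]
    passing = [r for r in sample if predicate(r)]
    failing = [r for r in sample if not predicate(r)]
    # Prune: one example per behaviour (kept, dropped) keeps the result concise.
    examples = passing[:1] + failing[:1]
    # Upstream: if no sampled row passes, fabricate a realistic one from a real row.
    if not passing and sample:
        examples.append(synthesize(sample[0]))
    return examples, [r for r in examples if predicate(r)]


# Made-up (url, load_time) rows; the filter keeps very slow pages only.
rows = [("/a", 10), ("/b", 20), ("/c", 30)]
inputs, outputs = illustrate_filter(
    rows,
    predicate=lambda r: r[1] > 100,
    synthesize=lambda r: (r[0], 101),  # tweak a real row so it passes the filter
)
print(inputs, outputs)
```

With a plain sample the filter output would be empty; the synthesized row is what makes the example complete.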


Illustrate command demo




Pig relation-as-scalar
• In Pig, each statement alias is a relation
   – A relation is a set of records
• Task: Get the list of pages whose load time was more
  than average.
• Steps
   1.  Compute average load time
   2.  Get list of pages whose load time is > average




Pig relation-as-scalar
• Step 1 is like
  .. = load ..
  ..= group ..
  al_rel = foreach .. AVG(ltime) as avg_ltime;


• Step 2 looks like
   page_views = load 'pviews.txt' as
                               (url, ltime, ..);

   slow_views = filter page_views by
                         ltime > avg_ltime




Pig relation-as-scalar
• Getting results of step 1 (avg_ltime)
   – Join result of step 1 with page_views relation, or
   – Write result into file, then use udf to read from file
• Pig scalar feature now simplifies this:
   slow_views = filter page_views by
                         ltime > al_rel.avg_ltime


   – Runtime exception if al_rel has more than one record.
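
A small plain-Python sketch of the scalar idea: a one-record relation is read as a single value, and anything larger is rejected. The helper name `as_scalar`, the dict-based records, the sample rows, and the choice to return `None` for an empty relation are assumptions for illustration.

```python
def as_scalar(relation, field):
    """Return `field` of the single record in `relation`; fail if there are several."""
    if len(relation) > 1:
        raise RuntimeError("scalar has more than one row")
    return relation[0][field] if relation else None


# Made-up page view records.
page_views = [
    {"url": "/a", "ltime": 100},
    {"url": "/b", "ltime": 300},
    {"url": "/c", "ltime": 500},
]

# Step 1: a relation holding exactly one record, the average load time.
al_rel = [{"avg_ltime": sum(r["ltime"] for r in page_views) / len(page_views)}]

# Step 2: use that relation as a scalar inside the filter.
avg = as_scalar(al_rel, "avg_ltime")
slow_views = [r for r in page_views if r["ltime"] > avg]
print(slow_views)
```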




UDF in Scripting Language
• Benefit
   – Use legacy code
   – Use library in scripting language
   – Leverage Hadoop for non-Java programmers
• Currently supported languages
   – Python
   – JavaScript
   – Ruby
• Extensible Interface
   – Minimum effort to support another language



Writing a Jython UDF
Write a Jython UDF

@outputSchema("word:chararray")
def concat(word):
  return word + word

@outputSchemaFunction("squareSchema")
def square(num):
  if num == None:
      return None
  return ((num)*(num))

def squareSchema(input):
  return input

•  Invoke Jython UDF when needed
•  Type conversion
    –  Simple type
    –  Python Array <-> Pig Bag
    –  Python Dict <-> Pig Map
    –  Python Tuple <-> Pig Tuple
•  Convey schema to Pig
    –  outputSchema
    –  outputSchemaFunction

register 'util.py' using jython as util;

B = foreach A generate util.square(i);
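
Because these UDFs are ordinary Python functions, their logic can be exercised outside Pig. The sketch below defines a stand-in `outputSchema` decorator so the functions load in a plain Python 3 interpreter; the stand-in simply stores the schema string on the function, which is an assumption about behaviour made for local testing and not a copy of Pig's own decorator.

```python
def outputSchema(schema):
    """Stand-in decorator for local testing: records the schema string on the function."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("word:chararray")
def concat(word):
    return word + word


def square(num):
    # Pig passes nulls as None, so the UDF has to tolerate them.
    if num is None:
        return None
    return num * num


print(concat("ab"), concat.outputSchema, square(3), square(None))
```

Inside Pig the decorator is supplied by the runtime, so the stand-in definition would be left out of the registered script.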

Use NLTK in Pig
• Example
   register 'nltk_util.py' using jython as nltk;
   ……
   B = foreach A generate nltk.tokenize(sentence);

 nltk_util.py
   import nltk
   porter = nltk.PorterStemmer()
   @outputSchema("words:{(word:chararray)}")
   def tokenize(sentence):
     tokens = nltk.word_tokenize(sentence)
     words = [porter.stem(t) for t in tokens]
     return words



Writing a Script Engine
Writing a bridge UDF
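class JythonFunction extends EvalFunc<Object> {
   // "function" is the Python function loaded from the script (declaration omitted on the slide)
   public Object exec(Tuple tuple) {
     // Convert the Pig tuple to Python arguments, call the function, convert the result back
     PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
     PyObject result = function.__call__(params);
     return JythonUtils.pythonToPig(result);
   }
   public Schema outputSchema(Schema input) {
     // Read the outputSchema attribute set on the Python function
     PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
     return Utils.getSchemaFromString(outputSchemaDef.toString());
   }
}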




Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig
class JythonScriptEngine extends ScriptEngine {
   public void registerFunctions(String path, String namespace, PigContext pigContext) {
     // Run the script, then register every name it defined as a Pig function
     PythonInterpreter pi = Interpreter.interpreter;
     pi.execfile(path);
     for (PyTuple item : pi.getLocals().items()) {
        String key = item.get(0).toString();   // name defined by the script
        FuncSpec funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "('"
                   + path + "','" + key + "')");
        pigContext.registerFunction(namespace + key, funcspec);
     }
   }
}
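
The registration step can be mimicked in a few lines of plain Python: execute the script, walk the names it defined, and store each callable under a namespaced key. This is a toy analogue under stated assumptions; `register_functions`, the dict-based registry, and the `namespace.name` key format are hypothetical and are not Pig APIs.

```python
def register_functions(source, namespace, registry):
    """Execute `source` and register each function it defines as `namespace.name`."""
    scope = {}
    exec(source, scope)  # plays the role of executing the script file
    for key, value in scope.items():
        # Skip interpreter-provided entries such as __builtins__.
        if callable(value) and not key.startswith("__"):
            registry[namespace + "." + key] = value
    return registry


# A made-up one-function script standing in for util.py.
script = "def square(n):\n    return n * n\n"
registry = register_functions(script, "util", {})
print(sorted(registry), registry["util.square"](4))
```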



Algebraic UDF in JRuby
class Count < AlgebraicPigUdf
   output_schema Schema.long

  def initial t
    t.nil? ? 0 : 1
  end

  def intermed t
    return 0 if t.nil?
    t.flatten.inject(:+)
  end

  def final t
    intermed(t)
  end

end
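
The point of the three-stage shape is that partial results can be combined in any grouping and still give the same answer. A plain-Python sketch of the same count, with hypothetical helper names and made-up records, shows this by splitting the input into chunks of different sizes:

```python
def initial(t):
    """Per-record stage: a null record contributes 0, any other record 1."""
    return 0 if t is None else 1


def intermed(partials):
    """Combine stage: add up partial counts."""
    return sum(partials)


def final(partials):
    """Final stage: for a count, the same as the combine stage."""
    return intermed(partials)


def algebraic_count(records, chunk_size):
    """Count records by combining per-chunk partial counts."""
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    partials = [intermed([initial(t) for t in chunk]) for chunk in chunks]
    return final(partials)


records = ["a", None, "b", "c", None, "d", "e"]
print([algebraic_count(records, size) for size in (1, 2, 3, 7)])
```

Every chunk size yields the same count, which is the property that lets such a function run partly before the final aggregation step.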


Pig Embedding
• Embed Pig inside scripting language
  – Python
  – JavaScript
• Algorithms which cannot complete using one Pig script
  – Iterative algorithm: PageRank, Kmeans, Neural Network, Apriori, etc
  – Parallel execution: Random forest
  – Divide and Conquer
  – Branching
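
What the host language contributes for an iterative algorithm is the control flow: run a step, compare with the previous result, and stop on convergence. The sketch below shows only that driver loop. `run_step` is a stand-in for one launched Pig job (here a Newton step toward the square root of 2), and the tolerance and iteration cap are arbitrary assumptions.

```python
def run_step(value):
    """Stand-in for one Pig job; here, one Newton step toward sqrt(2)."""
    return 0.5 * (value + 2.0 / value)


def iterate(start, tolerance=1e-9, max_iterations=50):
    """Repeat the step until the result stops changing or the cap is reached."""
    current = start
    for iteration in range(1, max_iterations + 1):
        updated = run_step(current)
        if abs(updated - current) < tolerance:
            return updated, iteration
        current = updated
    return current, max_iterations


result, iterations = iterate(1.0)
print(result, iterations)
```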




Pig Embedding
from org.apache.pig.scripting import Pig

input = ":INPATH:/singlefile/studenttab10k"

# Compile Pig Script
P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")

# Bind Variables
Q = P.bind({'in':input})

# Launch Pig Script
result = Q.runSingle()

if result.isSuccessful():
    print "Pig job PASSED"
else:
    raise Exception("Pig job FAILED")
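
The textual effect of binding can be approximated in plain Python 3 with `string.Template`, which uses the same `$name` placeholder style. This only illustrates what the script text looks like after substitution; it is an assumption-laden stand-in and says nothing about how Pig implements binding.

```python
from string import Template

# Same script text and input path as on the slide.
script = Template("A = load '$in' as (name, age, gpa); store A into 'output';")
bound = script.substitute({"in": ":INPATH:/singlefile/studenttab10k"})
print(bound)
```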



Pig Embedding
 • Running an embedded Pig script
    pig sample.py
 • What happens within Pig?
    [Diagram: sample.py → (Python script) → Pig → (Python script) → Jython → (Pig script) → Pig]




Nested Operator
• Nested Operator: Operator inside foreach
  B = group A by name;
  C = foreach B {
    C0 = limit A 10;
    generate C0;
  }


• Before Pig 0.10, the supported nested operators were
  – DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
  – CROSS, FOREACH
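
The nested LIMIT example at the top of this slide can be read as "group, then trim each group's bag". A plain-Python sketch with made-up rows and hypothetical helper names:

```python
from collections import defaultdict


def group_by(records, key_index):
    """Collect records into one bag (list) per key, like GROUP."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec)
    return dict(groups)


def nested_limit(groups, n):
    """Keep at most n records in each group's bag, like a LIMIT nested in FOREACH."""
    return {key: bag[:n] for key, bag in groups.items()}


A = [("ann", 1), ("ann", 2), ("ann", 3), ("bo", 4)]
C = nested_limit(group_by(A, 0), 2)
print(C)
```

The sketch keeps the first n rows of each bag; which rows an unordered LIMIT keeps in Pig should not be relied on.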



Nested Cross/Foreach
A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
B = LOAD 'votertab10k' as (name:chararray, age:int, registration,
         contributions:double);
C = cogroup A by name, B by name;
D = foreach C {
   C1 = filter A by gpa > 4;
   C2 = filter B by contributions > 500;
   C3 = cross C1, C2;
   C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'),
                                   (chararray)contributions);
   generate flatten(C4);
}
store D into 'output';
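
For readers who want to trace what this script computes, here is a plain-Python sketch of the same per-name filter, cross, and concatenate steps. The sample rows and the function name are made up, and the string formatting of the numbers is an assumption rather than a statement about Pig's casts.

```python
from itertools import product

# Made-up rows: (name, age, gpa) and (name, age, registration, contributions).
students = [("ann", 20, 4.5), ("ann", 20, 3.0), ("bo", 22, 4.2)]
voters = [("ann", 20, "dem", 600.0), ("ann", 20, "dem", 100.0), ("bo", 22, "rep", 50.0)]


def nested_cross(students, voters):
    """Per name: filter both bags, cross them, and emit 'gpa_contributions' strings."""
    names = sorted({s[0] for s in students} | {v[0] for v in voters})
    out = []
    for name in names:
        c1 = [s for s in students if s[0] == name and s[2] > 4]
        c2 = [v for v in voters if v[0] == name and v[3] > 500]
        for s, v in product(c1, c2):  # cross of the two filtered bags
            out.append(f"{s[2]}_{v[3]}")
    return out


D = nested_cross(students, voters)
print(D)
```

A name whose filtered bag is empty on either side contributes nothing, because the cross of an empty bag with anything is empty.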




Misc Loaders
• HBaseStorage
• CassandraStorage
• AvroStorage
• JsonLoader/JsonStorage




New operators to come
• Will be available in Pig 0.11
   – RANK
       – A distributed RANK implementation for Pig

   – CUBE





TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Pig programming is fun

  • 1. Pig programming is more fun: New features in Pig
      Daniel Dai (@daijy), Thejas Nair (@thejasn)
      © Hortonworks Inc. 2011
  • 2. What is Apache Pig?
      – Pig Latin, a high-level data processing language.
      – An engine that executes Pig Latin locally or on a Hadoop cluster.
      Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
  • 3. Pig Latin example
      • Query: Get the list of pages visited by users whose age is between 20 and 25 years.
        users = load 'users' as (name, age);
        users_20_to_25 = filter users by age >= 20 and age <= 25;
        page_views = load 'pages' as (user, url);
        page_views_u20_to_25 = join users_20_to_25 by name, page_views by user;
  • 4. Why Pig?
      • Faster development
          – Fewer lines of code
          – Don't re-invent the wheel
      • Flexible
          – Metadata is optional
          – Extensible
          – Procedural programming
      Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
  • 5. Before Pig 0.9
      (diagram labels: p1.pig, p2.pig, p3.pig)
  • 6. With Pig macros
      (diagram labels: p1.pig, p2.pig, p3.pig, macro1.pig, macro2.pig)
  • 7. With Pig macros
      (diagram labels: p1.pig, p1.pig, rm_bots.pig, get_top.pig)
  • 8. Pig macro example
      • Page_views data: (user_name, url, timestamp, …)
      • Find top 5 users by page views
      • Find top 10 most visited pages
  • 9. Pig macro example
      Without macros:
        page_views = LOAD ..
        /* get top 5 users by page view */
        u_grp = GROUP .. by uname;
        u_count = FOREACH .. COUNT ..
        ord_u_count = ORDER u_count ..
        top_5_users = LIMIT ordered.. 5;
        DUMP top_5_users;
        /* get top 10 urls by page view */
        url_grp = GROUP .. by url;
        url_count = FOREACH .. COUNT .
        ord_url_count = ORDER url_count..
        top_10_urls = LIMIT ord_url.. 10;
        DUMP top_10_urls;
      With a macro:
        /* top x macro */
        DEFINE topCount (rel, col, topNum)
        RETURNS top_num_recs {
            grped = GROUP $rel by $col;
            cnt_grp = FOREACH ..COUNT($rel)..
            ord_cnt = ORDER .. by cnt;
            $top_num_recs = LIMIT.. $topNum;
        }
        -----------------------------------------
        page_views = LOAD ..
        /* get top 5 users by page view */
        top_5_users = topCount(page_views, uname, 5);
        DUMP top_5_users;
        …
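The logic that the topCount macro factors out (group by a column, count each group, order by count, keep the first N) can be sketched in plain Python. This is only an illustration of the data flow, not Pig code; the function name `top_count` and the sample rows are hypothetical.

```python
from collections import Counter

def top_count(records, col, top_num):
    """Return the top_num most frequent values found at index col, with their counts."""
    # GROUP by the column and COUNT each group
    counts = Counter(rec[col] for rec in records)
    # ORDER by count (descending) and LIMIT to top_num
    return counts.most_common(top_num)

# Hypothetical page_views rows: (uname, url)
page_views = [
    ("alice", "/home"), ("bob", "/home"), ("alice", "/cart"),
    ("carol", "/home"), ("alice", "/home"), ("bob", "/cart"),
]

top_users = top_count(page_views, 0, 2)  # same helper reused for users ...
top_urls = top_count(page_views, 1, 1)   # ... and for urls
```

As with the macro, one definition serves both queries; only the column and the limit change between calls.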
  • 10. Pig macro
      • Coming soon: piggybank with Pig macros
  • 11. Writing data flow program
      • Writing a complex data pipeline is an iterative process
      (diagram nodes: Load, Load, Transform, Join, Group, Transform, Filter)
  • 12. Writing data flow program
      (diagram nodes: Load, Load, Transform, Join, Group, Transform, Filter)
      No output! :(
  • 13. Writing data flow program
      • Debug!
      (diagram nodes: Load, Load, Transform, Join, Group, Transform, Filter)
          – Was join on wrong attributes?
          – Bug in transform?
          – Did filter drop everything?
  • 14. Common approaches to debug
      • Running on real (large) data
          – Inefficient, takes longer
      • Running on (small) samples
          – Empty results on join, selective filters
  • 15. Pig illustrate command
      • Objective: show examples for the input/output of each statement that are
          – Realistic
          – Complete
          – Concise
          – Generated fast
      • Steps
          – Downstream: sample and process
          – Prune
          – Upstream: generate realistic missing classes of examples
          – Prune
  • 16. Illustrate command demo
  • 17. Pig relation-as-scalar
      • In Pig each statement alias is a relation
          – A relation is a set of records
      • Task: Get list of pages whose load time was more than average.
      • Steps
          1. Compute average load time
          2. Get list of pages whose load time is > average
  • 18. Pig relation-as-scalar
      • Step 1 is like
        .. = load ..
        .. = group ..
        al_rel = foreach .. AVG(ltime) as avg_ltime;
      • Step 2 looks like
        page_views = load 'pviews.txt' as (url, ltime, ..);
        slow_views = filter page_views by ltime > avg_ltime
  • 19. Pig relation-as-scalar
      • Getting the result of step 1 (avg_ltime) into step 2 used to require a workaround
          – Join the result of step 1 with the page_views relation, or
          – Write the result into a file, then use a UDF to read from the file
      • The Pig scalar feature now simplifies this:
        slow_views = filter page_views by ltime > al_rel.avg_ltime;
          – Runtime exception if al_rel has more than one record.
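The semantics of using a one-record relation as a scalar can be sketched in plain Python. The helper name `as_scalar`, the error message, and the sample rows are assumptions made for illustration; the sketch does not model what Pig does when the relation is empty.

```python
def as_scalar(relation, field):
    """Use a one-record relation as a scalar value; fail if it has more than one record."""
    if len(relation) > 1:
        # Mirrors the runtime exception described on the slide
        raise RuntimeError("relation used as scalar has more than one record")
    return relation[0][field]

# Hypothetical page_views rows
page_views = [
    {"url": "/a", "ltime": 100},
    {"url": "/b", "ltime": 300},
    {"url": "/c", "ltime": 500},
]

# Step 1: a relation holding a single record with the average load time
total = sum(r["ltime"] for r in page_views)
al_rel = [{"avg_ltime": total / float(len(page_views))}]

# Step 2: filter against the scalar, like "ltime > al_rel.avg_ltime"
avg_ltime = as_scalar(al_rel, "avg_ltime")
slow_views = [r for r in page_views if r["ltime"] > avg_ltime]
```

No join and no intermediate file are needed; the single record is read directly where the comparison happens.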
  • 20. UDF in Scripting Language
      • Benefits
          – Use legacy code
          – Use libraries written in the scripting language
          – Leverage Hadoop for non-Java programmers
      • Currently supported languages
          – Python
          – JavaScript
          – Ruby
      • Extensible interface
          – Minimum effort to support another language
  • 21. Writing a Jython UDF
      • Invoke Jython UDF when needed
      • Type conversion
          – Simple type
          – Python Array <-> Pig Bag
          – Python Dict <-> Pig Map
          – Python Tuple <-> Pig Tuple
      • Convey schema to Pig
          – outputSchema
          – outputSchemaFunction
      Write a Jython UDF (util.py):
        # Fixed output schema declared with a decorator
        @outputSchema("word:chararray")
        def concat(word):
            return word + word

        # Output schema computed by another function
        @outputSchemaFunction("squareSchema")
        def square(num):
            if num is None:
                return None
            return ((num)*(num))

        def squareSchema(input):
            return input
      Invoke it from Pig:
        register 'util.py' using jython as util;
        B = foreach A generate util.square(i);
  • 22. Use NLTK in Pig
      • Example
        register 'nltk_util.py' using jython as nltk;
        ……
        B = foreach A generate nltk.tokenize(sentence);
      nltk_util.py:
        import nltk
        porter = nltk.PorterStemmer()

        @outputSchema("words:{(word:chararray)}")
        def tokenize(sentence):
            tokens = nltk.word_tokenize(sentence)
            words = [porter.stem(t) for t in tokens]
            return words
  • 23. Writing a Script Engine
      Writing a bridge UDF (the declaration of the function field is elided on the slide):
        class JythonFunction extends EvalFunc<Object> {
            // Convert the Pig tuple to Python arguments, call the function, convert the result back
            public Object exec(Tuple tuple) {
                PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
                PyObject result = function.__call__(params);
                return JythonUtils.pythonToPig(result);
            }
            // Read the schema string attached to the Python function
            public Schema outputSchema(Schema input) {
                PyObject outputSchemaDef = function.__findattr__("outputSchema".intern());
                return Utils.getSchemaFromString(outputSchemaDef.toString());
            }
        }
  • 24. Writing a Script Engine
      Register scripting UDF:
        register 'util.py' using jython as util;
      What happens in Pig (simplified):
        class JythonScriptEngine extends ScriptEngine {
            public void registerFunctions(String path, String namespace, PigContext pigContext) {
                PythonInterpreter pi = Interpreter.interpreter;
                pi.execfile(path);
                // Register one bridge UDF per name defined in the script
                for (PyTuple item : pi.getLocals().items()) {
                    String key = item.get(0).toString();
                    FuncSpec funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "('" + path + "','" + key + "')");
                    pigContext.registerFunction(namespace + key, funcspec);
                }
            }
        }
  • 25. Algebraic UDF in JRuby
        class Count < AlgebraicPigUdf
          output_schema Schema.long
          def initial t
            t.nil? ? 0 : 1
          end
          def intermed t
            return 0 if t.nil?
            t.flatten.inject(:+)
          end
          def final t
            intermed(t)
          end
        end
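The three stages of the algebraic Count above can be traced in plain Python. This is a sketch of the idea only: the split into two chunks stands in for records being processed in separate tasks, and the sample data is hypothetical.

```python
def initial(record):
    """Turn one input record into a partial count: 0 for a null record, otherwise 1."""
    return 0 if record is None else 1

def intermed(partials):
    """Combine partial counts into a single partial count."""
    return sum(partials)

def final(partials):
    """Produce the final count from the remaining partial counts."""
    return intermed(partials)

# Hypothetical input, split into two chunks as if handled by two tasks
records = ["a", None, "b", "c", "d"]
chunks = [records[:2], records[2:]]

# Each chunk is reduced to one partial count before the final step
partials = [intermed([initial(r) for r in chunk]) for chunk in chunks]
total = final(partials)
```

Because partial counts can be summed in any grouping, most of the work happens before the data reaches the final step, which is what makes a function suitable for the algebraic interface.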
  • 26. Pig Embedding
      • Embed Pig inside a scripting language
          – Python
          – JavaScript
      • Algorithms which cannot be completed using one Pig script
          – Iterative algorithms: PageRank, K-means, Neural Network, Apriori, etc.
          – Parallel execution: Random forest
          – Divide and Conquer
          – Branching
  • 27. Pig Embedding
        from org.apache.pig.scripting import Pig

        # Compile Pig script
        input = ":INPATH:/singlefile/studenttab10k"
        P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into 'output';""")

        # Bind variables
        Q = P.bind({'in':input})

        # Launch Pig script
        result = Q.runSingle()
        if result.isSuccessful():
            print "Pig job PASSED"
        else:
            raise Exception("Pig job FAILED")
  • 28. Pig Embedding
      • Running an embedded Pig script
        pig sample.py
      • What happens within Pig?
      (diagram labels: Python script sample.py, Pig, Jython, Pig script, Pig)
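The control flow that embedding makes possible for iterative algorithms can be sketched as a driver loop in plain Python. Everything here is hypothetical: `run_step` stands in for binding and running a compiled Pig script once and reading back a single number, and `fake_step` is a toy update (Newton's method for the square root of 2) used only so the loop can run without a cluster.

```python
def iterate_until_converged(run_step, state, tolerance=1e-6, max_iterations=100):
    """Call run_step repeatedly until the state changes by less than tolerance."""
    for iteration in range(1, max_iterations + 1):
        new_state = run_step(state)  # in an embedded script: one Pig run
        if abs(new_state - state) < tolerance:
            return new_state, iteration
        state = new_state
    raise RuntimeError("did not converge within max_iterations")

def fake_step(x):
    """Toy stand-in for one Pig run: a Newton update converging to sqrt(2)."""
    return (x + 2.0 / x) / 2.0

value, iterations = iterate_until_converged(fake_step, 1.0)
```

The loop and the convergence test live in the host language, which is the part a single Pig script cannot express.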
  • 29. Nested Operator
      • Nested operator: an operator inside foreach
        B = group A by name;
        C = foreach B {
            C0 = limit A 10;
            generate C0;
        }
      • Nested operators supported prior to Pig 0.10
          – DISTINCT, FILTER, LIMIT, and ORDER BY
      • New operators added in 0.10
          – CROSS, FOREACH
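What the nested LIMIT does to each group can be sketched in plain Python: group the records, then keep at most N records per group. The helper `group_by` and the sample rows are hypothetical, and the limit is 2 instead of 10 to keep the data small. This sketch keeps the first records in input order, which is a property of the sketch rather than something the slide states about Pig.

```python
from collections import defaultdict
from itertools import islice

def group_by(records, key_index):
    """Collect records into per-key lists, like 'group A by name'."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec)
    return groups

# Hypothetical rows: (name, value)
A = [("ann", 1), ("bob", 2), ("ann", 3), ("ann", 4), ("bob", 5)]

B = group_by(A, 0)
# Nested limit: keep at most 2 records from each group's bag
C = {name: list(islice(bag, 2)) for name, bag in B.items()}
```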
  • 30. Nested Cross/Foreach
        A = LOAD 'studenttab10k' as (name:chararray, age:int, gpa:double);
        B = LOAD 'votertab10k' as (name:chararray, age:int, registration, contributions:double);
        C = cogroup A by name, B by name;
        D = foreach C {
            C1 = filter A by gpa > 4;
            C2 = filter B by contributions > 500;
            C3 = cross C1, C2;
            C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'), (chararray)contributions);
            generate flatten(C4);
        }
        store D into 'output';
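The per-group steps of the nested CROSS and FOREACH can be traced in plain Python. The sample rows are hypothetical (the gpa values are chosen so that the gpa > 4 filter keeps something), and the output strings use Python's number formatting, which may differ from Pig's chararray casts.

```python
from itertools import product

# Hypothetical rows: (name, age, gpa) and (name, age, registration, contributions)
students = [("ann", 20, 4.5), ("ann", 21, 3.0), ("bob", 22, 4.2)]
voters = [("ann", 20, "dem", 600.0), ("ann", 21, "rep", 900.0), ("bob", 22, "ind", 100.0)]

D = []
# cogroup: visit every name that appears in either input
for name in sorted(set(s[0] for s in students) | set(v[0] for v in voters)):
    # Nested filters on each side of the group
    c1 = [s for s in students if s[0] == name and s[2] > 4]
    c2 = [v for v in voters if v[0] == name and v[3] > 500]
    # Nested cross: every pairing of the filtered records within this group
    c3 = product(c1, c2)
    # Nested foreach plus flatten: one output string per pairing
    D.extend(str(s[2]) + "_" + str(v[3]) for s, v in c3)
```

A group contributes no output when either filtered side is empty, as happens for "bob" here.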
  • 31. Misc Loaders
      • HBaseStorage
      • CassandraStorage
      • AvroStorage
      • JsonLoader/JsonStorage
  • 32. New operators to come
      • Will be available in Pig 0.11
          – RANK: a distributed RANK implementation for Pig
          – CUBE