SlideShare une entreprise Scribd logo
1  sur  32
Pig programming is more fun: New features in Pig



Daniel Dai (@daijy)
Thejas Nair (@thejasn)




© Hortonworks Inc. 2011                        Page 1
What is Apache Pig?
  Pig Latin, a high level                                                An engine that
  data processing                                                        executes Pig Latin
  language.                                                              locally or on a
                                                                         Hadoop cluster.




Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

                  Architecting the Future of Big Data
                                                                                              Page 2
                  © Hortonworks Inc. 2011
Pig-latin example
• Query : Get the list of pages visited by users whose age is
  between 20 and 25 years.

users = load users as (name, age);

users_18_to_25 = filter users by age > 20 and age <= 25;

page_views = load pages as (user, url);

page_views_u18_to_25 = join users_18_to_25 by name,
page_views by user;

      Architecting the Future of Big Data
                                                          Page 3
      © Hortonworks Inc. 2011
Why pig ?
• Faster development
  –  Fewer lines of code
  –  Don’t re-invent the wheel

• Flexible
  –  Metadata is optional
  –  Extensible
  –  Procedural programming



         Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

     Architecting the Future of Big Data
                                                                          Page 4
     © Hortonworks Inc. 2011
Before pig 0.9
   p1.pig                           p2.pig   p3.pig




     Architecting the Future of Big Data
                                                      Page 5
     © Hortonworks Inc. 2011
With pig macros
                                  p1.pig           p2.pig   p3.pig

macro1.pig                                                           macro2.pig




             Architecting the Future of Big Data
                                                                           Page 6
             © Hortonworks Inc. 2011
With pig macros
  p1.pig                                   p1.pig   rm_bots.pig




                                                    get_top.pig




     Architecting the Future of Big Data
                                                           Page 7
     © Hortonworks Inc. 2011
Pig macro example
• Page_views data : (user_name, url, timestamp, …)
• Find top 5 users by page views
• Find top 10 most visited pages.




      Architecting the Future of Big Data
                                                     Page 8
      © Hortonworks Inc. 2011
Pig Macro example
page_views = LOAD ..                           /* top x macro */
/* get top 5 users by page view */             DEFINE topCount (rel, col, topNum)
u_grp = GROUP .. by uname;                     RETURNS top_num_recs {
u_count = FOREACH .. COUNT ..                   grped = GROUP $rel by $col;
ord_u_count = ORDER u_count ..                  cnt_grp = FOREACH ..COUNT($rel)..
top_5_users = LIMIT ordered.. 5;                ord_cnt = ORDER .. by cnt;
DUMP top_5_users;                               $top_num_recs = LIMIT.. $topNum;
                                               }
/* get top 10 urls by page view */             -----------------------------------------
url_grp = GROUP .. by url;                     page_views = LOAD ..
url_count = FOREACH .. COUNT .                 /* get top 5 users by page view */
ord_url_count = ORDER url_count..              top_5_users = topCount(page_views,
top_10_urls = LIMIT ord_url.. 10;              uname, 5);
DUMP top_10_urls;                              DUMP top_5_users;
                                               …


         Architecting the Future of Big Data
                                                                                  Page 9
         © Hortonworks Inc. 2011
Pig macro
• Coming soon – piggybank with pig macros




     Architecting the Future of Big Data
                                            Page 10
     © Hortonworks Inc. 2011
Writing data flow program
• Writing a complex data pipeline is an iterative process

     Load                                   Load



   Transform                                Join



                                            Group   Transform   Filter




      Architecting the Future of Big Data
                                                                         Page 11
      © Hortonworks Inc. 2011
Writing data flow program


    Load                                   Load



  Transform                                Join



                                           Group   Transform         Filter


                                                               No output! L




     Architecting the Future of Big Data
                                                                               Page 12
     © Hortonworks Inc. 2011
Writing data flow program
• Debug!

        Load                                   Load


                                                       Was	
  join	
  on	
  
    Transform                                  Join      wrong	
  
                                                         a2ributes?	
  


Bug	
  in	
                                    Group          Transform                    Filter
   transform?	
  

                                                                               Did	
  filter	
  drop	
  
                                                                                    everything?	
  



         Architecting the Future of Big Data
                                                                                                          Page 13
         © Hortonworks Inc. 2011
Common approaches to debug
• Running on real (large) data
  – Inefficient, takes longer
• Running on (small) samples
  – Empty results on join, selective filters




      Architecting the Future of Big Data
                                               Page 14
      © Hortonworks Inc. 2011
Pig illustrate command
• Objective- Show examples for i/o of each statement that
  are
  – Realistic
  – Complete
  – Concise
  – Generated fast
• Steps
  – Downstream – sample and process
  – Prune
  – Upstream – generate realistic missing classes of examples
  – Prune


      Architecting the Future of Big Data
                                                           Page 15
      © Hortonworks Inc. 2011
Illustrate command demo




   Architecting the Future of Big Data
                                         Page 16
   © Hortonworks Inc. 2011
Pig relation-as-scalar
• In pig each statement alias is a relation
   – Relation is a set of records
• Task: Get list of pages whose load time was more
  than average.
• Steps
   1.  Compute average load time
   2.  Get list of pages whose load time is > average




      Architecting the Future of Big Data
                                                        Page 17
      © Hortonworks Inc. 2011
Pig relation-as-scalar
• Step 1 is like
  .. = load ..!
  ..= group ..!
  al_rel = foreach .. AVG(ltime) as avg_ltime;!


• Step 2 looks like
   page_views = load ‘pviews.txt’ as !
                               (url, ltime, ..);!
   !
   slow_views = filter page_views by !
                         ltime > avg_ltime!




       Architecting the Future of Big Data
                                                    Page 18
       © Hortonworks Inc. 2011
Pig relation-as-scalar
• Getting results of step 1 (average_gpa)
   – Join result of step 1 with students relation, or
   – Write result into file, then use udf to read from file
• Pig scalar feature now simplifies this-
   slow_views = filter page_views by !
                         ltime > al_rel.avg_ltime!


   – Runtime exception if al_rel has more than one record.




       Architecting the Future of Big Data
                                                              Page 19
       © Hortonworks Inc. 2011
UDF in Scripting Language
• Benefit
   – Use legacy code
   – Use library in scripting language
   – Leverage Hadoop for non-Java programmer
• Currently supported language
   – Python
   – JavaScript
   – Ruby
• Extensible Interface
   – Minimum effort to support another language



      Architecting the Future of Big Data
                                                  Page 20
      © Hortonworks Inc. 2011
Writing a Jython UDF
Write a Jython UDF                             •  Invoke Jython UDF when
                                                  needed
@outputSchema("word:chararray")                •  Type conversion
def concat(word):                                  –  Simple type
  return word + word                               –  Python Array <-> Pig Bag
                                                   –  Python Dict <-> Pig Map
                                                   –  Pyton Tuple <-> Pig Tuple

@outputSchemaFunction("squareSchema")          •  Convey schema to Pig
def square(num):                                   –  outputSchema
                                                   –  outputSchemaFunction
  if num == None:
      return None                              register 'util.py' using jython as util;
  return ((num)*(num))
                                               B = foreach A generate util.square
def squareSchema(input):                       (i));
  return input

         Architecting the Future of Big Data
                                                                                  Page 21
         © Hortonworks Inc. 2011
Use NLTK in Pig
• Example
   register ’nltk_util.py' using jython as nltk;
   ……
   B = foreach A generate nltk.tokenize(sentence)

 nltk_util.py
   import nltk
   porter = nltk.PorterStemmer()
   @outputSchema("words:{(word:chararray)}")
   def tokenize(sentence):
     tokens = nltk.word_tokenize(sentence)
     words = [porter.stem(t) for t in tokens]
     return words



      Architecting the Future of Big Data
                                                    Page 22
      © Hortonworks Inc. 2011
Writing a Script Engine
Writing a bridge UDF
class JythonFunction extends EvalFunc<Object> {
   public Object exec(Tuple tuple) {
     PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
     PyObject result = function.__call__(params);
     return JythonUtils.pythonToPig(result);
   }
   public Schema outputSchema(Schema input) {
     PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
     return Utils.getSchemaFromString(outputSchemaDef.toString());
   }
}




        Architecting the Future of Big Data
                                                                            Page 23
        © Hortonworks Inc. 2011
Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig
class JythonScriptEngine extends ScriptEngine {
   public void registerFunctions(String path, String namespace, PigContext
pigContext) {
     PythonInterpreter pi = Interpreter.interpreter;
     pi.execfile(path);
     for (PyTuple item : pi.getLocals().items())
        funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "('"
                   + path + "','" + item. get(0)+"')");
        pigContext.registerFunction(namespace + key, funcspec);
   }
}



          Architecting the Future of Big Data
                                                                            Page 24
          © Hortonworks Inc. 2011
Algebraic UDF in JRuby
class Count < AlgebraicPigUdf
   output_schema Schema.long

  def initial t
    t.nil? ? 0 : 1
  end

  def intermed t
    return 0 if t.nil?
    t.flatten.inject(:+)
  end

  def final t
    intermed(t)
  end

end


          Architecting the Future of Big Data
                                                Page 25
          © Hortonworks Inc. 2011
Pig Embedding
• Embed Pig inside scripting language
  – Python
  – JavaScript
• Algorithms which cannot complete using one Pig script
  – Iterative algorithm
  PageRank, Kmeans, Neural Network, Apriori, etc
  – Parallel execution
  Random forrest
  – Divide and Conquer
  – Branching




      Architecting the Future of Big Data
                                                          Page 26
      © Hortonworks Inc. 2011
Pig Embedding
from org.apache.pig.scripting import Pig

                                                                             Compile	
  Pig	
  
input= ":INPATH:/singlefile/studenttab10k”
                                                                                Script	
  

P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into ’output';""")

                                               Bind	
  Variables	
  
Q = P.bind({'in':input})

result = Q.runSingle()                         Launch	
  Pig	
  Script	
  

if result.isSuccessful():
    print "Pig job PASSED”
else:
    raise "Pig job FAILED"



         Architecting the Future of Big Data
                                                                                                  Page 27
         © Hortonworks Inc. 2011
Pig Embedding
 • Running embeded Pig script
    pig sample.py
 • What happen within Pig?
                                                                Pig
                                                                Script


             Python                           Python
             Script                           Script
sample.py                            Pig               Jython            Pig




        Architecting the Future of Big Data
                                                                               Page 28
        © Hortonworks Inc. 2011
Nested Operator
• Nested Operator: Operator inside foreach
  B = group A by name;
  C = foreach B {
    C0 = limit A 10;
    generate C0;
  }


• Prior Pig 0.10, supported nested operator
  – DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
  – CROSS, FOREACH



      Architecting the Future of Big Data
                                              Page 29
      © Hortonworks Inc. 2011
Nested Cross/Foreach
A = LOAD ’studenttab10k' as (name:chararray, age:int, gpa:double);
B = LOAD ’votertab10k' as (name:chararray, age:int, registration,
contributions:double);
C = cogroup A by name, B by name;
D = foreach C {
   C1 = filter A by gpa > 4;
   C2 = filter B by contributions > 500;
   C3 = cross C1, C2;
   C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'), (chararray)
contributions);
   generate flatten(C4);
}
store D into ’output'




       Architecting the Future of Big Data
                                                                      Page 30
       © Hortonworks Inc. 2011
Misc Loaders
• HBaseStorage
• CassandraStorage
• AvroStorage
• JsonLoader/JsonStorage




     Architecting the Future of Big Data
                                           Page 31
     © Hortonworks Inc. 2011
New operators to come
• Will be available in Pig 0.11
   – RANK
       – A distributed RANK implementation for Pig

   – CUBE




      Architecting the Future of Big Data
                                                     Page 32
      © Hortonworks Inc. 2011

Contenu connexe

Tendances

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Ian Huston
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesIan Huston
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopDataWorks Summit
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Yu Liu
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...PyData
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsDataWorks Summit
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaEdureka!
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easyVictor Sanchez Anguix
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop User Group
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Word Embedding for Nearest Words
Word Embedding for Nearest WordsWord Embedding for Nearest Words
Word Embedding for Nearest WordsEkaKurniawan40
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performanceDataWorks Summit
 

Tendances (20)

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for Hadoop
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worlds
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User Group
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Word Embedding for Nearest Words
Word Embedding for Nearest WordsWord Embedding for Nearest Words
Word Embedding for Nearest Words
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 

En vedette

ARCOS PALMARES Y PLANTARES
ARCOS PALMARES Y PLANTARES ARCOS PALMARES Y PLANTARES
ARCOS PALMARES Y PLANTARES alisjow
 
F cube - bits spark presentation
F cube - bits spark presentationF cube - bits spark presentation
F cube - bits spark presentationChetan Misra
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaSpark Summit
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RaySpark Summit
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 

En vedette (9)

ARCOS PALMARES Y PLANTARES
ARCOS PALMARES Y PLANTARES ARCOS PALMARES Y PLANTARES
ARCOS PALMARES Y PLANTARES
 
F cube - bits spark presentation
F cube - bits spark presentationF cube - bits spark presentation
F cube - bits spark presentation
 
Emo spark
Emo sparkEmo spark
Emo spark
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 

Similaire à Pig programming is fun

Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pigRavi Mutyala
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Mac Moore
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGskumpf
 
Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Mac Moore
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoopHortonworks
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applicationsrussell_jurney
 
Cloud Foundry Bootcamp
Cloud Foundry BootcampCloud Foundry Bootcamp
Cloud Foundry BootcampJoshua Long
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_pointsAdam Muise
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitDataWorks Summit
 
Apache Hadoop on the Open Cloud
Apache Hadoop on the Open CloudApache Hadoop on the Open Cloud
Apache Hadoop on the Open CloudHortonworks
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramSkillspeed
 
Enterprise IIoT Edge Processing with Apache NiFi
Enterprise IIoT Edge Processing with Apache NiFiEnterprise IIoT Edge Processing with Apache NiFi
Enterprise IIoT Edge Processing with Apache NiFiTimothy Spann
 
The Three Stages of Cloud Adoption - RightScale Compute 2013
The Three Stages of Cloud Adoption - RightScale Compute 2013The Three Stages of Cloud Adoption - RightScale Compute 2013
The Three Stages of Cloud Adoption - RightScale Compute 2013RightScale
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With SparkShivaji Dutta
 
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVis
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVisProcess and Visualize Your Data with Revolution R, Hadoop and GoogleVis
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVisHortonworks
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopPOSSCON
 

Similaire à Pig programming is fun (20)

Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
 
Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
 
OOP 2014
OOP 2014OOP 2014
OOP 2014
 
Cloud Foundry Bootcamp
Cloud Foundry BootcampCloud Foundry Bootcamp
Cloud Foundry Bootcamp
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Internet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop SummitInternet of Things Crash Course Workshop at Hadoop Summit
Internet of Things Crash Course Workshop at Hadoop Summit
 
Apache Hadoop on the Open Cloud
Apache Hadoop on the Open CloudApache Hadoop on the Open Cloud
Apache Hadoop on the Open Cloud
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x Program
 
Enterprise IIoT Edge Processing with Apache NiFi
Enterprise IIoT Edge Processing with Apache NiFiEnterprise IIoT Edge Processing with Apache NiFi
Enterprise IIoT Edge Processing with Apache NiFi
 
Ruby and R
Ruby and RRuby and R
Ruby and R
 
The Three Stages of Cloud Adoption - RightScale Compute 2013
The Three Stages of Cloud Adoption - RightScale Compute 2013The Three Stages of Cloud Adoption - RightScale Compute 2013
The Three Stages of Cloud Adoption - RightScale Compute 2013
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With Spark
 
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVis
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVisProcess and Visualize Your Data with Revolution R, Hadoop and GoogleVis
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVis
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Dernier (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

Pig programming is fun

  • 1. Pig programming is more fun: New features in Pig Daniel Dai (@daijy) Thejas Nair (@thejasn) © Hortonworks Inc. 2011 Page 1
  • 2. What is Apache Pig? Pig Latin, a high level An engine that data processing executes Pig Latin language. locally or on a Hadoop cluster. Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/ Architecting the Future of Big Data Page 2 © Hortonworks Inc. 2011
  • 3. Pig-latin example • Query : Get the list of pages visited by users whose age is between 20 and 25 years. users = load users as (name, age); users_18_to_25 = filter users by age > 20 and age <= 25; page_views = load pages as (user, url); page_views_u18_to_25 = join users_18_to_25 by name, page_views by user; Architecting the Future of Big Data Page 3 © Hortonworks Inc. 2011
  • 4. Why pig ? • Faster development –  Fewer lines of code –  Don’t re-invent the wheel • Flexible –  Metadata is optional –  Extensible –  Procedural programming Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/ Architecting the Future of Big Data Page 4 © Hortonworks Inc. 2011
  • 5. Before pig 0.9 p1.pig p2.pig p3.pig Architecting the Future of Big Data Page 5 © Hortonworks Inc. 2011
  • 6. With pig macros p1.pig p2.pig p3.pig macro1.pig macro2.pig Architecting the Future of Big Data Page 6 © Hortonworks Inc. 2011
  • 7. With pig macros p1.pig p1.pig rm_bots.pig get_top.pig Architecting the Future of Big Data Page 7 © Hortonworks Inc. 2011
  • 8. Pig macro example • Page_views data : (user_name, url, timestamp, …) • Find top 5 users by page views • Find top 10 most visited pages. Architecting the Future of Big Data Page 8 © Hortonworks Inc. 2011
  • 9. Pig Macro example page_views = LOAD .. /* top x macro */ /* get top 5 users by page view */ DEFINE topCount (rel, col, topNum) u_grp = GROUP .. by uname; RETURNS top_num_recs { u_count = FOREACH .. COUNT .. grped = GROUP $rel by $col; ord_u_count = ORDER u_count .. cnt_grp = FOREACH ..COUNT($rel).. top_5_users = LIMIT ordered.. 5; ord_cnt = ORDER .. by cnt; DUMP top_5_users; $top_num_recs = LIMIT.. $topNum; } /* get top 10 urls by page view */ ----------------------------------------- url_grp = GROUP .. by url; page_views = LOAD .. url_count = FOREACH .. COUNT . /* get top 5 users by page view */ ord_url_count = ORDER url_count.. top_5_users = topCount(page_views, top_10_urls = LIMIT ord_url.. 10; uname, 5); DUMP top_10_urls; DUMP top_5_users; … Architecting the Future of Big Data Page 9 © Hortonworks Inc. 2011
  • 10. Pig macro • Coming soon – piggybank with pig macros Architecting the Future of Big Data Page 10 © Hortonworks Inc. 2011
  • 11. Writing data flow program • Writing a complex data pipeline is an iterative process Load Load Transform Join Group Transform Filter Architecting the Future of Big Data Page 11 © Hortonworks Inc. 2011
  • 12. Writing data flow program Load Load Transform Join Group Transform Filter No output! L Architecting the Future of Big Data Page 12 © Hortonworks Inc. 2011
  • 13. Writing data flow program • Debug! Load Load Was  join  on   Transform Join wrong   a2ributes?   Bug  in   Group Transform Filter transform?   Did  filter  drop   everything?   Architecting the Future of Big Data Page 13 © Hortonworks Inc. 2011
  • 14. Common approaches to debug • Running on real (large) data – Inefficient, takes longer • Running on (small) samples – Empty results on join, selective filters Architecting the Future of Big Data Page 14 © Hortonworks Inc. 2011
  • 15. Pig illustrate command • Objective- Show examples for i/o of each statement that are – Realistic – Complete – Concise – Generated fast • Steps – Downstream – sample and process – Prune – Upstream – generate realistic missing classes of examples – Prune Architecting the Future of Big Data Page 15 © Hortonworks Inc. 2011
  • 16. Illustrate command demo Architecting the Future of Big Data Page 16 © Hortonworks Inc. 2011
  • 17. Pig relation-as-scalar • In pig each statement alias is a relation – Relation is a set of records • Task: Get list of pages whose load time was more than average. • Steps 1.  Compute average load time 2.  Get list of pages whose load time is > average Architecting the Future of Big Data Page 17 © Hortonworks Inc. 2011
  • 18. Pig relation-as-scalar • Step 1 is like .. = load ..! ..= group ..! al_rel = foreach .. AVG(ltime) as avg_ltime;! • Step 2 looks like page_views = load ‘pviews.txt’ as ! (url, ltime, ..);! ! slow_views = filter page_views by ! ltime > avg_ltime! Architecting the Future of Big Data Page 18 © Hortonworks Inc. 2011
  • 19. Pig relation-as-scalar • Getting results of step 1 (average_gpa) – Join result of step 1 with students relation, or – Write result into file, then use udf to read from file • Pig scalar feature now simplifies this- slow_views = filter page_views by ! ltime > al_rel.avg_ltime! – Runtime exception if al_rel has more than one record. Architecting the Future of Big Data Page 19 © Hortonworks Inc. 2011
  • 20. UDF in Scripting Language • Benefit – Use legacy code – Use library in scripting language – Leverage Hadoop for non-Java programmer • Currently supported language – Python – JavaScript – Ruby • Extensible Interface – Minimum effort to support another language Architecting the Future of Big Data Page 20 © Hortonworks Inc. 2011
  • 21. Writing a Jython UDF Write a Jython UDF •  Invoke Jython UDF when needed @outputSchema("word:chararray") •  Type conversion def concat(word): –  Simple type return word + word –  Python Array <-> Pig Bag –  Python Dict <-> Pig Map –  Pyton Tuple <-> Pig Tuple @outputSchemaFunction("squareSchema") •  Convey schema to Pig def square(num): –  outputSchema –  outputSchemaFunction if num == None: return None register 'util.py' using jython as util; return ((num)*(num)) B = foreach A generate util.square def squareSchema(input): (i)); return input Architecting the Future of Big Data Page 21 © Hortonworks Inc. 2011
  • 22. Use NLTK in Pig • Example register ’nltk_util.py' using jython as nltk; …… B = foreach A generate nltk.tokenize(sentence) nltk_util.py import nltk porter = nltk.PorterStemmer() @outputSchema("words:{(word:chararray)}") def tokenize(sentence): tokens = nltk.word_tokenize(sentence) words = [porter.stem(t) for t in tokens] return words Architecting the Future of Big Data Page 22 © Hortonworks Inc. 2011
  • 23. Writing a Script Engine Writing a bridge UDF class JythonFunction extends EvalFunc<Object> { public Object exec(Tuple tuple) { PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray(); PyObject result = function.__call__(params); return JythonUtils.pythonToPig(result); } public Schema outputSchema(Schema input) { PyObject outputSchemaDef = f.__findattr__("outputSchema".intern()); return Utils.getSchemaFromString(outputSchemaDef.toString()); } } Architecting the Future of Big Data Page 23 © Hortonworks Inc. 2011
  • 24. Writing a Script Engine Register scripting UDF register 'util.py' using jython as util; What happens in Pig class JythonScriptEngine extends ScriptEngine { public void registerFunctions(String path, String namespace, PigContext pigContext) { PythonInterpreter pi = Interpreter.interpreter; pi.execfile(path); for (PyTuple item : pi.getLocals().items()) funcspec = new FuncSpec(JythonFunction.class.getCanonicalName() + "('" + path + "','" + item. get(0)+"')"); pigContext.registerFunction(namespace + key, funcspec); } } Architecting the Future of Big Data Page 24 © Hortonworks Inc. 2011
  • 25. Algebraic UDF in JRuby class Count < AlgebraicPigUdf output_schema Schema.long def initial t t.nil? ? 0 : 1 end def intermed t return 0 if t.nil? t.flatten.inject(:+) end def final t intermed(t) end end Architecting the Future of Big Data Page 25 © Hortonworks Inc. 2011
  • 26. Pig Embedding • Embed Pig inside scripting language – Python – JavaScript • Algorithms which cannot complete using one Pig script – Iterative algorithm PageRank, Kmeans, Neural Network, Apriori, etc – Parallel execution Random forrest – Divide and Conquer – Branching Architecting the Future of Big Data Page 26 © Hortonworks Inc. 2011
  • 27. Pig Embedding from org.apache.pig.scripting import Pig Compile  Pig   input= ":INPATH:/singlefile/studenttab10k” Script   P = Pig.compile("""A = load '$in' as (name, age, gpa); store A into ’output';""") Bind  Variables   Q = P.bind({'in':input}) result = Q.runSingle() Launch  Pig  Script   if result.isSuccessful(): print "Pig job PASSED” else: raise "Pig job FAILED" Architecting the Future of Big Data Page 27 © Hortonworks Inc. 2011
  • 28. Pig Embedding • Running embeded Pig script pig sample.py • What happen within Pig? Pig Script Python Python Script Script sample.py Pig Jython Pig Architecting the Future of Big Data Page 28 © Hortonworks Inc. 2011
  • 29. Nested Operator • Nested Operator: Operator inside foreach B = group A by name; C = foreach B { C0 = limit A 10; generate C0; } • Prior Pig 0.10, supported nested operator – DISTINCT, FILTER, LIMIT, and ORDER BY • New operators added in 0.10 – CROSS, FOREACH Architecting the Future of Big Data Page 29 © Hortonworks Inc. 2011
  • 30. Nested Cross/Foreach A = LOAD ’studenttab10k' as (name:chararray, age:int, gpa:double); B = LOAD ’votertab10k' as (name:chararray, age:int, registration, contributions:double); C = cogroup A by name, B by name; D = foreach C { C1 = filter A by gpa > 4; C2 = filter B by contributions > 500; C3 = cross C1, C2; C4 = foreach C3 generate CONCAT(CONCAT((chararray)gpa, '_'), (chararray) contributions); generate flatten(C4); } store D into ’output' Architecting the Future of Big Data Page 30 © Hortonworks Inc. 2011
  • 31. Misc Loaders • HBaseStorage • CassandraStorage • AvroStorage • JsonLoader/JsonStorage Architecting the Future of Big Data Page 31 © Hortonworks Inc. 2011
  • 32. New operators to come • Will be available in Pig 0.11 – RANK – A distributed RANK implementation for Pig – CUBE Architecting the Future of Big Data Page 32 © Hortonworks Inc. 2011