SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
Welcome to Pig
Pig
Pig - Introduction
An engine for executing data flows in parallel on
Hadoop
● Language - Pig Latin
● Pig Engine
● Can be used with or without Hadoop
Pig
Pig - Why do we need it?
● Programmers struggle a lot in writing
MapReduce tasks
● Pig scripts are easier to write and maintain
● Pig provides many relational operators which
are hard to write in MapReduce
Pig
Pig - Use Cases
1. Analyzing Data
2. Iterative processing
3. Batch Processing
4. ETL - Extract, Transform and Load
Pig
Pig - Philosophy
A. Pigs eat anything
• Data: Relational, nested, or unstructured.
B. Pigs live anywhere
• Parallel processing Language. Not only Hadoop
C. Pigs are domestic animals
• Controllable, provides custom functions in java/python
D. Pigs fly
• Designed for performance
Pig
Pig Latin - A Data Flow Language
Allows describing:
1. How data flows
2. Should be read
3. Processed
4. Stored to multiple outputs in parallel
Complex Workflows
1. Multiple inputs are joined
Pig
1. MapReduce mode
• Access HDFS and Hadoop cluster
• Used in production
2. Local mode
• Access local files and local machine
• Used for testing locally
• Fastens the development
Pig - Modes
Pig
• Login to CloudxLab Linux console.
• Type pig
• Invoke commands in grunt shell to access files in HDFS
• Control hadoop from grunt shell
Pig - MapReduce Mode
Pig
• Login to CloudxLab Linux console.
• Type pig -x local
• Invoke commands in grunt shell to access files in local file system
Pig - Local Mode
Pig
Pig - Data Types
1. int - Signed 32-bit integer - Example - 8
2. long - Signed 64-bit integer - Example - 5L
3. float - 32-bit floating point - Example - 5.5F
4. double - 64-bit floating point - Example - 10.5
5. chararray - character array - Example - ‘CloudxLab’
6. bytearray - blob - Example - Any binary data
7. datetime - Example - 1970-01-01T00:00:00.000+00:00
Pig
Pig - Complex Data Types
http://pig.apache.org/docs/r0.15.0/basic.html#Data+Types+
and+More
Pig
Pig - Relational Operators - LOAD
• divs = LOAD '/data/NYSE_dividends';
• divs = LOAD '/data/NYSE_dividends' USING PigStorage(',');
• divs = LOAD '/data/NYSE_dividends' AS (name: chararray, stock_symbol:
chararray, date: datetime, dividend: float);
Pig
Pig - Store / Dump
STORE
• Stores the data to HDFS and other storages
DUMP
• Prints the value on the screen - print()
• Useful for debugging
Pig
Pig - Lazy Evaluation
• Each processing step results in new relation
○ except for dump and store
• No value or operation is evaluated until the value or the
transformed data is required
• This reduces the repeated calculation
Pig
Pig - Relational Operators - FOREACH
Takes expressions & applies to every record
1. divs = LOAD '/data/NYSE_dividends' AS (name:chararray,
stock_symbol:chararray, date:chararray, dividends:float);
2. values = FOREACH divs GENERATE stock_symbol, dividends;
3. STORE values INTO 'values_1';
4. cat values_1
Pig
Pig - FOREACH - Question
Will below code generate any reducer code?
gain = FOREACH divs GENERATE ticker, val;
DUMP gain;
Pig
Pig - Relational Operators - GROUP
Groups the data in relations based on the keys
(John, 18, 4.0F)
(Mary, 19, 3.8F)
(Bill, 20, 3.9F)
(Joe, 18, 3.8F)
DESCRIBE A;
A: {
name: chararray,
age: int,
gpa: float
}
A
(18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
(19, {(Mary, 19, 3.8F)})
(20, {(Bill, 20, 3.9F)})
B = GROUP A BY age;
DESCRIBE B;
B: {
group: int,
A: {
name: chararray,
age: int,
gpa: float
}
}
Pig
Pig - Relational Operators - FILTER
● divs = LOAD '/data/NYSE_dividends' AS (exchange: chararray, symbol:
chararray, date: datetime, dividends: float);
● startswithcm = FILTER divs BY symbol matches 'CM.*';
Pig
Pig - More Operators
http://pig.apache.org/docs/r0.15.0/basic.html
Pig
Pig - Hands-on - Average Dividend
Exchange Symbol Date Dividends
NYSE CPO 2009-12-30 0.11
NYSE CPO 2009-09-28 0.12
NYSE CPO 2009-06-26 0.13
NYSE CCS 2009-10-28 0.41
NYSE CCS 2009-04-29 0.43
NYSE CIF 2009-12-09 .029
NYSE CIF 2009-12-09 .028
hdfs: /data/NYSE_dividends
CPO 0.14
CCS 0.41
CIF .0214
Exchange - chararray
Stock symbol - chararray
Date - chararray
Dividends - float
Average
Pig
Pig - Hands-on - Average Dividend
1. divs = LOAD '/data/NYSE_dividends' AS (exchange, stock_symbol, date,
dividends);
2. grped = GROUP divs BY stock_symbol;
3. DUMP grped;
4. avged = FOREACH grped GENERATE group, AVG(divs.dividends);
5. STORE avged INTO 'avged';
6. cat avged/part-r-00000
Exchange - chararray
Stock symbol - chararray
Date - chararray
Dividends - float
Pig
Pig - Summary
● Basics
● Execution modes
● Data types
● Relational operators - FOREACH, GROUP, FILTER
● Hands-on demo

Contenu connexe

Tendances

Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
Mitsuharu Hamba
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)
Takrim Ul Islam Laskar
 

Tendances (19)

Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Introduction to scoop and its functions
Introduction to scoop and its functionsIntroduction to scoop and its functions
Introduction to scoop and its functions
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Advanced topics in hive
Advanced topics in hiveAdvanced topics in hive
Advanced topics in hive
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem
 
Simplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in HadoopSimplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in Hadoop
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 

Similaire à Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab

power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
bhargavi804095
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Viswanath Gangavaram
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 

Similaire à Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab (20)

power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
PigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptxPigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptx
 
PigHive.pptx
PigHive.pptxPigHive.pptx
PigHive.pptx
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Pig workshop
Pig workshopPig workshop
Pig workshop
 
PigHive.pptx
PigHive.pptxPigHive.pptx
PigHive.pptx
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Pig latin
Pig latinPig latin
Pig latin
 
Apache pig
Apache pigApache pig
Apache pig
 
Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤
 
Fuzzing - Part 1
Fuzzing - Part 1Fuzzing - Part 1
Fuzzing - Part 1
 
November 2014 HUG: Apache Pig 0.14
November 2014 HUG: Apache Pig 0.14 November 2014 HUG: Apache Pig 0.14
November 2014 HUG: Apache Pig 0.14
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 

Plus de CloudxLab

Plus de CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. Pig Pig - Introduction An engine for executing data flows in parallel on Hadoop ● Language - Pig Latin ● Pig Engine ● Can be used with or without Hadoop
  • 3. Pig Pig - Why do we need it? ● Programmers struggle a lot in writing MapReduce tasks ● Pig scripts are easier to write and maintain ● Pig provides many relational operators which are hard to write in MapReduce
  • 4. Pig Pig - Use Cases 1. Analyzing Data 2. Iterative processing 3. Batch Processing 4. ETL - Extract, Transform and Load
  • 5. Pig Pig - Philosophy A. Pigs eat anything • Data: Relational, nested, or unstructured. B. Pigs live anywhere • Parallel processing Language. Not only Hadoop C. Pigs are domestic animals • Controllable, provides custom functions in java/python D. Pigs fly • Designed for performance
  • 6. Pig Pig Latin - A Data Flow Language Allows describing: 1. How data flows 2. Should be read 3. Processed 4. Stored to multiple outputs in parallel Complex Workflows 1. Multiple inputs are joined
  • 7. Pig 1. MapReduce mode • Access HDFS and Hadoop cluster • Used in production 2. Local mode • Access local files and local machine • Used for testing locally • Fastens the development Pig - Modes
  • 8. Pig • Login to CloudxLab Linux console. • Type pig • Invoke commands in grunt shell to access files in HDFS • Control hadoop from grunt shell Pig - MapReduce Mode
  • 9. Pig • Login to CloudxLab Linux console. • Type pig -x local • Invoke commands in grunt shell to access files in local file system Pig - Local Mode
  • 10. Pig Pig - Data Types 1. int - Signed 32-bit integer - Example - 8 2. long - Signed 64-bit integer - Example - 5L 3. float - 32-bit floating point - Example - 5.5F 4. double - 64-bit floating point - Example - 10.5 5. chararray - character array - Example - ‘CloudxLab’ 6. bytearray - blob - Example - Any binary data 7. datetime - Example - 1970-01-01T00:00:00.000+00:00
  • 11. Pig Pig - Complex Data Types http://pig.apache.org/docs/r0.15.0/basic.html#Data+Types+ and+More
  • 12. Pig Pig - Relational Operators - LOAD • divs = LOAD '/data/NYSE_dividends'; • divs = LOAD '/data/NYSE_dividends' USING PigStorage(','); • divs = LOAD '/data/NYSE_dividends' AS (name: chararray, stock_symbol: chararray, date: datetime, dividend: float);
  • 13. Pig Pig - Store / Dump STORE • Stores the data to HDFS and other storages DUMP • Prints the value on the screen - print() • Useful for debugging
  • 14. Pig Pig - Lazy Evaluation • Each processing step results in new relation ○ except for dump and store • No value or operation is evaluated until the value or the transformed data is required • This reduces the repeated calculation
  • 15. Pig Pig - Relational Operators - FOREACH Takes expressions & applies to every record 1. divs = LOAD '/data/NYSE_dividends' AS (name:chararray, stock_symbol:chararray, date:chararray, dividends:float); 2. values = FOREACH divs GENERATE stock_symbol, dividends; 3. STORE values INTO 'values_1'; 4. cat values_1
  • 16. Pig Pig - FOREACH - Question Will below code generate any reducer code? gain = FOREACH divs GENERATE ticker, val; DUMP gain;
  • 17. Pig Pig - Relational Operators - GROUP Groups the data in relations based on the keys (John, 18, 4.0F) (Mary, 19, 3.8F) (Bill, 20, 3.9F) (Joe, 18, 3.8F) DESCRIBE A; A: { name: chararray, age: int, gpa: float } A (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)}) (19, {(Mary, 19, 3.8F)}) (20, {(Bill, 20, 3.9F)}) B = GROUP A BY age; DESCRIBE B; B: { group: int, A: { name: chararray, age: int, gpa: float } }
  • 18. Pig Pig - Relational Operators - FILTER ● divs = LOAD '/data/NYSE_dividends' AS (exchange: chararray, symbol: chararray, date: datetime, dividends: float); ● startswithcm = FILTER divs BY symbol matches 'CM.*';
  • 19. Pig Pig - More Operators http://pig.apache.org/docs/r0.15.0/basic.html
  • 20. Pig Pig - Hands-on - Average Dividend Exchange Symbol Date Dividends NYSE CPO 2009-12-30 0.11 NYSE CPO 2009-09-28 0.12 NYSE CPO 2009-06-26 0.13 NYSE CCS 2009-10-28 0.41 NYSE CCS 2009-04-29 0.43 NYSE CIF 2009-12-09 .029 NYSE CIF 2009-12-09 .028 hdfs: /data/NYSE_dividends CPO 0.14 CCS 0.41 CIF .0214 Exchange - chararray Stock symbol - chararray Date - chararray Dividends - float Average
  • 21. Pig Pig - Hands-on - Average Dividend 1. divs = LOAD '/data/NYSE_dividends' AS (exchange, stock_symbol, date, dividends); 2. grped = GROUP divs BY stock_symbol; 3. DUMP grped; 4. avged = FOREACH grped GENERATE group, AVG(divs.dividends); 5. STORE avged INTO 'avged'; 6. cat avged/part-r-00000 Exchange - chararray Stock symbol - chararray Date - chararray Dividends - float
  • 22. Pig Pig - Summary ● Basics ● Execution modes ● Data types ● Relational operators - FOREACH, GROUP, FILTER ● Hands-on demo