Introduction To PIG
The evolution of data processing frameworks
What is PIG?
Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs. Pig compiles those programs into one or more Map/Reduce jobs on the fly.
Why PIG?
Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
File Formats
- PigStorage
- Custom Load / Store Functions
Installing PIG
- Download and unpack the tarball (pig.apache.org)
- Install the RPM / DEB package (cloudera.com)
Running PIG
- Grunt Shell: Enter Pig commands manually using Pig's interactive shell, Grunt.
- Script File: Place Pig commands in a script file and run the script.
- Embedded Program: Embed Pig commands in a host language and run the program.
Run Modes
- Local Mode: To run Pig in local mode, you need access to a single machine.
- Hadoop (mapreduce) Mode: To run Pig in Hadoop (mapreduce) mode, you need access to a Hadoop cluster and an HDFS installation.
Sample PIG script
    A = load 'passwd' using PigStorage(':');
    B = foreach A generate $0 as id;
    store B into 'id.out';
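To make the three statements concrete, here is a rough plain-Java sketch of what the script computes for each input line: split on ':' and keep field $0. It does not use Pig at all. The names `PasswdIds` and `firstFields` are made up for illustration, and unlike Pig it works on an in-memory list rather than a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: mimics LOAD ... USING PigStorage(':') plus
// FOREACH ... GENERATE $0 on an in-memory list instead of a file.
final class PasswdIds {

    // Splits each line on ':' and keeps field $0 (the first field).
    static List<String> firstFields(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            // A limit of -1 keeps trailing empty fields instead of dropping them.
            String[] fields = line.split(":", -1);
            ids.add(fields[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "root:x:0:0:root:/root:/bin/bash",
            "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin");
        // Prints one id per line, similar to the records stored in id.out.
        for (String id : firstFields(lines)) {
            System.out.println(id);
        }
    }
}
```

The Pig version expresses the same projection in one statement and leaves the parallel execution to the framework.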
Sample Script With Schema
    A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
    B = FOREACH A GENERATE myudfs.UPPER(name);
Eval Functions
AVG, CONCAT, COUNT, COUNT_STAR, DIFF, IsEmpty, MAX, MIN, SIZE, SUM, TOKENIZE
Math Functions
ABS, ACOS, ASIN, ATAN, CBRT, CEIL, COSH, COS, EXP, FLOOR, LOG, LOG10, RANDOM, ROUND, SIN, SINH, SQRT, TAN, TANH
Pig Types
Sample CW PIG script
    RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
    input = FOREACH RawInput GENERATE ContextCategoryId as Category, TagId, URL, Impressions;
    GroupedInput = GROUP input BY (Category, TagId, URL);
    result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions;
    STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
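The GROUP and SUM steps can be pictured as building a map from each (Category, TagId, URL) key to a running total. The following is a minimal in-memory sketch of that idea in plain Java, not Pig code. The class `ImpressionTotals`, the `Row` type, and the field types are assumptions made for illustration; the slide does not state the real column types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: an in-memory version of
// GROUP ... BY (Category, TagId, URL) followed by SUM(Impressions).
final class ImpressionTotals {

    // One input row; field names follow the script, types are assumed.
    static final class Row {
        final int category;
        final int tagId;
        final String url;
        final long impressions;

        Row(int category, int tagId, String url, long impressions) {
            this.category = category;
            this.tagId = tagId;
            this.url = url;
            this.impressions = impressions;
        }
    }

    // Returns total impressions per (category, tagId, url) key, in first-seen order.
    static Map<List<Object>, Long> totals(List<Row> rows) {
        Map<List<Object>, Long> sums = new LinkedHashMap<>();
        for (Row r : rows) {
            // A List works as a composite key because it compares element by element.
            List<Object> key = Arrays.<Object>asList(r.category, r.tagId, r.url);
            sums.merge(key, r.impressions, Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(1, 10, "a.com", 5L),
            new Row(1, 10, "a.com", 7L),
            new Row(2, 10, "a.com", 1L));
        System.out.println(totals(rows));
    }
}
```

In Pig the same grouping is spread across mappers and reducers, so the data never has to fit in one process as it does here.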
Sample PIG script (Filtering)
    RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
    input = FOREACH RawInput GENERATE ContextCategoryId as Category, DefLevelId, TagId, URL, Impressions;
    defFilter = FILTER input BY (DefLevelId == 8) or (DefLevelId == 12);
    GroupedInput = GROUP defFilter BY (Category, TagId, URL);
    result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) as Impressions;
    STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
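The FILTER step only changes which rows reach the aggregation. As a small plain-Java sketch of that behavior (not Pig code), the example below keeps rows whose DefLevelId is 8 or 12 and sums impressions over the surviving rows only. The class `DefLevelFilter` and the two-column row layout are assumptions made for illustration.

```java
// Hypothetical illustration only: FILTER ... BY (DefLevelId == 8) or (DefLevelId == 12),
// followed by a sum over the rows that pass the filter.
final class DefLevelFilter {

    // The filter predicate from the script.
    static boolean keep(int defLevelId) {
        return defLevelId == 8 || defLevelId == 12;
    }

    // Each row is {defLevelId, impressions}; rows failing the predicate are skipped.
    static long sumKept(int[][] rows) {
        long total = 0;
        for (int[] row : rows) {
            if (keep(row[0])) {
                total += row[1];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] rows = { {8, 5}, {3, 100}, {12, 7} };
        System.out.println(sumKept(rows));
    }
}
```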
What is PIG UDF?
UDF: User Defined Function
Types of UDFs:
- Eval Functions (extends EvalFunc<String>)
- Aggregate Functions (extends EvalFunc<Long> implements Algebraic)
- Filter Functions (extends FilterFunc)
UDFContext:
- Allows UDFs to get access to the JobConf object.
- Allows UDFs to pass configuration information between instantiations of the UDF on the front end and back end.
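An aggregate function can implement Algebraic when its result can be built in stages: an initial step per record, an intermediate step that merges partial results, and a final step that produces the answer. The sketch below shows that staging for a simple count in plain Java. It is not the Pig interface itself; the class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: the three stages of an algebraic COUNT,
// written as plain static methods rather than Pig classes.
final class AlgebraicCountSketch {

    // Initial stage: every input record contributes a partial count of 1.
    static long initial(Object record) {
        return 1L;
    }

    // Intermediate stage: merges partial counts; safe to apply repeatedly.
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: merges whatever partial counts remain into the result.
    static long finalCount(List<Long> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        // Two groups of records counted separately, then combined.
        long left = intermediate(Arrays.asList(initial("a"), initial("b")));
        long right = intermediate(Arrays.asList(initial("c")));
        System.out.println(finalCount(Arrays.asList(left, right)));
    }
}
```

Because the intermediate stage can run on partial data, work of this shape can be pushed into the combiner before the reduce phase.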
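Sample UDF
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Validator is the author's own helper class; its import is not shown on the slide.
    public class TopLevelDomain extends EvalFunc<String> {
        @Override
        public String exec(Tuple tuple) throws IOException {
            Object o = tuple.get(0);
            if (o == null) {
                return null; // a null input field yields a null result
            }
            return Validator.getTLD(o.toString());
        }
    }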
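The UDF's logic can be tried without a Pig installation by separating it from the Pig types. The sketch below mirrors the null handling of exec and substitutes a stand-in for Validator.getTLD, whose real behavior is not shown in the deck. The stand-in (text after the last dot, lower-cased) and the names `TopLevelDomainSketch` and `getTld` are assumptions for illustration only.

```java
import java.util.Locale;

// Hypothetical illustration only: the UDF's null handling plus an assumed
// stand-in for Validator.getTLD, with no dependency on Pig.
final class TopLevelDomainSketch {

    // Assumed behavior: returns the text after the last '.', lower-cased,
    // or null when the host has no usable suffix.
    static String getTld(String host) {
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return null;
        }
        return host.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Mirrors exec(Tuple): a null field yields a null result.
    static String exec(Object field) {
        if (field == null) {
            return null;
        }
        return getTld(field.toString());
    }

    public static void main(String[] args) {
        System.out.println(exec("www.Example.COM"));
        System.out.println(exec(null));
    }
}
```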
UDF In Action
    REGISTER '$WORK_DIR/pig-support.jar';
    DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();
    AA = foreach input GENERATE TagId, getTopLevelDomain(PublisherDomain) as RootDomain;
Resources
- Apache PIG: http://pig.apache.org/
- Apache Hadoop: http://hadoop.apache.org/
- Cloudera CDH: https://wiki.cloudera.com/display/DOC/CDH3+Installation
PIG DEMO





