Apache Hadoop
Pig Fundamentals

Shashidhar HB
Why Hadoop?
Hadoop in the Cloud Industry
Querying Large Data... Pig to the Rescue
Pig: Why? What? How?
Pig Basics: Install, Configure, Try
Delving Deeper into Pig and Pig Latin
Q&A
You have 10x more DATA
than you did 3 years ago!

BUT do you know 10x
MORE about your
BUSINESS?

NO!
A lot of data, BIG data!

Information
(The Big Picture)

We are not able to effectively store and analyze all the data we have, so we are not able to see the big picture!


BigData / Web Scale: datasets that grow so large that they become awkward to work with using traditional database management tools.

Handling Big Data using a traditional approach is costly and rigid (difficulties include capture, storage, search, sharing, analytics, and visualization).

Google, Yahoo, Facebook, and LinkedIn handle "petabytes" of data every day.

They all use HADOOP to solve their BIG DATA problem.
So Mr. HADOOP says he has a
solution to our BIG problem!


Hadoop is open-source software for RELIABLE and SCALABLE distributed computing.

Hadoop provides a comprehensive solution for handling Big Data.

Hadoop is
 HDFS: a high-availability data storage subsystem
 (http://labs.google.com/papers/gfs.html, 2003)
+
 MapReduce: a parallel processing system
 (http://labs.google.com/papers/mapreduce.html, 2004)


2008: Yahoo! launched Hadoop.

2009: The Hadoop source code was made available to the free world.

2010: Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage.

2011: Facebook announced the data had grown to 30 PB.


Stats: Facebook
▪ Started in 2004: 1 million users
▪ August 2008: Facebook reaches over 100 million active users
▪ Now: 750+ million active users
"Bottom line: more users, more DATA"

The BIG challenge at Facebook!
Using historical data is a very big part of improving the user experience on Facebook, so storing and processing all these bytes is of immense importance.

Facebook tried Hadoop for this.
Hadoop turned out to be a great solution, but there was one little problem!

MapReduce requires skilled Java programmers to write standard MapReduce programs, while most developers are more fluent in querying data using SQL.

"Pig says, No problemo!"


Input: user profiles, page visits

Task: find the top 5 most visited pages by users aged 18-25
1. Users = LOAD 'users' AS (name, age);
2. Filtered = FILTER Users BY age >= 18 AND age <= 25;
3. Pages = LOAD 'pages' AS (user, url);
4. Joined = JOIN Filtered BY name, Pages BY user;
5. Grouped = GROUP Joined BY url;
6. Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
7. Sorted = ORDER Summed BY clicks DESC;
8. Top5 = LIMIT Sorted 5;
9. STORE Top5 INTO 'top5sites';


Pig is a dataflow language
• The language is called Pig Latin
• Pretty simple syntax
• Under the covers, Pig Latin scripts are turned into MapReduce jobs and executed on the cluster

Pig Latin: a high-level procedural language

Pig Engine: parser, optimizer, and distributed query execution
PIG
 Pig is procedural
 Nested relational data model (no constraints on data types)
 Schema is optional
 Scan-centric analytic workloads (no random reads or writes)
 Limited query optimization

SQL
 SQL is declarative
 Flat relational data model (data is tied to a specific data type)
 Schema is required
 OLTP + OLAP workloads
 Significant opportunity for query optimization
PIG

Users = LOAD 'users' AS (name, age, ipaddr);
Clicks = LOAD 'clicks' AS (user, url, value);
ValuableClicks = FILTER Clicks BY value > 0;
UserClicks = JOIN Users BY name, ValuableClicks BY user;
Geoinfo = LOAD 'geoinfo' AS (ipaddr, dma);
UserGeo = JOIN UserClicks BY ipaddr, Geoinfo BY ipaddr;
ByDMA = GROUP UserGeo BY dma;
ValuableClicksPerDMA = FOREACH ByDMA GENERATE group, COUNT(UserGeo);
STORE ValuableClicksPerDMA INTO 'ValuableClicksPerDMA';

SQL

INSERT INTO ValuableClicksPerDMA
SELECT dma, COUNT(*)
FROM geoinfo
JOIN (SELECT name, ipaddr
      FROM users JOIN clicks ON (users.name = clicks.user)
      WHERE value > 0) prepared
ON (geoinfo.ipaddr = prepared.ipaddr)
GROUP BY dma;


Joining datasets

Grouping data

Referring to elements by position rather than name ($0, $1, etc.)

Loading non-delimited data using a custom SerDe (writing a custom reader and writer)

Creation of user-defined functions (UDFs), written in Java

And more...
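For instance, positional references let you work with data before any schema is declared. A small sketch (the file name and field layout here are illustrative, not from the deck):

 -- no AS clause, so fields are referenced as $0, $1, ...
 Users  = LOAD 'users';
 Adults = FILTER Users BY (int)$1 >= 18;   -- $1 is the second field (age), cast to int
 Names  = FOREACH Adults GENERATE $0;      -- project just the first field (name)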
Under the hood

Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop filesystems) from your workstation.

Download a stable release from http://hadoop.apache.org/pig/releases.html and unpack the tarball in a suitable place on your workstation:

% tar xzf pig-x.y.z.tar.gz

It's convenient to add Pig's binary directory to your command-line path. For example:

% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.


Execution Types
 Local mode (pig -x local)
 Hadoop mode
  Pig must be configured to use the cluster's namenode and jobtracker:
  1. Put the Hadoop config directory on the Pig classpath:
     % export PIG_CLASSPATH=$HADOOP_INSTALL/conf/
  2. Create a pig.properties file:
     fs.default.name=hdfs://localhost/
     mapred.job.tracker=localhost:8021


Script: Pig can run a script file that contains Pig commands. For example,
% pig script.pig
runs the commands in the local file "script.pig". Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.

Grunt: Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run and the -e option is not used.
Note: it is also possible to run Pig scripts from within Grunt using run and exec.

Embedded: you can run Pig programs from Java, much as you can use JDBC to run SQL programs from Java. There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig


Pig Latin: A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example:
1. A GROUP operation is a type of statement:
   grunt> grouped_records = GROUP records BY year;
2. The command to list the files in a Hadoop filesystem is another example of a statement:
   ls /
3. A LOAD operation loads data from a tab-separated file into a Pig relation:
   grunt> records = LOAD 'sample.txt'
          AS (year:chararray, temperature:int, quality:int);

Data: In Pig, a single element of data is an atom.
A collection of atoms (such as a row, or a partial row) is a tuple.
Tuples are collected together into bags.

Atom -> Tuple (row/partial row) -> Bag
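To make the atom/tuple/bag hierarchy concrete, here is a sketch (the file contents and the dumped values are illustrative):

 grunt> records = LOAD 'sample.txt'
        AS (year:chararray, temperature:int, quality:int);
 grunt> grouped_records = GROUP records BY year;
 grunt> DUMP grouped_records;
 (1949,{(1949,111,1),(1949,78,1)})
 (1950,{(1950,0,1),(1950,22,1)})

Each field value (1949, 111, ...) is an atom; each parenthesized row is a tuple; the braces hold a bag of tuples, one bag per group key.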
Example contents of 'employee.txt', a tab-delimited text file:

1       Krishna       234000000   none
2       Krishna_01    234000000   none
124163  Shashi        10000       cloud
124164  Gopal         1000000     setlabs
124165  Govind        1000000     setlabs
124166  Ram           450000      es
124167  Madhusudhan   450000      e&r
124168  Hari          6500000     e&r
124169  Sachith       50000       cloud
-- Loading data from employee.txt into the empls relation, with a schema
empls = LOAD 'employee.txt' AS (id:int, name:chararray, salary:double, dept:chararray);

-- Filtering the data as required
rich = FILTER empls BY $2 > 100000;

-- Sorting
sortd = ORDER rich BY salary DESC;

-- Storing the final results
STORE sortd INTO 'rich_employees.txt';

-- Or alternatively we can dump the records on the screen
DUMP sortd;

-- Group by salary
grp = GROUP empls BY salary;

-- Get the count of employees in each salary group
cnt = FOREACH grp GENERATE group, COUNT(empls.id) AS emp_cnt;


To view the schema of a relation:
 DESCRIBE empls;

To view step-by-step execution of a series of statements:
 ILLUSTRATE empls;

To view the execution plan of a relation:
 EXPLAIN empls;

Join two data sets:
 data1 = LOAD 'data1' AS (col1, col2, col3, col4);
 data2 = LOAD 'data2' AS (colA, colB, colC);
 jnd = JOIN data1 BY col3, data2 BY colA PARALLEL 50;
 STORE jnd INTO 'outfile';


Load using PigStorage:
 empls = LOAD 'employee.txt' USING PigStorage('\t')
         AS (id:int, name:chararray, salary:double, dept:chararray);

Store using PigStorage:
 STORE sortd INTO 'rich_employees.txt' USING PigStorage('\t');
Flexibility with PIG
Is that all we can do with PIG!?


Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig.

All UDF names are case-sensitive.


Eval Functions (EvalFunc)
 Ex: StringConcat (built-in): generates the concatenation of the first two fields of a tuple.

Aggregate Functions (EvalFunc & Algebraic)
 Ex: COUNT, AVG (both built-in)

Filter Functions (FilterFunc)
 Ex: IsEmpty (built-in)

Load/Store Functions (LoadFunc / StoreFunc)
 Ex: PigStorage (built-in)

Note: the built-in functions are listed at
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
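As a sketch of these function types in use (reusing the empls relation from the earlier employee example; illustrative only):

 grp    = GROUP empls BY dept;
 -- aggregate functions over the grouped bags
 stats  = FOREACH grp GENERATE group, COUNT(empls), AVG(empls.salary);
 -- an eval function applied per tuple
 labels = FOREACH empls GENERATE CONCAT(name, dept);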


Piggy Bank
 Piggy Bank is a place for Pig users to share their functions.

DataFu (LinkedIn's collection of UDFs)
 A Hadoop library for large-scale data processing.
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
-- myscript.pig
REGISTER myudfs.jar;
-- Note: myudfs.jar should not be surrounded with quotes

A = LOAD 'employee_data' AS (id:int, name:chararray, salary:double, dept:chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;




java -cp pig.jar org.apache.pig.Main -x local myscript.pig
or
pig -x local myscript.pig

Note: myudfs.jar should be on the classpath!

Locating a UDF jar file
 Pig first checks the classpath.
 Then Pig assumes that the location is either an absolute path or a path relative to the location from which Pig was invoked.
Pig Type     Java Class
bytearray    DataByteArray
chararray    String
int          Integer
long         Long
float        Float
double       Double
tuple        Tuple
bag          DataBag
map          Map<Object, Object>
All is well, but what about the performance trade-offs?

[Performance comparison chart. Source: Yahoo]
Q&A
Mail me: shashidhar_hb@infosys.com

That's all folks!

 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Apache Pig

  • 2.        Why Hadoop Hadoop n The Cloud Industry Querying Large Data... Pig to Rescue Pig: Why? What? How? Pig Basics: Install, Configure, Try Dwelling Deeper into Pig-PigLatin Q&A 2
  • 3. You have 10x more DATA Than you did 3 years ago! MORE about your BUSINESS? BUT do you know 10x NO! 3
  • 4. A lot of data, BIG data! Information (The Big Picture) We are not able to effectively store and analyze all the data we have, so we are not able to see the big picture! 5
  • 5.  BigData / Web Scale: datasets that grow so large that they become awkward to work with using traditional database management tools  Handling Big Data with the traditional approach is costly and rigid (difficulties include capture, storage, search, sharing, analytics and visualization)  Google, Yahoo, Facebook and LinkedIn handle “Petabytes” of data every day  They all use HADOOP to solve their BIG DATA problem 6
  • 7. So Mr. HADOOP says he has a solution to our BIG problem! 8
  • 8.  Hadoop is an open-source software for RELIABLE and SCALABLE distributed computing  Hadoop provides a comprehensive solution to handle Big Data  Hadoop is  HDFS : High Availability Data Storage subsystem (http://labs.google.com/papers/gfs.html: 2003) +  MapReduce: Parallel Processing system (http://labs.google.com/papers/mapreduce.html: 2004) 9
  • 9.  2008: Yahoo! Launched Hadoop  2009: Hadoop source code was made available to the free world  2010: Facebook claimed that they have the largest Hadoop cluster in the world with 21 PB of storage  2011: Facebook announced the data has grown to 30 PB 10
  • 10.  Stats: Facebook ▪ Started in 2004: 1 million users ▪ August 2008: Facebook reaches over 100 million active users ▪ Now: 750+ million active users “Bottom line: more users, more DATA”  The BIG challenge at Facebook!! Using historical data is a very big part of improving the user experience on Facebook, so storing and processing all these bytes is of immense importance. Facebook tried Hadoop for this. 11
  • 11. Hadoop turned out to be a great solution, but there was one little problem! 12
  • 12.   MapReduce requires skilled Java programmers to write standard MapReduce programs  Developers are more fluent in querying data using SQL “Pig says, No Problemo!” 13
  • 13.  Input: User profiles, Page visits  Find the top 5 most visited pages by users aged 18-25 14
  • 14. 15
  • 15. 1. Users = LOAD 'users' AS (name, age); 2. Filtered = FILTER Users BY age >= 18 AND age <= 25; 3. Pages = LOAD 'pages' AS (user, url); 4. Joined = JOIN Filtered BY name, Pages BY user; 5. Grouped = GROUP Joined BY url; 6. Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks; 7. Sorted = ORDER Summed BY clicks DESC; 8. Top5 = LIMIT Sorted 5; 9. STORE Top5 INTO 'top5sites'; 16
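Since Pig Latin is just a dataflow over relations, the nine steps above can be sketched in plain Python on toy in-memory data (the user and page rows here are invented for illustration; in Pig they would come from the 'users' and 'pages' files):

```python
# Plain-Python sketch (not Pig) of the top-5-pages dataflow:
# FILTER users aged 18-25, JOIN with page visits, GROUP + COUNT per URL,
# then ORDER + LIMIT to the top 5.
from collections import Counter

users = [("amy", 22), ("bob", 17), ("carl", 25), ("dana", 30)]
pages = [("amy", "a.com"), ("amy", "b.com"), ("carl", "a.com"),
         ("bob", "c.com"), ("dana", "a.com")]

filtered = {name for name, age in users if 18 <= age <= 25}  # FILTER
joined = [url for user, url in pages if user in filtered]    # JOIN
summed = Counter(joined)                                     # GROUP + COUNT
top5 = summed.most_common(5)                                 # ORDER + LIMIT
print(top5)  # [('a.com', 2), ('b.com', 1)]
```

The point of Pig is that it generates the equivalent MapReduce jobs for you, so this logic scales past what fits in one process's memory.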
  • 16.  Pig is a dataflow language • Language is called PigLatin • Pretty simple syntax • Under the covers, PigLatin scripts are turned into MapReduce jobs and executed on the cluster  Pig Latin: High-level procedural language  Pig Engine: Parser, Optimizer and distributed query execution 17
  • 17. PIG vs SQL  Pig is procedural; SQL is declarative  Pig: nested relational data model (no constraints on data types); SQL: flat relational data model (data is tied to a specific data type)  Pig: schema is optional; SQL: schema is required  Pig: scan-centric analytic workloads (no random reads or writes); SQL: OLTP + OLAP workloads  Pig: limited query optimization; SQL: significant opportunity for query optimization 18
  • 18. PIG  Users = load 'users' as (name, age, ipaddr);  Clicks = load 'clicks' as (user, url, value);  ValuableClicks = filter Clicks by value > 0;  UserClicks = join Users by name, ValuableClicks by user;  Geoinfo = load 'geoinfo' as (ipaddr, dma);  UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;  ByDMA = group UserGeo by dma;  ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);  store ValuableClicksPerDMA into 'ValuableClicksPerDMA';  SQL insert into ValuableClicksPerDMA select dma, count(*) from geoinfo join ( select name, ipaddr from users join clicks on (users.name = clicks.user) where value > 0 ) using (ipaddr) group by dma; 19
  • 19.  Joining datasets  Grouping data  Referring to elements by position rather than name ($0, $1, etc)  Loading non-delimited data using a custom SerDe (Writing a custom Reader and Writer)  Creation of user-defined functions (UDF), written in Java  And more.. 20
  • 21. Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop file systems) from your workstation. Download a stable release from http://hadoop.apache.org/pig/releases.html and unpack the tarball in a suitable place on your workstation: % tar xzf pig-x.y.z.tar.gz It’s convenient to add Pig’s binary directory to your command-line path. For example: % export PIG_INSTALL=/home/tom/pig-x.y.z % export PATH=$PATH:$PIG_INSTALL/bin You also need to set the JAVA_HOME environment variable to point to a suitable Java installation. 22
  • 22.  Execution Types  Local mode (pig -x local)  Hadoop mode  Pig must be configured to the cluster’s namenode and jobtracker 1. Put hadoop config directory in PIG classpath % export PIG_CLASSPATH=$HADOOP_INSTALL/conf/ 2. Create a pig.properties fs.default.name=hdfs://localhost/ mapred.job.tracker=localhost:8021 23
  • 23.  Script: Pig can run a script file that contains Pig commands. For example, % pig script.pig Runs the commands in the local file ”script.pig”. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.  Grunt: Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run, and the -e option is not used. Note: It is also possible to run Pig scripts from within Grunt using run and exec.  Embedded: You can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java. There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig 24
  • 24.  PigLatin: A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, 1. A GROUP operation is a type of statement: grunt> grouped_records = GROUP records BY year; 2. The command to list the files in a Hadoop filesystem is another example of a statement: ls / 3. A LOAD operation to load data from a tab-separated file into a PIG record: grunt> records = LOAD 'sample.txt' AS (year:chararray, temperature:int, quality:int);  Data: In Pig, a single element of data is an atom. A collection of atoms – such as a row, or a partial row – is a tuple. Tuples are collected together into bags. Atom –> Tuple (row / partial row) –> Bag 25
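As a rough analogy (Python stand-ins chosen for illustration, not Pig's actual Java classes), the atom / tuple / bag hierarchy looks like:

```python
# Atom: a single value (chararray, int, double, ...)
atom = "Krishna"

# Tuple: an ordered list of fields -- a row or partial row.
row = ("Krishna", 234000000.0)

# Bag: an unordered collection of tuples.
bag = {("Krishna", 234000000.0), ("Shashi", 10000.0)}

# Unlike an RDBMS row, fields are not tied to one declared type,
# and bags can nest inside tuples (Pig's model is nested, not flat):
nested = ("cloud", {("Shashi",), ("Sachith",)})
print(len(bag), nested[0])
```

This nesting is why Pig can GROUP a relation and hand the whole group to a function like COUNT: the grouped rows travel together as a bag inside one tuple.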
  • 25. Example contents of 'employee.txt', a tab-delimited text file:
1	Krishna	234000000	none
2	Krishna_01	234000000	none
124163	Shashi	10000	cloud
124164	Gopal	1000000	setlabs
124165	Govind	1000000	setlabs
124166	Ram	450000	es
124167	Madhusudhan	450000	e&r
124168	Hari	6500000	e&r
124169	Sachith	50000	cloud
26
  • 26. --Loading data from employee.txt into the empls bag, with a schema empls = LOAD 'employee.txt' AS (id:int, name:chararray, salary:double, dept:chararray); --Filtering the data as required rich = FILTER empls BY $2 > 100000; --Sorting sortd = ORDER rich BY salary DESC; --Storing the final results STORE sortd INTO 'rich_employees.txt'; -- Or alternatively we can dump the records on the screen DUMP sortd; ------------------------------------------------------------------- --Group by salary grp = GROUP empls BY salary; --Get count of employees in each salary group cnt = FOREACH grp GENERATE group, COUNT(empls.id) AS emp_cnt; 27
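For intuition, here is what that script computes, sketched in plain Python over a few of the employee.txt rows shown earlier (the rows are embedded inline instead of being loaded from HDFS):

```python
# Plain-Python sketch of the LOAD -> FILTER -> ORDER and GROUP -> COUNT steps.
empls = [(124163, "Shashi", 10000.0, "cloud"),
         (124164, "Gopal", 1000000.0, "setlabs"),
         (124166, "Ram", 450000.0, "es"),
         (124167, "Madhusudhan", 450000.0, "e&r")]

rich = [e for e in empls if e[2] > 100000]              # FILTER empls BY $2 > 100000
sortd = sorted(rich, key=lambda e: e[2], reverse=True)  # ORDER rich BY salary DESC

# GROUP empls BY salary, then COUNT ids per group.
cnt = {}
for _id, _name, salary, _dept in empls:
    cnt[salary] = cnt.get(salary, 0) + 1

print(sortd[0][1], cnt[450000.0])  # Gopal 2
```

Note how positional references ($2) and named references (salary) pick out the same field; Pig allows both.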
  • 27.  To view the schema of a relation  DESCRIBE empls;  To view step-by-step execution of a series of statements  ILLUSTRATE empls;  To view the execution plan of a relation  EXPLAIN empls;  Join two data sets data1 = LOAD 'data1' AS (col1, col2, col3, col4); data2 = LOAD 'data2' AS (colA, colB, colC); jnd = JOIN data1 BY col3, data2 BY colA PARALLEL 50; STORE jnd INTO 'outfile'; 28
  • 28.  Load using PigStorage  empls = LOAD 'employee.txt' USING PigStorage('\t') AS (id:int, name:chararray, salary:double, dept:chararray);  Store using PigStorage  STORE sortd INTO 'rich_employees.txt' USING PigStorage('\t'); 29
  • 29. Flexibility with PIG Is that all we can do with the PIG!!?? 30
  • 30.  Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig  All UDF names are case-sensitive 31
  • 31.  Eval Functions (EvalFunc)  Ex: StringConcat (built-in) : Generates the concatenation of the first two fields of a tuple.  Aggregate Functions (EvalFunc & Algebraic)  Ex: COUNT, AVG ( both built-in)  Filter Functions (FilterFunc)  Ex: IsEmpty (built-in)  Load/Store Functions (LoadFunc/ StoreFunc)  Ex: PigStorage (built-in) Note: URL for built in functions: http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html 32
  • 32.  Piggy Bank  Piggy Bank is a place for Pig users to share their functions  DataFu (Linkedin’s collection of UDF’s)  Hadoop library for large-scale data processing 33
  • 33. package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
34
  • 34. -- myscript.pig  REGISTER myudfs.jar; Note: myudfs.jar should not be surrounded with quotes  A = LOAD 'employee_data' AS (id: int,name: chararray, salary: double, dept: chararray);  B = FOREACH A GENERATE myudfs.UPPER(name);  DUMP B; 35
  • 35.   java -cp pig.jar org.apache.pig.Main -x local myscript.pig or pig -x local myscript.pig Note: myudfs.jar should be in class path!  Locating an UDF jar file  Pig first checks the classpath.  Pig assumes that the location is either an absolute path or a path relative to the location from which Pig was invoked 36
  • 37. All is well, but.. What about the performance trade offs? 38

Editor's notes

  1. BIG Data - datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. A data management system which is Highly Available, Reliable, Transparent, High Performance, Scalable, Accessible, Secure, Usable, and Inexpensive.
  2. Source: Wikipedia
  3. Source: Internet (Googling)
  4. Facebook statisticsURL: http://www.facebook.com/press/info.php?factsheet
  5. Img Source: Yahoo hadoop website. “Pig Makes Hadoop Easy to Drive” Pig Vs Hive: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/ Pig Vs SQL: http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/
  6. Input: User profiles, Page visits. Find the top 5 most visited pages by users aged 18-25
  7. Input: User profiles, Page visitsFind the top 5 most visited pages by usersaged 18-25
  8. http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/ insert into ValuableClicksPerDMA select dma, count(*) from geoinfo join ( select name, ipaddr from users join clicks on (users.name = clicks.user) where value > 0 ) using (ipaddr) group by dma; The Pig Latin for this will look like: Users = load 'users' as (name, age, ipaddr); Clicks = load 'clicks' as (user, url, value); ValuableClicks = filter Clicks by value > 0; UserClicks = join Users by name, ValuableClicks by user; Geoinfo = load 'geoinfo' as (ipaddr, dma); UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr; ByDMA = group UserGeo by dma; ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo); store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
  10. SerDe - Serializer/Deserializer
  11. Execution Types: Pig has two execution types or modes: local mode and Hadoop mode. Local mode (pig -x local): In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets, and when trying out Pig. Local mode does not use Hadoop. In particular, it does not use Hadoop's local job runner; instead, Pig translates queries into a physical plan that it executes itself. Hadoop mode: In Hadoop mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. Hadoop mode (with a fully distributed cluster) is what you use when you want to run Pig on large datasets.
  12. import java.io.IOException; import org.apache.pig.PigServer; public class WordCount { public static void main(String[] args) { PigServer pigServer = new PigServer(); try { pigServer.registerJar("/mylocation/tokenize.jar"); runMyQuery(pigServer, "myinput.txt"); } catch (IOException e) { e.printStackTrace(); } } public static void runMyQuery(PigServer pigServer, String inputFile) throws IOException { pigServer.registerQuery("A = load '" + inputFile + "' using TextLoader();"); pigServer.registerQuery("B = foreach A generate flatten(tokenize($0));"); pigServer.registerQuery("C = group B by $1;"); pigServer.registerQuery("D = foreach C generate flatten(group), COUNT(B.$0);"); pigServer.store("D", "myoutput"); } }
  13. PIG | RDBMS: Atom ~ Cell, Tuple ~ Row, Bag ~ Table
  14. Example contents of 'employee.txt', a tab delimited text: 1 Krishna 234000000 none / 2 Krishna_01 234000000 none / 124163 Shashi 10000 cloud / 124164 Gopal 1000000 setlabs / 124165 Govind 1000000 setlabs / 124166 Ram 450000 es / 124167 Madhusudhan 450000 e&r / 124168 Hari 6500000 e&r / 124169 Sachith 50000 cloud
  15. Example contents of 'people.txt', a tab delimited text: 1 Krishna 234000000 none / 2 Krishna_01 234000000 none / 124163 Shashi 10000 cloud / 124164 Gopal 1000000 setlabs / 124165 Govind 1000000 setlabs / 124166 Ram 450000 es / 124167 Madhusudhan 450000 e&r / 124168 Hari 6500000 e&r / 124169 Sachith 50000 cloud --Loading data from people.txt into emps bag and with a schema emps = LOAD 'people.txt' AS (id:int, name:chararray, salary:double, dept:chararray); --Filtering the data as required rich = FILTER emps BY $2 > 100000; --Sorting srtd = ORDER rich BY salary DESC; --Storing the final results STORE srtd INTO 'rich_people.txt'; -- Or alternatively we can dump the record on the screen DUMP srtd; Import data using SQOOP: 1. Import movie: sqoop import --connect jdbc:mysql://localhost/movielens --table movie --fields-terminated-by '\t' --username training --password training 2. Import movierating: sqoop import --connect jdbc:mysql://localhost/movielens --table movierating --fields-terminated-by '\t' --username training --password training
  16. The PARALLEL keyword only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block. At most 2 map or reduce tasks can run on a machine simultaneously. grunt> personal = load 'personal.txt' as (empid,name,phonenumber); grunt> official = load 'official.txt' as (empid,dept,dc); grunt> joined = join personal by empid, official by empid; grunt> dump joined;
  17. http://pig.apache.org/docs/r0.7.0/udf.html Eval Functions: Eval is the most common type of function. How to write? UPPER extends EvalFunc<String>. Code snippet: -- myscript.pig REGISTER myudfs.jar; A = LOAD 'employee_data' AS (id: int, name: chararray, salary: double, dept: chararray); B = FOREACH A GENERATE myudfs.UPPER(name); DUMP B; Sample UDF: package myudfs; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; import org.apache.pig.impl.util.WrappedIOException; public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; try { String str = (String)input.get(0); return str.toUpperCase(); } catch(Exception e) { throw WrappedIOException.wrap("Caught exception processing input row ", e); } } } How to execute the above script? java -cp pig.jar org.apache.pig.Main -x local myscript.pig or pig -x local myscript.pig. Note: myudfs.jar should be in class path! Aggregate Functions: An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. Aggregate functions are usually applied to grouped data. How to write? COUNT extends EvalFunc<Long> implements Algebraic. Ex: COUNT, AVG (built-in). Filter Functions: Filter functions are eval functions that return a boolean value. Filter functions can be used anywhere a Boolean expression is appropriate, including the FILTER operator. Ex: IsEmpty (built-in). How to write? IsEmpty extends FilterFunc. How to use it? D = FILTER C BY not IsEmpty(A); Load/Store Functions: The load/store user-defined functions control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output, but that does not have to be the case. Ex: PigStorage (built-in). How to write?
LOAD: SimpleTextLoader extends LoadFunc STORE: SimpleTextStorer extends StoreFunc
  18. Tuple: An ordered list of Data. A tuple has fields, numbered 0 through (number of fields - 1). The entry in the field can be any datatype, or it can be null. Tuples are constructed only by a TupleFactory. A DefaultTupleFactory is provided by the system. If a user wishes to use their own type of Tuple, they should also provide an implementation of TupleFactory to construct their types of Tuples. Fields are numbered from 0.