Pig Latin is a high-level data flow language that compiles down to MapReduce jobs. It was developed at Yahoo! as part of Pig, a platform for analyzing large datasets. Users express data analysis programs as sequences of operations such as load, group, join, and filter, which is a more natural data flow model than the rigid map-then-reduce structure of MapReduce. The language aims for a sweet spot between the declarative style of SQL and the low-level, procedural style of MapReduce.
Pig
1. PIG: High Level Data Flow Language
COPYRIGHT (C) CHIRAG AHUJA
2. Outline
Map-Reduce and the need for Pig Latin
Pig Latin
Compilation into Map-Reduce
Implementation
Comparison with Map-Reduce
Optimization in Pig
3. The Map-Reduce Appeal
Scale: scalable due to simpler design
• Only parallelizable operations
• No transactions
Cost: runs on cheap commodity hardware
Procedural control instead of SQL: a processing “pipe”
4. Map-Reduce
[Figure: input records (k1 v1, k2 v2, k1 v3, k2 v4, k1 v5) pass through two map tasks, are grouped by key (k1: v1, v3, v5; k2: v2, v4), and go to two reduce tasks that emit the output records.]
Just a group-by-aggregate?
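The map, group-by-key, reduce flow on this slide can be imitated in a few lines of single-process Python. This is only a rough sketch: the function names (`map_reduce`, `identity_map`, `sum_reduce`) and the numeric values are assumptions made for illustration, not part of Hadoop or of the original slides.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Single-process imitation of one map-reduce pass (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each input record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: aggregate the values of each key independently.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input mirroring the slide's keys; the numbers are made up.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]

def identity_map(record):
    # Pass each (key, value) record through unchanged.
    yield record

def sum_reduce(key, values):
    # Aggregate all values seen for one key.
    return sum(values)

result = map_reduce(records, identity_map, sum_reduce)
print(result)  # {'k1': 9, 'k2': 6}
```

Seen this way, a single pass is close to a group-by followed by an aggregate, which is the question the slide raises.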
6. Disadvantages
1. Extremely rigid data flow: map (M) followed by reduce (R)
• Other flows constantly hacked in: join, union, split, and chains (e.g. M → M → R → M)
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
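To make the second and third points concrete, here is a rough sketch of what a hand-coded join tends to look like in the map-reduce style. Everything in it (the sample data, the tags, and the names `map_tagged`, `reduce_join`, `hand_coded_join`) is hypothetical and exists only for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory inputs; the field layout is an assumption.
visit_counts = [("a.com", 3), ("b.com", 1)]
url_info = [("a.com", "news"), ("b.com", "sports"), ("c.com", "news")]

def map_tagged(visit_counts, url_info):
    # Map: tag each record with its source so the reducer can tell them apart.
    for url, count in visit_counts:
        yield url, ("count", count)
    for url, category in url_info:
        yield url, ("info", category)

def reduce_join(url, tagged_values):
    # Reduce: pair every count with every category seen for the same url.
    counts = [v for tag, v in tagged_values if tag == "count"]
    categories = [v for tag, v in tagged_values if tag == "info"]
    return [(url, c, cat) for c in counts for cat in categories]

def hand_coded_join(visit_counts, url_info):
    # Shuffle: group the tagged values by join key.
    groups = defaultdict(list)
    for key, value in map_tagged(visit_counts, url_info):
        groups[key].append(value)
    joined = []
    for url in sorted(groups):
        joined.extend(reduce_join(url, groups[url]))
    return joined

joined = hand_coded_join(visit_counts, url_info)
print(joined)  # [('a.com', 3, 'news'), ('b.com', 1, 'sports')]
```

Nothing in the function signatures reveals that this is an inner join on url. That knowledge lives only inside the map and reduce bodies, which is why such code is hard to maintain and hard for a system to optimize.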
7. Pros And Cons
Need a high-level, general data flow language
8. Enter Pig Latin
Need a high-level, general data flow language
9. What is Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Compiles down to Map-Reduce jobs
Developed by Yahoo!
Open-source language
10. Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
11. In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
-- after a group, the grouping key is exposed as the field named "group"
visitCounts = foreach gVisits generate group as url, COUNT(visits) as visitCount;
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
-- top is not a Pig built-in; it is assumed here to be a user-defined function
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
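For readers without a Pig installation, the same data flow can be traced on a small in-memory data set. The following is a rough Python sketch, not a description of how Pig executes the script. The sample records, the function name `top_urls_by_category`, and the choice to rank URLs by visit count are all assumptions, since the slide does not say what top orders by.

```python
from collections import Counter, defaultdict

# Hypothetical sample data; the values are made up for illustration.
visits = [
    ("amy", "cnn.com", 1), ("bob", "cnn.com", 2), ("amy", "bbc.com", 3),
    ("cat", "espn.com", 4), ("bob", "espn.com", 5), ("amy", "espn.com", 6),
]
url_info = [
    ("cnn.com", "news", 0.9),
    ("bbc.com", "news", 0.8),
    ("espn.com", "sports", 0.7),
]

def top_urls_by_category(visits, url_info, n=10):
    # group visits by url, then count the visits in each group
    visit_counts = Counter(url for _user, url, _time in visits)
    # join the counts with url_info on url (inner join)
    joined = [
        (url, visit_counts[url], category)
        for url, category, _p_rank in url_info
        if url in visit_counts
    ]
    # group the joined rows by category
    by_category = defaultdict(list)
    for url, count, category in joined:
        by_category[category].append((url, count))
    # keep the n most visited urls per category (assumed ranking criterion)
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }

top = top_urls_by_category(visits, url_info, n=1)
print(top)  # {'news': [('cnn.com', 2)], 'sports': [('espn.com', 3)]}
```

Each step in the function corresponds to one statement of the Pig Latin script, which is the sense in which the script reads as a data flow.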
13. Implementation
[Figure: implementation stack. Labels: user; SQL or Pig or Map-Reduce; automatic rewrite + optimize; Hadoop Map-Reduce cluster.]
14. Java vs. Pig
[Figure: two bar charts comparing Hadoop (Java) and Pig. Left: lines of code. Right: development time in minutes.]
1/20 the lines of code
1/16 the development time
Performance is comparable (Java is slightly better)
15. Summary
Big demand for parallel data processing
◦ Emerging tools that do not look like SQL DBMS
◦ Programmers like dataflow pipes over static files
Hence the excitement about Map-Reduce
But, Map-Reduce is too low-level and rigid
Pig Latin
Sweet spot between map-reduce and SQL