Hadoop Pig: MapReduce the easy way!

Nathan Bijnens

My presentation about Hadoop and Pig at the FOSDEM Data Devroom 2011.

Hadoop Pig:
MapReduce the easy way.

Nathan Bijnens
http://nathan.gs
@nathan_gs

We live in a world of data.

● Data analysis is becoming more and more important
● The complexity of analysis is increasing
● Meanwhile, the data we analyze grows big, fast!

Source: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron

Hadoop: Intro

Hadoop is an open-source Java framework aimed at data-intensive distributed applications.

It enables applications to work with thousands of nodes and petabytes of data.

Hadoop: Intro

Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.

http://labs.google.com/papers/mapreduce.html

Hadoop: HDFS

HDFS is a distributed, scalable filesystem designed to store large files.

In combination with the Hadoop JobTracker, it provides data locality: tasks are scheduled close to the data they read.

By default, it automatically replicates every block to 3 DataNodes; preferably, two copies are stored on two DataNodes within the same rack and one in another rack.
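
The placement rule above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS's actual placement code: the function name `place_replicas`, the `racks` dictionary, and the rack and node names are all hypothetical, and the sketch ignores details such as where the writing client runs.

```python
import random

def place_replicas(racks, rng=None):
    """Choose three DataNodes for one block: two in one rack, one in another.

    `racks` maps a rack name to a list of DataNode names (hypothetical layout).
    """
    rng = rng or random.Random()
    # The rack that holds two copies needs at least two DataNodes.
    shared_candidates = sorted(r for r, nodes in racks.items() if len(nodes) >= 2)
    if not shared_candidates:
        raise ValueError("no rack has two DataNodes")
    shared_rack = rng.choice(shared_candidates)
    # The third copy goes to any other rack that has at least one DataNode.
    other_candidates = sorted(
        r for r, nodes in racks.items() if r != shared_rack and nodes
    )
    if not other_candidates:
        raise ValueError("need a second rack with at least one DataNode")
    other_rack = rng.choice(other_candidates)
    first, second = rng.sample(racks[shared_rack], 2)
    third = rng.choice(racks[other_rack])
    return [(shared_rack, first), (shared_rack, second), (other_rack, third)]


# Example with a made-up two-rack cluster.
cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
print(place_replicas(cluster, random.Random(42)))
```

Keeping two copies in one rack limits cross-rack traffic, while the third copy survives the loss of a whole rack.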

Hadoop: HDFS

● NameNode
  ● Keeps track of what is stored where
  ● In memory
  ● Single Point of Failure
● DataNodes
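
As a rough mental model of "keeps track of what is stored where, in memory", the NameNode can be pictured as two dictionaries. The class and method names below are invented for illustration and do not correspond to Hadoop's real classes.

```python
class ToyNameNode:
    """Toy in-memory index: file path -> block ids -> DataNode names."""

    def __init__(self):
        self.blocks_by_file = {}  # path -> [block_id, ...] in file order
        self.nodes_by_block = {}  # block_id -> [datanode, ...]

    def add_block(self, path, block_id, datanodes):
        """Record that `block_id` is the next block of `path`, stored on `datanodes`."""
        self.blocks_by_file.setdefault(path, []).append(block_id)
        self.nodes_by_block[block_id] = list(datanodes)

    def locate(self, path):
        """Return [(block_id, [datanodes])]; the data itself is read from the DataNodes."""
        return [(b, self.nodes_by_block[b]) for b in self.blocks_by_file[path]]


namenode = ToyNameNode()
namenode.add_block("/logs/access.log", "blk_1", ["dn1", "dn2", "dn4"])
namenode.add_block("/logs/access.log", "blk_2", ["dn2", "dn3", "dn5"])
print(namenode.locate("/logs/access.log"))
```

Because this index lives in the memory of a single process, losing that process means losing the way to find every block, which is why the slide calls the NameNode a single point of failure.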

Hadoop: HDFS

Source: Practical Problem Solving with Hadoop and Pig by Milind Bhandarkar
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

MapReduce

MapReduce works by breaking processing into two phases, a map phase and a reduce phase, each defined by a function.

MapReduce

● Input
● Map
● Shuffle
● Reduce
● Output

Source: Practical Problem Solving with Hadoop and Pig by Milind Bhandarkar
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
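
The five steps (Input, Map, Shuffle, Reduce, Output) can be imitated on one machine with a word count. This is only a single-process sketch of the data flow; the function names are made up, and a real Hadoop job spreads the map and reduce calls over many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Input -> Map -> Shuffle -> Reduce -> Output, all in one process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


print(run_job(["pig latin", "pig on hadoop"]))
```

The programmer writes only `map_phase` and `reduce_phase`; the shuffle in between is what the framework provides.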

Use Cases: Who & how it's used

MassiveMedia / Netlog
● Cases
  ● Traffic analysis
  ● User actions
  ● ...
● On a 7-node cluster.

Use Cases: Who & how it's used

Yahoo!
● Cases
  ● Ad Systems
  ● Web Search
  ● ...
● More than 36,000 nodes!

Source: http://wiki.apache.org/hadoop/PoweredBy

Use Cases: When not to use

SETI@home
● Highly CPU-oriented
● Data locality is unimportant!

Hadoop Pig: Intro

Pig is a high-level data flow language.

Hadoop Pig: 3 components

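● Pig Latin: the data flow language itself
● Grunt: the interactive shell
● PigServer: a Java class for running Pig from Java code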

Hadoop Pig
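-- Load the employee records (assuming the file is comma-separated;
-- PigStorage() without arguments expects tabs).
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);

-- One group per department; each group holds a bag named "data".
grouped_by_department = GROUP data BY department;

-- Count the employees and sum the wages per department.
total_wage_by_department =
    FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;

-- Sort by total wage (ascending by default) and keep the first 10 rows.
total_ordered = ORDER total_wage_by_department BY total_wage;

total_limited = LIMIT total_ordered 10;

DUMP total_limited;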
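
For readers new to Pig Latin, here is roughly what the script above computes, written as plain single-machine Python. It is a sketch for comparison only: the function name and the sample rows are invented, and it does not model Pig's handling of null values.

```python
from collections import defaultdict

def departments_by_total_wage(rows, limit=10):
    """rows: iterable of (first_name, last_name, age, wage, department)."""
    totals = defaultdict(lambda: [0, 0.0])  # department -> [employee_count, total_wage]
    for _first, _last, _age, wage, department in rows:  # GROUP ... BY department
        totals[department][0] += 1                       # COUNT(data)
        totals[department][1] += wage                    # SUM of the wage field
    summary = [(dept, count, total) for dept, (count, total) in totals.items()]
    summary.sort(key=lambda row: row[2])                 # ORDER ... BY total_wage
    return summary[:limit]                               # LIMIT 10


# Made-up sample rows.
employees = [
    ("Ada", "Jones", 30, 100.0, "eng"),
    ("Bob", "Smith", 40, 50.0, "ops"),
    ("Cy", "Young", 25, 200.0, "eng"),
]
print(departments_by_total_wage(employees))
```

The Pig version is about the same length, but Pig compiles it into MapReduce jobs that run across the cluster.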

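-- Load the book catalogue and the sales records
-- (assuming comma-separated files; Pig reads .bz2 input directly).
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    book_name:chararray,
    author_name:chararray
);

book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    price:float,
    country:chararray
);

-- Optional: keep a single author only (MATCHES takes a regular expression).
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');

-- Join on book_id, using 12 reducers.
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;

grouped_by_book = GROUP data BY books::book_name;

-- price is unambiguous after the join, so data.price needs no book_sales:: prefix.
total_sales_by_book =
    FOREACH grouped_by_book
    GENERATE
        group AS book,
        COUNT(data) AS sales_volume,
        SUM(data.price) AS total_sales;

STORE total_sales_by_book INTO 'book_sale_results';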
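
The join-then-group script above corresponds roughly to the following single-machine Python. This is a sketch under assumptions: the function name and sample data are invented, and it assumes each book_id appears only once in books, whereas Pig's JOIN handles repeated keys on both sides.

```python
from collections import defaultdict

def total_sales_by_book(books, book_sales):
    """books: (book_id, book_name, author_name); book_sales: (book_id, price, country)."""
    # Assumption: book_id is unique in books, so a dict lookup can stand in for the join.
    name_by_id = {book_id: name for book_id, name, _author in books}
    totals = defaultdict(lambda: [0, 0.0])  # book_name -> [sales_volume, total_sales]
    for book_id, price, _country in book_sales:
        name = name_by_id.get(book_id)
        if name is None:
            continue  # inner join: a sale without a matching book is dropped
        totals[name][0] += 1
        totals[name][1] += price
    return {name: (volume, total) for name, (volume, total) in totals.items()}


# Made-up sample data.
books = [(1, "Snow", "Pamuk"), (2, "Dune", "Herbert")]
sales = [(1, 10.0, "BE"), (1, 5.0, "NL"), (3, 7.0, "FR")]
print(total_sales_by_book(books, sales))
```

In the Pig script, `PARALLEL 12` asks for 12 reduce tasks for the join; the Python sketch has no equivalent because it runs in one process.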

UDF

● Custom Load and Store classes.
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog
● Custom extraction, e.g. date, ...

Take a look at the PiggyBank.
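
As an illustration of a "custom extraction" UDF, here is a small date extractor. The slides do not say which language their UDFs use; this sketch uses Pig's Python (Jython) UDF support rather than the Java UDFs found in PiggyBank. The function name, the schema string, and the file name mentioned below are hypothetical. `outputSchema` is the decorator Pig provides to Jython UDFs; the fallback definition only exists so the file also runs outside Pig.

```python
try:
    outputSchema  # supplied by Pig when this file is registered as a Jython UDF
except NameError:
    def outputSchema(_schema):
        # Stand-in so the function can be run and tested outside Pig.
        return lambda func: func

MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

@outputSchema("day:chararray")
def extract_day(timestamp):
    """Turn a combined-log timestamp like '12/Feb/2011:10:15:32 +0100' into '2011-02-12'."""
    if timestamp is None:
        return None
    date_part = timestamp.split(":", 1)[0]  # e.g. '12/Feb/2011'
    day, month, year = date_part.split("/")
    return "%s-%02d-%02d" % (year, MONTHS[month], int(day))


print(extract_day("12/Feb/2011:10:15:32 +0100"))
```

Assuming the file is saved as udfs.py, it could be made available in a script with `REGISTER 'udfs.py' USING jython AS myudfs;` and then called as `myudfs.extract_day(...)` inside a FOREACH ... GENERATE.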

Some alternatives

● Hive
● Streaming
● Native Java MapReduce

Questions?

Thank you for listening!