1© Copyright 2012 EMC Corporation. All rights reserved.
MapReduce
Design Patterns
Donald Miner
Greenplum Hadoop Solutions Architect
@donaldpminer
Book was made available December 2012
Inspiration for my book
What are design patterns?
(in general)
Reusable solutions to problems
Domain independent
Not a cookbook, but not a guide
Not a finished solution
Why design patterns?
(in general)
Makes the intent of code easier to understand
Provides a common language for solutions
Be able to reuse code
Known performance profiles and limitations of solutions
Why MapReduce design patterns?
Recurring patterns in data-related problem solving
Groups are building patterns independently
Lots of new users every day
MapReduce is a new way of thinking
Foundation for higher-level tools (Pig, Hive, …)
Community is reaching the right level of maturity
Pattern Template
Intent
Motivation
Applicability
Structure
Consequences
Resemblances
Performance analysis
Examples
Pattern Categories
Summarization
Filtering
Data Organization
Joins
Metapatterns
Input and output
Filtering patterns
Extract interesting subsets
“I only want some of my data!”
Filtering
Bloom filtering
Top ten
Distinct
Summarization patterns
Top-down summaries
“I only want a top-level view of my data!”
Numerical summarizations
Inverted index
Counting with counters
Data organization patterns
Reorganize, restructure
“I want to change the way my data is organized!”
Structured to hierarchical
Partitioning
Binning
Total order sorting
Shuffling
Join patterns
Bringing data sets together
“I want to mash my different data sources together!”
Reduce-side join
Replicated join
Composite join
Cartesian product
Metapatterns
Patterns of patterns
“I want to solve a complex problem with multiple patterns!”
Job chaining
Chain folding
Job merging
Input and output patterns
Custom input and output
“I want to get data or put data in an unusual place!”
Generating data
External source output
External source input
Partition pruning
Pattern: “Top Ten”
(filtering)
Intent
Retrieve a relatively small number of top K records, according
to a ranking scheme in your data set, no matter how large
the data.
Motivation
Finding outliers
Top ten lists are fun
Building dashboards
Sorting/Limit isn’t going to work here
Pattern: “Top Ten”
Applicability
Rank-able records
Limited number of output records
Consequences
The top K records are returned.
Pattern: “Top Ten”
Structure
class mapper:
    setup():
        initialize top ten sorted list
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10:
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit null, record
class reducer:
    setup():
        initialize top ten sorted list
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
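The two-phase structure above can be tried out without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: the class and method names (`TopKSketch`, `localTopK`, `mergeTopK`) are hypothetical, records are reduced to bare integer scores, and a min-heap stands in for the "sorted list" so that equal scores are both kept.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java illustration of the two-phase top-K structure (no Hadoop dependency). */
class TopKSketch {

    /** "Map side": keep only the k largest scores seen in one input split. */
    static List<Integer> localTopK(List<Integer> scores, int k) {
        // Min-heap: the smallest score still kept sits on top and is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int score : scores) {
            heap.add(score);
            if (heap.size() > k) {
                heap.poll(); // drop the current smallest
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // largest first
        return result;
    }

    /** "Reduce side": pool the per-split candidates and cut down to k again. */
    static List<Integer> mergeTopK(List<List<Integer>> partials, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> partial : partials) {
            candidates.addAll(partial);
        }
        return localTopK(candidates, k);
    }

    public static void main(String[] args) {
        List<Integer> splitA = List.of(5, 42, 17, 8);
        List<Integer> splitB = List.of(99, 3, 42, 1);
        List<List<Integer>> partials = List.of(localTopK(splitA, 2), localTopK(splitB, 2));
        System.out.println(mergeTopK(partials, 2)); // prints [99, 42]
    }
}
```

Each split only ever holds K candidates in memory, which is why the pattern stays cheap no matter how large the input is.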
Pattern: “Top Ten”
Resemblances
SQL:
SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
Pig:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;
Pattern: “Top Ten”
Performance analysis
Pretty quick: map-heavy, low network usage
Pay attention to how many records the reducer is getting
[number of input splits] x K
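To get a feel for the reducer load estimated above, here is a small worked calculation. It is only a sketch under stated assumptions: the class and method names are hypothetical, one map task is assumed per input split, and the 1 TiB input with 128 MiB splits is an illustrative figure rather than one from the talk. A split with fewer than K records emits fewer, so the result is an upper bound.

```java
/** Back-of-the-envelope estimate of how many records the single reducer receives. */
class ReducerLoadSketch {

    /** Number of input splits, assuming fixed-size splits (ceiling division). */
    static long inputSplits(long totalBytes, long splitBytes) {
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    /** Upper bound on reducer input: each mapper emits at most k records. */
    static long reducerInputRecords(long splits, int k) {
        return splits * k;
    }

    public static void main(String[] args) {
        long splits = inputSplits(1L << 40, 128L << 20); // 1 TiB of input, 128 MiB per split
        System.out.println(splits);                          // prints 8192
        System.out.println(reducerInputRecords(splits, 10)); // prints 81920
    }
}
```

Tens of thousands of records is a trivial load for one reducer, but the product grows linearly in both the split count and K, so a much larger K can turn the reducer into the bottleneck.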
Example
Top ten StackOverflow users by reputation
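// Imports needed at the top of the enclosing source file:
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
    // Reputation -> user record, sorted ascending. Records with equal reputation overwrite each other.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context) {
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String reputation = parsed.get("Reputation");
        if (reputation == null) {
            return; // skip lines that carry no Reputation attribute
        }
        // Copy the Text: Hadoop reuses the same value object across map() calls.
        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
        if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey()); // drop the lowest reputation
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top ten once all of its input has been seen.
        for (Text t : repToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}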
Top Ten Mapper
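// Imports needed at the top of the enclosing source file:
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    // Reputation -> user record, sorted ascending. Records with equal reputation overwrite each other.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
            repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));
            if (repToRecordMap.size() > 10) {
                repToRecordMap.remove(repToRecordMap.firstKey()); // drop the lowest reputation
            }
        }
        // Write the final top ten, highest reputation first.
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }
}
Top Ten Reducer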
Pattern: “Bloom Filtering”
(filtering)
Intent
Keep records that are a member of some predefined set of
values. It is not a problem if the output is a bit inaccurate.
Motivation
Similar to normal Boolean filtering, but we are filtering on set
membership
Set membership is evaluated with a Bloom filter
Pattern: “Bloom Filtering”
Applicability
A feature can be extracted and tested for set membership
Predetermined set is available
Some false positives are acceptable
Consequences
Records that pass the Bloom filter membership test are returned
Known Uses
Keep all records in a watch list (and a few records that aren’t)
Pre-filtering records before an expensive membership test
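How many false positives are "acceptable" can be estimated before building the filter. The standard approximation is sketched below; the symbols are not from the slides and are introduced here for illustration: m is the number of bits in the filter, n the number of values in the predetermined set, k the number of hash functions, and p the false-positive probability.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\,\ln 2
```

As a rough guide, allotting about 10 bits per set member (m/n = 10) with k = 7 gives a false-positive rate of roughly 0.8%.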
Pattern: “Bloom Filtering”
Structure
class mapper:
    setup():
        load bloom filter into memory
    map(key, record):
        if record in bloom filter:
            emit (record, null)
Resemblances
UDFs?
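To make the membership test in the structure above concrete, here is a minimal, self-contained Bloom filter in plain Java. It is a sketch under assumptions, not the implementation used in the talk: the class name `BloomFilterSketch`, its method names, and the choice of hash functions (`String.hashCode` combined with FNV-1a through double hashing) are all illustrative. A real job would more likely load a prebuilt filter in `setup()`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy Bloom filter: no false negatives, occasional false positives. */
class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Second hash function (32-bit FNV-1a over the string's chars). */
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    /** i-th bit position for a value, derived by double hashing. */
    private int index(String value, int i) {
        return Math.floorMod(value.hashCode() + i * fnv1a(value), numBits);
    }

    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    /** True if the value may be in the set; false means it is definitely not. */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    /** The map-only filtering step: keep records that pass the membership test. */
    List<String> keepMembers(List<String> records) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch watchList = new BloomFilterSketch(1024, 3);
        watchList.add("alice");
        watchList.add("bob");
        // "alice" and "bob" always pass; "carol" passes only on a false positive.
        System.out.println(watchList.keepMembers(List.of("alice", "carol", "bob")));
    }
}
```

Each lookup touches a fixed number of bits regardless of how many values were added, which is what makes the test constant time.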
Pattern: “Bloom Filtering”
Performance analysis
Map-only
Slight overhead in moving Bloom filter into memory
Bloom filter membership tests are constant time
Example
Filter out StackOverflow comments that do not contain a keyword
Distributed HBase query using a Bloom filter
Candidate new patterns
Link Graph processing patterns (new category)
– Shortest path, diameter, graph stats, connected components, etc.
– Too domain specific?
– Has its own distinct patterns
Projection (filtering)
– Remove “columns” of data
Transformation (data organization?)
– Take a data set but transform it into something else
Future and call to action
Contributing your own patterns
Trends in the nature of data
– Images, audio, video, biomedical, social …
Libraries, abstractions, and tools
Ecosystem patterns: YARN, HBase, ZooKeeper, …

 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

MapReduce Design Patterns

  • 1. © Copyright 2012 EMC Corporation. All rights reserved. MapReduce Design Patterns. Donald Miner, Greenplum Hadoop Solutions Architect, @donaldpminer
  • 2. Book was made available December 2012
  • 3. Inspiration for my book
  • 4. What are design patterns? (in general): Reusable solutions to problems; domain independent; not a cookbook, but not a guide; not a finished solution.
  • 5. Why design patterns? (in general): Makes the intent of code easier to understand; provides a common language for solutions; be able to reuse code; known performance profiles and limitations of solutions.
  • 6. Why MapReduce design patterns? Recurring patterns in data-related problem solving; groups are building patterns independently; lots of new users every day; MapReduce is a new way of thinking; foundation for higher-level tools (Pig, Hive, …); community is reaching the right level of maturity.
  • 7. Pattern Template: Intent; Motivation; Applicability; Structure; Consequences; Resemblances; Performance analysis; Examples.
  • 8. Pattern Categories: Summarization; Filtering; Data Organization; Joins; Metapatterns; Input and output.
  • 9. Filtering patterns (extract interesting subsets; “I only want some of my data!”): Filtering; Bloom filtering; Top ten; Distinct. Summarization patterns (top-down summaries; “I only want a top-level view of my data!”): Numerical summarizations; Inverted index; Counting with counters.
  • 10. Data organization patterns (reorganize, restructure; “I want to change the way my data is organized!”): Structured to hierarchical; Partitioning; Binning; Total order sorting; Shuffling. Join patterns (bringing data sets together; “I want to mash my different data sources together!”): Reduce-side join; Replicated join; Composite join; Cartesian product.
  • 11. Metapatterns (patterns of patterns; “I want to solve a complex problem with multiple patterns!”): Job chaining; Chain folding; Job merging. Input and output patterns (custom input and output; “I want to get data or put data in an unusual place!”): Generating data; External source output; External source input; Partition pruning.
  • 12. Pattern: “Top Ten” (filtering). Intent: Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data. Motivation: Finding outliers; top ten lists are fun; building dashboards; Sorting/Limit isn’t going to work here.
  • 13. Pattern: “Top Ten”. Applicability: Rank-able records; limited number of output records. Consequences: The top K records are returned.
  • 14. Pattern: “Top Ten”. Structure:

        class mapper:
            setup():
                initialize top ten sorted list
            map(key, record):
                insert record into top ten sorted list
                if length of list is greater than 10:
                    truncate list to a length of 10
            cleanup():
                for record in top ten sorted list:
                    emit null, record

        class reducer:
            setup():
                initialize top ten sorted list
            reduce(key, records):
                sort records
                truncate records to top 10
                for record in records:
                    emit record
  • 15. Pattern: “Top Ten”. Resemblances. SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10; Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;
  • 16. Pattern: “Top Ten”. Performance analysis: Pretty quick (map-heavy, low network usage); pay attention to how many records the reducer is getting: [number of input splits] x K. Example: Top ten StackOverflow users by reputation.
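  • 17. Top Ten Mapper:

        // Imports: java.io.IOException, java.util.Map, java.util.TreeMap,
        // org.apache.hadoop.io.{NullWritable, Text}, org.apache.hadoop.mapreduce.Mapper
        public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
            // Sorted by reputation; records with equal reputation overwrite each other.
            private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

            @Override
            public void map(Object key, Text value, Context context) {
                Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
                String userId = parsed.get("Id");
                String reputation = parsed.get("Reputation");
                // Copy the value: Hadoop reuses the Text object between calls.
                repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
                // Keep only the ten highest reputations seen so far.
                if (repToRecordMap.size() > 10) {
                    repToRecordMap.remove(repToRecordMap.firstKey());
                }
            }

            @Override
            protected void cleanup(Context context) throws IOException, InterruptedException {
                // Emit this mapper's local top ten once all of its input has been read.
                for (Text t : repToRecordMap.values()) {
                    context.write(NullWritable.get(), t);
                }
            }
        }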
  • 18. Top Ten Reducer:

        // Imports: java.io.IOException, java.util.Map, java.util.TreeMap,
        // org.apache.hadoop.io.{NullWritable, Text}, org.apache.hadoop.mapreduce.Reducer
        public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
            private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

            @Override
            public void reduce(NullWritable key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // All mappers emit under the same null key, so this call sees every candidate.
                for (Text value : values) {
                    Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
                    repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));
                    if (repToRecordMap.size() > 10) {
                        repToRecordMap.remove(repToRecordMap.firstKey());
                    }
                }
                // Emit the final top ten, highest reputation first.
                for (Text t : repToRecordMap.descendingMap().values()) {
                    context.write(NullWritable.get(), t);
                }
            }
        }
  • 19. Pattern: “Bloom Filtering” (filtering). Intent: Keep records that are a member of some predefined set of values. It is not a problem if the output is a bit inaccurate. Motivation: Similar to normal Boolean filtering, but we are filtering on set membership; set membership is evaluated with a Bloom filter.
  • 20. Pattern: “Bloom Filtering”. Applicability: A feature can be extracted and tested for set membership; predetermined set is available; some false positives are acceptable. Consequences: Records that pass the Bloom filter membership test are returned. Known uses: Keep all records in a watch list (and a few records that aren’t); pre-filtering records before an expensive membership test.
  • 21. Pattern: “Bloom Filtering”. Structure:

        class mapper:
            setup():
                load bloom filter into memory
            map(key, record):
                if record in bloom filter:
                    emit (record, null)

      Resemblances: UDFs?
  • 22. Pattern: “Bloom Filtering”. Performance analysis: Map-only; slight overhead in moving Bloom filter into memory; Bloom filter membership tests are constant time. Example: Filter StackOverflow comments that do not contain a keyword; distributed HBase query using a Bloom filter.
  • 23. Candidate new patterns. Link Graph processing patterns (new category): shortest path, diameter, graph stats, connected components, etc.; too domain specific?; has its own distinct patterns. Projection (filtering): remove “columns” of data. Transformation (data organization?): take a data set but transform it into something else.
  • 24. Future and call to action: Contributing your own patterns; trends in the nature of data (images, audio, video, biomedical, social, …); libraries, abstractions, and tools; ecosystem patterns: YARN, HBase, ZooKeeper, …
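
The two-phase structure of the “Top Ten” pattern (slides 14 to 18) can be tried out without a cluster. The following is a minimal sketch in plain Java, not the code from the slides: the class name `TopKSketch` and the methods `localTopK` and `globalTopK` are hypothetical, the records are bare integers rather than XML user rows, and a min-heap is swapped in for the `TreeMap` so that equal values are kept rather than overwritten. It illustrates why the reducer receives at most [number of input splits] x K records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java simulation of the Top Ten pattern (hypothetical names, no Hadoop dependency). */
final class TopKSketch {

    /** "Mapper" phase: keep only the K largest values of one input split. */
    static List<Integer> localTopK(List<Integer> split, int k) {
        // Min-heap: the smallest retained value sits at the head and is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int value : split) {
            heap.add(value);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        return new ArrayList<>(heap);
    }

    /** "Reducer" phase: merge the per-split candidates into the global top K, largest first. */
    static List<Integer> globalTopK(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(localTopK(split, k));
        }
        // At most splits.size() * k candidates reach this point.
        List<Integer> result = localTopK(candidates, k);
        result.sort(Comparator.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = List.of(
                List.of(5, 42, 17, 8),
                List.of(99, 3, 42),
                List.of(1, 64));
        System.out.println(globalTopK(splits, 3)); // prints [99, 64, 42]
    }
}
```

With three splits and K = 3, the simulated reducer sees 8 candidates (3 + 3 + 2) no matter how large each split is, which is the property that keeps network usage low.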

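The “Bloom Filtering” structure on slide 21 can be sketched the same way. This is an illustrative, hand-rolled filter under stated assumptions, not the implementation used in the talk: the class `BloomFilterSketch`, its bit-array size, its hash count, and the double-hashing scheme built on `String.hashCode()` are all hypothetical choices. Hadoop ships its own `org.apache.hadoop.util.bloom.BloomFilter`, which this sketch deliberately does not use so that it runs on its own.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Plain-Java sketch of the Bloom filtering pattern (hypothetical names, no Hadoop dependency). */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Derive the i-th bit position from two base hashes (double hashing). */
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    /** "Mapper" phase: emit only records that pass the membership test. */
    static List<String> filter(List<String> records, BloomFilterSketch watchList) {
        List<String> emitted = new ArrayList<>();
        for (String record : records) {
            if (watchList.mightContain(record)) {
                emitted.add(record);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        BloomFilterSketch watchList = new BloomFilterSketch(1024, 3);
        watchList.add("alice");
        watchList.add("bob");
        // "alice" and "bob" are always emitted; "carol" only if it is a false positive.
        System.out.println(filter(List.of("alice", "carol", "bob"), watchList));
    }
}
```

Each membership test touches a fixed number of bits (`numHashes`), which is why the slides describe the test as constant time. Records on the watch list are never dropped; the only inaccuracy is the occasional extra record.
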
Editor's notes

  1. Quick overview of the book. This talk is not to sell you on the book; it's to sell you on why MRDPs are important for the community. Intermediate to advanced MapReduce resource. Early beginners and experts alike can find some use in it. Some knowledge of Hadoop is encouraged. Tom White’s Hadoop: The Definitive Guide is a good start.
  2. Story about explaining joins. Teaching Hadoop classes and explaining how to solve problems was challenging. Most of the stuff in this book is not novel; it’s been collected through different sources.
  3. Spend time talking about each and what purpose they have. Remember to mention that examples are in Hadoop.
  4. Just briefly outline