20140427 parallel programming_zlobin_lecture11

•

0 likes•610 views

Computer Science Club

Technology

Map/Reduce
Обзор решений
Алексей Злобин
alexey.zlobin@gmail.com

Sample job: driver
public static void main(String[] a) throws Exception {
Configuration conf = new Configuration();
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(a[0]));
FileOutputFormat.setOutputPath(job, new Path(a[1]));
job.waitForCompletion(true);
}

Sample job: mapper
class M extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable k, Text v, Context ctx) {
String line = v.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
ctx.write(word, one);
}
}
}

Sample job: reducer
class R extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text k, Iterable<IntWritable> v, Context ctx)
{
int sum = 0;
for (IntWritable val : v)
sum += val.get();
context.write(k, new IntWritable(sum));
}
}

Pig snippet
raw =
LOAD 'excite.log' USING PigStorage('t') AS (user, time, qry);
clean1 = FILTER raw BY
org.apache.pig.tutorial.NonURLDetector(qry);
clean2 = FOREACH clean1
GENERATE user, time, org.apache.pig.tutorial.ToLower(qr)
as query;

Hive snippet
CREATE TABLE invites
(foo INT, bar STRING) PARTITIONED BY (ds STRING);
LOAD DATA LOCAL
INPATH './examples/files/kv2.txt' OVERWRITE
INTO TABLE invites PARTITION (ds='2008-08-15');
SELECT a.foo
FROM invites a
WHERE a.ds='2008-08-15';
INSERT OVERWRITE DIRECTORY '/tmp/reg_5'
SELECT a.foo, a.bar FROM invites a;

Spark: example
val counts = lines.flatMap(line => line.split(“ “))
.map(word => (word, 1))
.reduceByKey(_ + _)

$Shark example CREATE TABLE src(key INT, value STRING); LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src; SELECT COUNT(1) FROM src; CREATE TABLE src_cached AS SELECT * FROM SRC; SELECT COUNT(1) FROM src_cached;$

Disco example
def fun_map(line, params):
for word in line.split():
yield word, 1
def fun_reduce(iter, params):
for word, counts in kvgroup(sorted(iter)):
yield word, sum(counts)

Disco driver
job = Job().run(
input=["http://discoproject.org/media/text/chekhov.
txt"],
map=map,
reduce=reduce)
for word, count in result_iterator(job.wait(show=True)):
print(word, count)

References I
● “MapReduce: Simplified Data Processing on Large Clusters” Dean, Jeffrey and
Ghemawat, Sanjay
● “A Comparison of Join Algorithms for Log Processing in MapReduce” S. Blanas, J.
Patel, V. Ercegovac, J. Rao, E. Shekita, Y. Tian
● “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing” Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
● “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing
on Large Clusters” Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion
Stoica
● “Shark: Fast Data Analysis Using Coarse-grained Distributed Memory” Cliff Engle,
Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica

References II
● Disco Technical Overview http://disco.readthedocs.org/en/latest/overview.html
● Disco Distributed Filesystem http://disco.readthedocs.org/en/latest/howto/ddfs.html
● An efficient, immutable, persistent mapping object http://discodb.readthedocs.
org/en/latest/

What's hot

Python Seaborn Data Visualization Sourabh Sahu

05. haskell streaming ioSebastian Rettig

R seminar dplyr packageMuhammad Nabi Ahmad

decision tree regressionAkhilesh Joshi

Do something in 5 minutes with gas 1-use spreadsheet as databaseBruce McPherson

Вячеслав Блинов: "Spring Integration as an Integration Patterns Provider"Anna Shymchenko

Codemimidas

Air Quality in Taiwan 2013 Kuang-Hao Cheng

Do something useful in Apps Script 5. Get your analytics pageviews to a sprea...Bruce McPherson

python高级内存管理rfyiamcool

Aerospike Nested CDTs - Meetup Dec 2019Aerospike

Hacking the Internet of Things for Fun & ProfitRuben van Vreeland

Aggregation Framework in MongoDB Overview Part-1Anuj Jain

Do something in 5 with gas 3-simple invoicing appBruce McPherson

Ganga: an interface to the LHC computing gridMatt Williams

Look Mommy, No GC! (TechDays NL 2017)Dina Goldshtein

polynomial linear regressionAkhilesh Joshi

Google Sheets in Python with gspreadJure Cuhalev

Do something in 5 with gas 7-email logBruce McPherson

LINE iOS開発で実践しているGit tipsLINE Corporation

What's hot (20)

Python Seaborn Data Visualization

05. haskell streaming io

R seminar dplyr package

decision tree regression

Do something in 5 minutes with gas 1-use spreadsheet as database

Вячеслав Блинов: "Spring Integration as an Integration Patterns Provider"

Code

Air Quality in Taiwan 2013

Do something useful in Apps Script 5. Get your analytics pageviews to a sprea...

python高级内存管理

Aerospike Nested CDTs - Meetup Dec 2019

Hacking the Internet of Things for Fun & Profit

Aggregation Framework in MongoDB Overview Part-1

Do something in 5 with gas 3-simple invoicing app

Ganga: an interface to the LHC computing grid

Look Mommy, No GC! (TechDays NL 2017)

polynomial linear regression

Google Sheets in Python with gspread

Do something in 5 with gas 7-email log

LINE iOS開発で実践しているGit tips

Similar to 20140427 parallel programming_zlobin_lecture11

MapReduceahmedelmorsy89

Introduction to Scalding and MonoidsHugo Gävert

Introduction to Ragnonchik

"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab

JRubyKaigi2010 Hadoop PapyrusKoichi Fujikawa

Hadoop + Clojureelliando dias

Implementing a many-to-many Relationship with SlickHermann Hueck

Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Robert Metzger

Pune Clojure Course OutlineBaishampayan Ghose

Introduction to Map-Reduce Programming with HadoopDilum Bandara

Modern technologies in data science Chucheng Hsieh

Hadoop_PennonsoftPennonSoft

Hw09 Hadoop + ClojureCloudera, Inc.

Hadoop源码分析 mapreduce部分sg7879

Introducción a hadoopdatasalt

Hadoop - Introduction to mapreduceVibrant Technologies & Computers

RefactoringAmir Barylko

Functional programming using underscorejs偉格高

ggtimeseries-->ggplot2 extensions Dr. Volkan OBAN

Similar to 20140427 parallel programming_zlobin_lecture11 (20)

MapReduce

Introduction to Scalding and Monoids

Introduction to R

"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...

JRubyKaigi2010 Hadoop Papyrus

Hadoop + Clojure

Implementing a many-to-many Relationship with Slick

Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Pune Clojure Course Outline

Introduction to Map-Reduce Programming with Hadoop

Modern technologies in data science

Hadoop_Pennonsoft

Hw09 Hadoop + Clojure

Hadoop源码分析 mapreduce部分

Introducción a hadoop

Hadoop - Introduction to mapreduce

Refactoring

Functional programming using underscorejs

ggtimeseries-->ggplot2 extensions

Recently uploaded

Anypoint Exchange: It’s Not Just a Repo!Manik S Magar

Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada

What is Artificial Intelligence?????????blackmambaettijean

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays

From Family Reminiscence to Scholarly Archive .Alan Dix

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech

DevEX - reference for building teams, processes, and platformsSergiu Bodiu

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos

Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc

Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan

Advanced Computer Architecture – An IntroductionDilum Bandara

Take control of your SAP testing with UiPath Test SuiteDianaGray10

How to write a Business Continuity PlanDatabarracks

Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro

Recently uploaded (20)

Anypoint Exchange: It’s Not Just a Repo!

Moving Beyond Passwords: FIDO Paris Seminar.pdf

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024

What is Artificial Intelligence?????????

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024

"Debugging python applications inside k8s environment", Andrii Soldatenko

From Family Reminiscence to Scholarly Archive .

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

The Ultimate Guide to Choosing WordPress Pros and Cons

DevEX - reference for building teams, processes, and platforms

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)

Nell’iperspazio con Rocket: il Framework Web di Rust!

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy

Generative AI for Technical Writer or Information Developers

Advanced Computer Architecture – An Introduction

Take control of your SAP testing with UiPath Test Suite

How to write a Business Continuity Plan

Unraveling Multimodality with Large Language Models.pdf

20140427 parallel programming_zlobin_lecture11

1. Map/Reduce Обзор решений Алексей Злобин alexey.zlobin@gmail.com

2. Sample job: driver public static void main(String[] a) throws Exception { Configuration conf = new Configuration(); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(a[0])); FileOutputFormat.setOutputPath(job, new Path(a[1])); job.waitForCompletion(true); }

3. Sample job: mapper class M extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable k, Text v, Context ctx) { String line = v.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); ctx.write(word, one); } } }

4. Sample job: reducer class R extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text k, Iterable<IntWritable> v, Context ctx) { int sum = 0; for (IntWritable val : v) sum += val.get(); context.write(k, new IntWritable(sum)); } }

5. Pig snippet raw = LOAD 'excite.log' USING PigStorage('t') AS (user, time, qry); clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(qry); clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(qr) as query;

6. Hive snippet CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING); LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15'); SELECT a.foo FROM invites a WHERE a.ds='2008-08-15'; INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;

7. Spark: example val counts = lines.flatMap(line => line.split(“ “)) .map(word => (word, 1)) .reduceByKey(_ + _)

8. Shark example CREATE TABLE src(key INT, value STRING); LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src; SELECT COUNT(1) FROM src; CREATE TABLE src_cached AS SELECT * FROM SRC; SELECT COUNT(1) FROM src_cached;

9. Disco example def fun_map(line, params): for word in line.split(): yield word, 1 def fun_reduce(iter, params): for word, counts in kvgroup(sorted(iter)): yield word, sum(counts)

10. Disco driver job = Job().run( input=["http://discoproject.org/media/text/chekhov. txt"], map=map, reduce=reduce) for word, count in result_iterator(job.wait(show=True)): print(word, count)

11. References I ● “MapReduce: Simplified Data Processing on Large Clusters” Dean, Jeffrey and Ghemawat, Sanjay ● “A Comparison of Join Algorithms for Log Processing in MapReduce” S. Blanas, J. Patel, V. Ercegovac, J. Rao, E. Shekita, Y. Tian ● “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica ● “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters” Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica ● “Shark: Fast Data Analysis Using Coarse-grained Distributed Memory” Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica

12. References II ● Disco Technical Overview http://disco.readthedocs.org/en/latest/overview.html ● Disco Distributed Filesystem http://disco.readthedocs.org/en/latest/howto/ddfs.html ● An efficient, immutable, persistent mapping object http://discodb.readthedocs. org/en/latest/

20140427 parallel programming_zlobin_lecture11

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 20140427 parallel programming_zlobin_lecture11

Similar to 20140427 parallel programming_zlobin_lecture11 (20)

More from Computer Science Club

More from Computer Science Club (20)

Recently uploaded

Recently uploaded (20)

20140427 parallel programming_zlobin_lecture11