Writing Hadoop Jobs in Scala using Scalding
@tonicebrian
How much storage can $100 buy you?

1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies, 170,000 songs, 5 million photos
From single drives…

to clusters…
Data Science
“A mathematician is a
device for turning coffee
into theorems”

Alfréd Rényi
“A data scientist is a
device for turning coffee and data
into insights”

adapted from Alfréd Rényi
Hadoop = Distributed File System (Storage) + Map/Reduce (Program Model)
Word Count

Raw
  Hello cruel world
  Say hello! Hello!

Map
  Hello cruel world  ->  (hello, 1) (cruel, 1) (world, 1)
  Say hello! Hello!  ->  (say, 1) (hello, 2)

Reduce
  hello  ->  1, 2
  cruel  ->  1
  world  ->  1
  say    ->  1

Result
  hello  3
  cruel  1
  world  1
  say    1
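The stages of the Word Count slide can be sketched with plain Scala collections, without Hadoop. This is only an in-memory illustration: the object name `WordCountSketch` and the helpers `mapPhase`, `shuffle` and `reducePhase` are hypothetical names chosen for this sketch, not part of any library, and the map phase is assumed to count repeated words per line, as the slide does with `(hello, 2)`.

```scala
object WordCountSketch {
  // Map phase: each input line independently emits (word, count) pairs.
  // Words repeated within one line are counted together, as on the slide.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase
      .split("[^a-z0-9]+")
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.length) }
      .toSeq

  // Shuffle: gather every count emitted for the same word.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: sum the counts collected for each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, counts) => (word, counts.sum) }

  // Whole pipeline: map every line, shuffle by word, reduce by summing.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))
}
```

Calling `WordCountSketch.wordCount(Seq("Hello cruel world", "Say hello! Hello!"))` runs the two raw lines from the slide through all three stages.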
4 Main Characteristics of Scala

• JVM
• Statically Typed
• Object Oriented
• Functional Programming
def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all
elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified
associative binary operator.
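A minimal sketch of these two methods on a standard `List`. The object name `MapReduceOnLists` and the sample words are assumptions made for illustration only.

```scala
object MapReduceOnLists {
  val words: List[String] = List("hello", "cruel", "world")

  // map: apply a function to every element, producing a new list.
  val lengths: List[Int] = words.map(_.length)

  // reduce: combine the elements pairwise with an associative operator.
  val totalLength: Int = lengths.reduce(_ + _)
}
```

Because `+` is associative, the partial sums could be computed in any grouping, which is the property that lets a reduce step be split across machines.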
Recap

Map/Reduce
• Programming paradigm that employs concepts from Functional Programming

Scala
• Functional Language that runs on the JVM

Hadoop
• Open Source Implementation of MR in the JVM
So in what language is Hadoop
implemented?
The Result?
package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token of every input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts received for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires mapper, reducer, formats and paths into a job.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
High level approaches

SQL

Data
Transformations
High level approaches

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count,
    group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
User defined functions (UDF)

Pig

-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray,
    age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Java

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String>
{
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String)input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}
WordCount in Cascading

package impatient;

import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args )
  {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}
Good parts
• Data Flow Programming Model
• User Defined Functions

Bad parts
• Still Java
• Objects for Flows
package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
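The `tokenize` helper is plain Scala, so its behaviour can be checked without Scalding or a cluster. The sketch below copies it into a standalone object; the object name `TokenizeSketch` and the `nonEmptyTokens` helper are hypothetical additions for illustration and are not part of the job above.

```scala
object TokenizeSketch {
  // Same logic as the job's tokenize: lowercase, strip punctuation,
  // then split on runs of whitespace.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  // Hypothetical helper: blank or indented lines make split return an
  // empty first token, which a job would otherwise count as a word.
  def nonEmptyTokens(text: String): Array[String] =
    tokenize(text).filter(_.nonEmpty)
}
```

For instance, `TokenizeSketch.tokenize("Say hello! Hello!")` yields the three tokens `say`, `hello`, `hello`, while an empty line yields a single empty token unless it is filtered out.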
TDD Cycle
Red → Green → Refactor
Broader view
Feedback cycles: Red / Green / Refactor, Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup, …
Big Data

Big Speed
A typical day working with Hadoop
Is Scalding of any help here?

0. Size of code
1. Types
2. Unit Testing
3. Local execution
1. Types
An extra cycle
Feedback cycles: Compilation Phase, Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup
Static typechecking makes you a better programmer™
Fail-fast with type errors
(Int,Int,Int,Int)
TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z => type error
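One way the wrapper types on this slide could be defined is with small case classes. This is a sketch under assumptions: the slides do not show the definitions of `Meters` or `Miles`, so the classes and their `+` methods below are hypothetical.

```scala
object UnitsSketch {
  // Each unit gets its own type; + only accepts the same unit.
  final case class Meters(value: Int) {
    def +(other: Meters): Meters = Meters(value + other.value)
  }
  final case class Miles(value: Int) {
    def +(other: Miles): Miles = Miles(value + other.value)
  }

  val w = Meters(5)
  val x = Miles(5)

  // Same unit on both sides: this compiles.
  val total: Meters = w + Meters(5)

  // Mixed units are rejected by the compiler, before any job runs:
  // val broken = w + x   // type mismatch: found Miles, required Meters
}
```

With bare `Int` fields the same mistake would only surface as a wrong number in the job output, possibly hours into a cluster run.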
2. Unit Testing
How do you test a distributed
algorithm without a distributed
platform?
Source

Tap
// Specs2 (assumed to be on the test classpath)
import org.specs2.mutable.Specification
// Scalding
import com.twitter.scalding._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      // In-memory stand-in for the input file: (offset, line) pairs.
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      // Captures what the job would have written to the output file.
      sink[(String,Int)](Tsv("outputFile")){ outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}
3. Local Execution
HDFS

Local
SBT as a REPL

> run-main com.twitter.scalding.Tool MyJob --local
> run-main com.twitter.scalding.Tool MyJob --hdfs
More Scalding goodness

Algebird

Matrix library
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding

Contenu connexe

Tendances

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Parallel-Ready Java Code: Managing Mutation in an Imperative Language
Parallel-Ready Java Code: Managing Mutation in an Imperative LanguageParallel-Ready Java Code: Managing Mutation in an Imperative Language
Parallel-Ready Java Code: Managing Mutation in an Imperative LanguageMaurice Naftalin
 
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...Maurice Naftalin
 
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science Chucheng Hsieh
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Sparksamthemonad
 
Scalding Presentation
Scalding PresentationScalding Presentation
Scalding PresentationLandoop Ltd
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water MeetupSri Ambati
 
HBase RowKey design for Akka Persistence
HBase RowKey design for Akka PersistenceHBase RowKey design for Akka Persistence
HBase RowKey design for Akka PersistenceKonrad Malawski
 
Shooting the Rapids: Getting the Best from Java 8 Streams
Shooting the Rapids: Getting the Best from Java 8 StreamsShooting the Rapids: Getting the Best from Java 8 Streams
Shooting the Rapids: Getting the Best from Java 8 StreamsMaurice Naftalin
 
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupLandoop Ltd
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling WaterSri Ambati
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...CloudxLab
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopMax Tepkeev
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Samir Bessalah
 
あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法x1 ichi
 

Tendances (20)

Scala+data
Scala+dataScala+data
Scala+data
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Parallel-Ready Java Code: Managing Mutation in an Imperative Language
Parallel-Ready Java Code: Managing Mutation in an Imperative LanguageParallel-Ready Java Code: Managing Mutation in an Imperative Language
Parallel-Ready Java Code: Managing Mutation in an Imperative Language
 
Shooting the Rapids
Shooting the RapidsShooting the Rapids
Shooting the Rapids
 
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Pe...
 
Modern technologies in data science
Modern technologies in data science Modern technologies in data science
Modern technologies in data science
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
Scalding Presentation
Scalding PresentationScalding Presentation
Scalding Presentation
 
Sparkling Water Meetup
Sparkling Water MeetupSparkling Water Meetup
Sparkling Water Meetup
 
HBase RowKey design for Akka Persistence
HBase RowKey design for Akka PersistenceHBase RowKey design for Akka Persistence
HBase RowKey design for Akka Persistence
 
Shooting the Rapids: Getting the Best from Java 8 Streams
Shooting the Rapids: Getting the Best from Java 8 StreamsShooting the Rapids: Getting the Best from Java 8 Streams
Shooting the Rapids: Getting the Best from Java 8 Streams
 
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London Meetup
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling Water
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and Hadoop
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
 
あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法
 
Let's Get to the Rapids
Let's Get to the RapidsLet's Get to the Rapids
Let's Get to the Rapids
 

Similaire à Writing Hadoop Jobs in Scala using Scalding

Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusKoichi Fujikawa
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinPietro Michiardi
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiUnmesh Baile
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Crossing the Bridge: Connecting Rails and your Front-end Framework
Crossing the Bridge: Connecting Rails and your Front-end FrameworkCrossing the Bridge: Connecting Rails and your Front-end Framework
Crossing the Bridge: Connecting Rails and your Front-end FrameworkDaniel Spector
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseSages
 
Atlassian Groovy Plugins
Atlassian Groovy PluginsAtlassian Groovy Plugins
Atlassian Groovy PluginsPaul King
 
Create & Execute First Hadoop MapReduce Project in.pptx
Create & Execute First Hadoop MapReduce Project in.pptxCreate & Execute First Hadoop MapReduce Project in.pptx
Create & Execute First Hadoop MapReduce Project in.pptxvishal choudhary
 
2007 09 10 Fzi Training Groovy Grails V Ws
2007 09 10 Fzi Training Groovy Grails V Ws2007 09 10 Fzi Training Groovy Grails V Ws
2007 09 10 Fzi Training Groovy Grails V Wsloffenauer
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 

Similaire à Writing Hadoop Jobs in Scala using Scalding (20)

Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
 
Having Fun with Play
Having Fun with PlayHaving Fun with Play
Having Fun with Play
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
 
JS everywhere 2011
JS everywhere 2011JS everywhere 2011
JS everywhere 2011
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbai
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Crossing the Bridge: Connecting Rails and your Front-end Framework
Crossing the Bridge: Connecting Rails and your Front-end FrameworkCrossing the Bridge: Connecting Rails and your Front-end Framework
Crossing the Bridge: Connecting Rails and your Front-end Framework
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
 
Atlassian Groovy Plugins
Atlassian Groovy PluginsAtlassian Groovy Plugins
Atlassian Groovy Plugins
 
Create & Execute First Hadoop MapReduce Project in.pptx
Create & Execute First Hadoop MapReduce Project in.pptxCreate & Execute First Hadoop MapReduce Project in.pptx
Create & Execute First Hadoop MapReduce Project in.pptx
 
2007 09 10 Fzi Training Groovy Grails V Ws
2007 09 10 Fzi Training Groovy Grails V Ws2007 09 10 Fzi Training Groovy Grails V Ws
2007 09 10 Fzi Training Groovy Grails V Ws
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 

Dernier

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Dernier (20)

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Writing Hadoop Jobs in Scala using Scalding

  • 1. Writing Hadoop Jobs in Scala using Scalding (@tonicebrian)
  • 2. How much storage can $100 buy you?
  • 3. How much storage can $100 buy you? 1980: 1 photo
  • 4. How much storage can $100 buy you? 1980: 1 photo; 1990: 5 songs
  • 5. How much storage can $100 buy you? 1980: 1 photo; 1990: 5 songs; 2000: 7 movies
  • 6. How much storage can $100 buy you? 1980: 1 photo; 1990: 5 songs; 2000: 7 movies; 2010: 600 movies, 170,000 songs, or 5 million photos
  • 10. “A mathematician is a device for turning coffee into theorems” Alfréd Rényi
  • 11. “A data scientist is a device for turning coffee into theorems” (after Alfréd Rényi)
  • 12. “A data scientist is a device for turning coffee and data into theorems” (after Alfréd Rényi)
  • 13. “A data scientist is a device for turning coffee and data into insights” (after Alfréd Rényi)
  • 18. Word Count. Raw: “Hello cruel world”, “Say hello! Hello!”
  • 19. Word Count. Raw → Map: “Hello cruel world” → (hello, 1), (cruel, 1), (world, 1); “Say hello! Hello!” → (say, 1), (hello, 2)
  • 20. Word Count. Raw → Map → Reduce: hello → [1, 2]; cruel → [1]; world → [1]; say → [1]
  • 21. Word Count. Raw → Map → Reduce → Result: hello 3; cruel 1; world 1; say 1
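The word-count walk-through on slides 18 to 21 can be sketched on plain Scala collections, with no Hadoop involved. This is only an illustration of the two phases: the object and method names (`WordCountSketch`, `mapPhase`, `reducePhase`) are assumptions made for this example and do not come from the deck.

```scala
object WordCountSketch {
  // Map phase: each input line emits (word, count-within-that-line) pairs,
  // mirroring the slide where the second line emits (hello, 2).
  def mapPhase(line: String): Map[String, Int] =
    line.toLowerCase
      .split("[^a-z]+")            // split on anything that is not a letter
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.length }

  // Reduce phase: sum the partial counts that share the same key.
  def reducePhase(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  // Full pipeline: map every line, then reduce the partial results.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(lines.map(mapPhase))
}
```

Calling `WordCountSketch.wordCount(Seq("Hello cruel world", "Say hello! Hello!"))` gives the same totals as the Result column on slide 21.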
  • 29. 4 Main Characteristics of Scala: JVM
  • 30. 4 Main Characteristics of Scala: JVM; Statically Typed
  • 31. 4 Main Characteristics of Scala: JVM; Statically Typed; Object Oriented
  • 32. 4 Main Characteristics of Scala: JVM; Statically Typed; Object Oriented; Functional Programming
  • 33. def map[B](f: (A) ⇒ B): List[B]
      Builds a new collection by applying a function to all elements of this list.
      def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
      Reduces the elements of this list using the specified associative binary operator.
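As a small illustration of the two `List` methods quoted on slide 33, the sketch below applies `map` and then `reduce` to a three-word list. The object name and the sample data are made up for this example.

```scala
object MapReduceOnList {
  val words: List[String] = List("hello", "cruel", "world")

  // map: apply a function to every element, producing a new list.
  val lengths: List[Int] = words.map(_.length)

  // reduce: combine the elements pairwise with an associative operator.
  val total: Int = lengths.reduce(_ + _)
}
```

Because `+` is associative, the order in which partial sums are combined does not affect the result, which is the property that lets a reduce step be distributed.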
  • 34. Recap
  • 35. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming
  • 36. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming. Scala: functional language that runs on the JVM
  • 37. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming. Scala: functional language that runs on the JVM. Hadoop: open source implementation of MR on the JVM
  • 38. So in what language is Hadoop implemented?
  • 41. The Result?
      package org.myorg;

      import java.io.IOException;
      import java.util.*;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
        }
      }
  • 43. High level approaches
      input_lines = LOAD 'myfile.txt' AS (line:chararray);
      words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      filtered_words = FILTER words BY word MATCHES '\\w+';
      word_groups = GROUP filtered_words BY word;
      word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
      ordered_word_count = ORDER word_count BY count DESC;
      STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
  • 44. User defined functions (UDF)
      Pig:
      -- myscript.pig
      REGISTER myudfs.jar;
      A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
      B = FOREACH A GENERATE myudfs.UPPER(name);
      DUMP B;

      Java:
      package myudfs;

      import java.io.IOException;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.impl.util.WrappedIOException;

      public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0)
            return null;
          try {
            String str = (String) input.get(0);
            return str.toUpperCase();
          } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
          }
        }
      }
  • 47. WordCount in Cascading
      package impatient;

      import java.util.Properties;
      import cascading.flow.Flow;
      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.operation.aggregator.Count;
      import cascading.operation.regex.RegexFilter;
      import cascading.operation.regex.RegexSplitGenerator;
      import cascading.pipe.Each;
      import cascading.pipe.Every;
      import cascading.pipe.GroupBy;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.Scheme;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;
      import cascading.tuple.Fields;

      public class Main {
        public static void main( String[] args ) {
          String docPath = args[ 0 ];
          String wcPath = args[ 1 ];

          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps
          Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
          Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

          // specify a regex operation to split the "document" text lines into a token stream
          Fields token = new Fields( "token" );
          Fields text = new Fields( "text" );
          RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
          // only returns "token"
          Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

          // determine the word counts
          Pipe wcPipe = new Pipe( "wc", docPipe );
          wcPipe = new GroupBy( wcPipe, token );
          wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef()
            .setName( "wc" )
            .addSource( docPipe, docTap )
            .addTailSink( wcPipe, wcTap );

          // write a DOT file and run the flow
          Flow wcFlow = flowConnector.connect( flowDef );
          wcFlow.writeDOT( "dot/wc.dot" );
          wcFlow.complete();
        }
      }
  • 48. Good parts: Data Flow Programming Model; User Defined Functions
  • 49. Good parts: Data Flow Programming Model; User Defined Functions. Bad parts: Still Java; Objects for Flows
  • 51. package com.twitter.scalding.examples

      import com.twitter.scalding._

      class WordCountJob(args : Args) extends Job(args) {
        TextLine( args("input") )
          .flatMap('line -> 'word) { line : String => tokenize(line) }
          .groupBy('word) { _.size }
          .write( Tsv( args("output") ) )

        // Split a piece of text into individual words.
        def tokenize(text : String) : Array[String] = {
          // Lowercase each word and remove punctuation.
          text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
        }
      }
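The shape of the Scalding job on slide 51 (flatMap over lines, group by word, take the group size) can be tried on ordinary Scala collections. The sketch below is not Scalding code and needs no Hadoop or Scalding dependency. It reuses the slide's `tokenize` logic, assuming the regular expressions were meant to contain `\\s`. The names `TokenizeSketch` and `countWords` are hypothetical.

```scala
object TokenizeSketch {
  // Same logic as the slide's tokenize: lowercase, drop punctuation,
  // then split on runs of whitespace.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  // Collections analogue of flatMap('line -> 'word) followed by
  // groupBy('word) { _.size }.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toList)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

One detail worth knowing: a line that starts with whitespace yields an empty first token, because `String.split` keeps a leading empty string. A real job might want to filter those out.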
  • 56. A typical day working with Hadoop
  • 64. Is Scalding of any help here?
  • 65. Is Scalding of any help here? 0. Size of code
  • 66. Is Scalding of any help here? 0. Size of code; 1. Types
  • 67. Is Scalding of any help here? 0. Size of code; 1. Types; 2. Unit Testing
  • 68. Is Scalding of any help here? 0. Size of code; 1. Types; 2. Unit Testing; 3. Local execution
  • 73. Static typechecking makes you a better programmer™
  • 74. Fail-fast with type errors (Int,Int,Int,Int)
  • 75. Fail-fast with type errors (Int,Int,Int,Int) TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
  • 76. Fail-fast with type errors (Int,Int,Int,Int) TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
      val w = 5; val x = 5; val y = 5; val z = 5
      w + x + y + z = 20
  • 77. Fail-fast with type errors (Int,Int,Int,Int) TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
      val w = 5; val x = 5; val y = 5; val z = 5
      w + x + y + z = 20
      val w = Meters(5); val x = Miles(5); val y = Celsius(5); val z = Fahrenheit(5)
      w + x + y + z => type error
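A minimal sketch of the idea on slides 74 to 77: wrapping raw numbers in unit types turns a mixed-unit sum into a compile error. The slides name `Meters` and `Miles` but do not show their definitions, so the case classes, the `+` method, and the `toMeters` conversion below are assumptions made for illustration.

```scala
object UnitsSketch {
  // Hypothetical wrapper types; addition is only defined between like units.
  final case class Meters(value: Double) {
    def +(other: Meters): Meters = Meters(value + other.value)
  }

  final case class Miles(value: Double) {
    // Explicit conversion (1 international mile = 1609.344 m).
    def toMeters: Meters = Meters(value * 1609.344)
  }

  val w = Meters(5)
  val x = Miles(5)

  // val bad = w + x            // rejected by the compiler: Miles is not Meters
  val ok: Meters = w + x.toMeters
}
```

With plain `Int` fields the same mistake compiles and silently produces a meaningless number. With the wrappers it fails before the job is ever submitted to a cluster.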
  • 79. How do you test a distributed algorithm without a distributed platform?
  • 83. // Specs2
      import org.specs2.mutable.Specification

      // Scalding
      import com.twitter.scalding._

      class WordCountTest extends Specification with TupleConversions {
        "A WordCount job" should {
          JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
            arg("input", "inputFile").
            arg("output", "outputFile").
            source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
            sink[(String,Int)](Tsv("outputFile")){ outputBuffer =>
              val outMap = outputBuffer.toMap
              "count words correctly" in {
                outMap("hack") must be_==(4)
                outMap("and") must be_==(1)
              }
            }.
            run.
            finish
        }
      }
  • 88. SBT as a REPL
      > run-main com.twitter.scalding.Tool MyJob --local
      > run-main com.twitter.scalding.Tool MyJob --hdfs