Writing Hadoop Jobs in Scala using Scalding

@tonicebrian

How much storage can
$100 buy you?


1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies, 170,000 songs, 5 million photos

From single drives…

to clusters…

“A mathematician is a
device for turning coffee
into theorems”

Alfréd Rényi

“A data scientist is a
device for turning coffee and data
into insights”

(after Alfréd Rényi)

Hadoop = Map Reduce + Distributed File System

• Map Reduce: programming model
• Distributed File System: storage

Word Count

Raw
Hello cruel world
Say hello! Hello!

Map
Hello cruel world → (hello, 1) (cruel, 1) (world, 1)
Say hello! Hello! → (say, 1) (hello, 2)

Reduce
hello → 1, 2
cruel → 1
world → 1
say → 1

Result
hello 3
cruel 1
world 1
say 1
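The Raw, Map, Reduce and Result stages can be sketched on a single machine with plain Scala collections. This is only an illustration of the data flow, not how Hadoop executes it; the names WordCountSketch, mapPhase, shuffle and reducePhase are hypothetical.

```scala
// Local word count mirroring Raw -> Map -> Reduce -> Result.
// All names here are illustrative assumptions, not Hadoop or Scalding APIs.
object WordCountSketch {
  val raw: List[String] = List("Hello cruel world", "Say hello! Hello!")

  // Lowercase, strip punctuation, split on whitespace.
  def tokenize(line: String): List[String] =
    line.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+").filter(_.nonEmpty).toList

  // Map: each line emits (word, count within that line) pairs.
  def mapPhase(line: String): List[(String, Int)] =
    tokenize(line).groupBy(identity).map { case (w, ws) => (w, ws.size) }.toList

  // Shuffle: collect the emitted counts under each word.
  def shuffle(pairs: List[(String, Int)]): Map[String, List[Int]] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

  // Reduce: sum the counts for each word.
  def reducePhase(grouped: Map[String, List[Int]]): Map[String, Int] =
    grouped.map { case (w, counts) => (w, counts.sum) }

  val result: Map[String, Int] = reducePhase(shuffle(raw.flatMap(mapPhase)))
}
```

On a cluster each mapPhase call could run on a different node, and the shuffle step is what moves the pairs across the network.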

4 Main Characteristics of Scala


• JVM
• Statically Typed
• Object Oriented
• Functional Programming

def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all
elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified
associative binary operator.
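A minimal sketch of these two List methods in use; the object and value names are made up for illustration.

```scala
// map transforms every element; reduce folds the results with an
// associative operator (here, integer addition).
object MapReduceOnList {
  val words: List[String] = List("hello", "cruel", "world")

  // map: String => Int applied to each element.
  val lengths: List[Int] = words.map(_.length)

  // reduce: combine all lengths into a single total.
  val total: Int = lengths.reduce(_ + _)
}
```

Because the operator is associative, partial sums can be computed in any grouping, which is the property that lets a framework split the work across machines.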

Recap
Map/Reduce
• Programming paradigm that employs concepts from Functional Programming
Scala
• Functional language that runs on the JVM
Hadoop
• Open source implementation of Map/Reduce on the JVM

So in what language is Hadoop
implemented?

The Result?
package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts received for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

High level approaches

SQL

Data
Transformations

High level approaches

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

User defined functions (UDF)
Pig

-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Java

package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String>
{
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String)input.get(0);
      return str.toUpperCase();
    } catch(Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}

WordCount in Cascading
package impatient;
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args )
  {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}

Good parts
• Data Flow Programming Model
• User Defined Functions

Bad
• Still Java
• Objects for Flows

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
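The tokenize helper is ordinary Scala, so it can be tried on its own without Scalding or Hadoop. The sketch below copies its logic into a hypothetical TokenizeSketch object, assuming the intended regular expressions are "[^a-zA-Z0-9\\s]" and "\\s+".

```scala
// Stand-alone copy of the job's tokenize logic (TokenizeSketch is a
// made-up name; it is not part of Scalding).
object TokenizeSketch {
  def tokenize(text: String): Array[String] =
    // Lowercase, drop everything except letters, digits and whitespace,
    // then split on runs of whitespace.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
```

One detail worth knowing: an empty line yields a single empty token, so a production job may want to filter empty strings out before counting.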

TDD Cycle
Red → Green → Refactor

Broader view
• Red → Green → Refactor
• Unit Testing
• Acceptance Testing
• Continuous Deployment
• Lean Startup
• …

A typical day working with Hadoop

Is Scalding of any help here?

0 Size of code
1 Types
2 Unit Testing
3 Local execution

An extra cycle
• Compilation Phase
• Unit Testing
• Acceptance Testing
• Continuous Deployment
• Lean Startup

Static typechecking makes
you a better
programmer™

Fail-fast with type errors

(Int,Int,Int,Int)
TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z

=> type error
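One way the wrapper types could be written is sketched below. The Meters and Miles case classes are assumptions made for illustration (Scalding does not ship them); each one only defines addition with its own type, so mixing units is rejected at compile time.

```scala
// Hypothetical unit wrappers: addition is defined only within one unit.
object UnitsSketch {
  final case class Meters(value: Int) {
    def +(other: Meters): Meters = Meters(value + other.value)
  }
  final case class Miles(value: Int) {
    def +(other: Miles): Miles = Miles(value + other.value)
  }

  // Same unit on both sides: compiles and adds the wrapped values.
  val total: Meters = Meters(5) + Meters(5)

  // Mixed units: uncommenting the next line makes compilation fail,
  // because + on Meters expects a Meters argument.
  // val mixed = Meters(5) + Miles(5)
}
```

The mistake is caught before any job is submitted, rather than after a long run on the cluster.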

How do you test a distributed
algorithm without a distributed
platform?

// Scalding
import com.twitter.scalding._
// Assumption: the Specification trait comes from specs2.
import org.specs2.mutable.Specification

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      // In-memory input: (offset, line) pairs instead of a real file.
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      // Capture what the job would write and assert on it.
      sink[(String,Int)](Tsv("outputFile")){ outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}

SBT as a REPL

> run-main com.twitter.scalding.Tool MyJob --local
> run-main com.twitter.scalding.Tool MyJob --hdfs

More Scalding goodness
• Algebird
• Matrix library
