The presentation given by Chris Severs and myself at the Bay Area Scala Enthusiasts meetup. http://www.meetup.com/Bay-Area-Scala-Enthusiasts/events/105409962/
3. Stuff you will see today …
Different types of data scientists – Comparison of different
approaches to develop machine learning flows
Code
The five tool tool – Why Scala (and its ecosystem) is the best tool to
develop machine learning flows (Hint: MapReduce is functional)
Some more code
Machine Learning examples – Real life (well … almost) examples
of different machine learning problems
Even more code
4. “Good data scientists understand, in a deep
way, that the heavy lifting of cleanup and
preparation is not something that gets in the
way of solving the problem – it is the problem!”
DJ Patil – Founding member of the LinkedIn data science team
5. The data funnel
Real data is an awful, terrible mess
Cleaning often is a process of operating on data, excluding some
data, bucketing data and calculating aggregates about the data
Generate: map, flatMap, for
Exclude: filter
Bucket: group, groupBy, groupWith
Aggregate: sum, reduce, foldLeft
These blocks form the basis of most data flows
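A minimal sketch of the four blocks on plain Scala collections, with nothing Hadoop-specific in it. The `Listing` record and the sample lines are made up for illustration.

```scala
object DataFunnel {
  // Hypothetical record type, used only for this illustration.
  final case class Listing(title: String, category: String, price: Double)

  val raw: Seq[String] = Seq(
    "red shoes,fashion,25.0",
    "blue shoes,fashion,35.0",
    "garbage line",
    "old phone,electronics,80.0"
  )

  // Generate: flatMap parses each line and drops the ones that do not have three fields.
  val parsed: Seq[Listing] = raw.flatMap { line =>
    line.split(",") match {
      case Array(title, category, price) => Some(Listing(title, category, price.toDouble))
      case _                             => None
    }
  }

  // Exclude: filter keeps only the listings under a price threshold.
  val affordable: Seq[Listing] = parsed.filter(_.price < 50.0)

  // Bucket: groupBy collects the remaining listings per category.
  val byCategory: Map[String, Seq[Listing]] = affordable.groupBy(_.category)

  // Aggregate: sum the prices inside each bucket.
  val totalByCategory: Map[String, Double] =
    byCategory.map { case (category, listings) => category -> listings.map(_.price).sum }
}
```

Each step takes a collection and returns a new one, which is why the same shape carries over to distributed flows later in the talk.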
8. The Mixer Word Count
#wordcount.py
from org.apache.pig.scripting import *

@outputSchema("b: bag{ w: chararray}")
def tokenize(words):
    return words.split(" ")

script = """
A = load './input.txt';
B = foreach A generate flatten(tokenize((chararray)$0)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into './wordcount' using AvroStorage("schema");
"""
Pig.compile(script).bind().runSingle()

{"schema": {
  "type": "record",
  "name": "WordCount",
  "fields": [
    {
      "name": "word",
      "type": "string"
    },
    {
      "name": "count",
      "type": "int"
    }]}}
9. The Mixer Data Scientist
Too many occurrences of code inside strings
Three different languages inside a single file
User Defined Functions (UDFs) vs. Language Support
Not real Python, but Jython (which is missing some libraries)
This is just a simple word count!
10. The Mixer Data Scientist
Pig is great at extract, transform, load (ETL)
… as long as you want to use a function that is already part of the
included library
… or you get someone else to write it for you (hello, DataFu!)
Realistically you will need to maintain a Pig code base and a code
base in some language which can run on the JVM
Pig Latin is a bit funky, missing a lot of core programming language
features
Pig Latin is interpreted so you get (limited) type and syntax
checking only at runtime
12. The Expert Word Count
hadoop fs -get input.txt input.txt
cp /mnt/hadoop/input.txt ~/MyProjects/WordCount/input.txt
#!/usr/bin/perl
use strict;
use warnings;
my %count_of;
while (my $line = <>) { # read from file or STDIN
    foreach my $word (split /\s+/, $line) {
        $count_of{$word}++;
    }
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
    print "'$word': $count_of{$word}\n";
}
__END__
13. The Scalable Expert – Hadoop Streaming
Lets you use any language you want.
Same issues as Java MapReduce with regards to multiple passes,
complicated joins, etc.
Always reading from stdin and writing to stdout.
Easy to test out on local data
– cat myfile.txt | mymapper.sh | sort | myreducer.sh
Actual data may not be as nice. No type checking on input or output
can lead to problems.
The main reason to do this is so you can use a nice interpreted
language to do your processing.
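To keep the examples in one language, here is a rough sketch of a streaming-style word count in Scala: a mapper that emits tab-separated `word, 1` records and a reducer that sums runs of identical keys from sorted input. The object name, the `map` command-line argument and the record layout are assumptions for illustration, not something Hadoop Streaming prescribes.

```scala
object StreamingWordCount {
  // Mapper: one "word<TAB>1" record per token of an input line.
  def mapLine(line: String): Seq[String] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => s"$word\t1")

  // Reducer: expects records sorted by key (as `sort` or the shuffle provides)
  // and sums the counts of each run of identical keys.
  def reduceSorted(records: Iterator[String]): Iterator[String] = {
    val buffered = records.buffered
    new Iterator[String] {
      def hasNext: Boolean = buffered.hasNext
      def next(): String = {
        val key = buffered.head.split("\t")(0)
        var sum = 0L
        while (buffered.hasNext && buffered.head.split("\t")(0) == key) {
          sum += buffered.next().split("\t")(1).toLong
        }
        s"$key\t$sum"
      }
    }
  }

  // Acts as the mapper when called with the argument "map", as the reducer otherwise.
  def main(args: Array[String]): Unit = {
    val in = scala.io.Source.stdin.getLines()
    val out =
      if (args.headOption.contains("map")) in.flatMap(mapLine)
      else reduceSorted(in)
    out.foreach(println)
  }
}
```

Everything crossing stdin and stdout is a string, so a record without a tab or with a non-numeric count only fails at runtime inside the reducer. That is the missing type checking the slide warns about.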
15. The Craftsman Word Count
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
16. The Craftsman Data Scientist
If you like Java it works fine
… until you want to do more than one pass, a complicated join or
anything fancy.
Cascading solves many of these problems for you but it is still very
verbose
17. We need a better tool
A five tool tool!
http://en.wikipedia.org/wiki/Five-tool_player
http://en.wikipedia.org/wiki/Willie_Mays
18. The Pragmatic Data Scientist
Agile – Iterates quickly
Productive - Uses the right tool for the right job
Correct - Tests as much as he can before the job is even submitted
Scalable – Can handle real world problems
Simple - Single language to represent Operations, UDFs and Data
25. Scalability – works on more than your machine
Integrates with Hadoop (more than just streaming)
Has the support of scalable libraries
Parallel by design – not just for M/R flows
26. Simplicity
Paco Nathan, Evil Mad Scientist, Concurrent Inc., @pacoid, says:
– “[Scalding] code is compact, simple to understand”
– “nearly 1:1 between elements of conceptual flow diagram and function
calls”
– “Cascalog and Scalding DSLs leverage the functional aspects of
MapReduce, helping to limit complexity in process”
Scala is a functional tool for a fundamentally functional job
28. Let’s count some words
This is the “Hello, World!” of anything tangentially related to
Hadoop.
Let’s try it in Scala first without any Hadoop stuff.
val myLines : Seq[String] = ... // get some stuff
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x=>x.size)
Now write out the words somehow
val countedWords = myLines.flatMap(_.split("\\s+"))
.groupBy(identity)
.mapValues(_.size)
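A self-contained version of the local word count that can be pasted into a file and run. The object name and the sample sentences are ours. It uses `map` over the grouped pairs instead of `mapValues`, because newer Scala versions deprecate `mapValues` and return a lazy view from it.

```scala
object LocalWordCount {
  // Split on runs of whitespace, drop empty tokens, then count each distinct word.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Print the counts, most frequent word first, ties broken alphabetically.
  def main(args: Array[String]): Unit =
    countWords(Seq("to be or not to be", "be quick"))
      .toSeq
      .sortBy { case (word, count) => (-count, word) }
      .foreach { case (word, count) => println(s"$word\t$count") }
}
```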
29. Let’s count a lot of words
I’ve gone to the trouble of rewriting this example to run in Hadoop.
Here it is:
val myLines : TypedPipe[String] = TextLine(args("input"))
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValueStream(x => Iterator(x.size))
We can make this even better.
val countedWords = myWordsGrouped.size
countedWords.write(TypedTsv[(String,Long)](output))
30. Something for nothing
Other people have already done the hard work to make the
previous example run
The previous example is using Scalding, a Scala library to write
(mainly) Hadoop MapReduce jobs.
https://github.com/twitter/scalding
It even has its own Twitter account, @scalding
Created by:
– Avi Bryant @avibryant
– Oscar Boykin @posco
– Argyris Zymnis @argyris
Tweet them now and tell them how awesome it is
… I’ll wait
31. Side by side comparison of local and Hadoop
Local (Scala collections)             Hadoop (Scalding)
val myWords =                         val myWords =
  myLines.flatMap(w =>                  myLines.flatMap(w =>
    w.split("\\s+"))                      w.split("\\s+"))
val myWordsGrouped =                  val myWordsGrouped =
  myWords.groupBy(identity)             myWords.groupBy(identity)
val countedWords =                    val countedWords =
  myWordsGrouped.                       myWordsGrouped.
    mapValues(x=>x.size)                  size
There are some small differences, mainly due to how the
underlying Hadoop process needs to happen.
32. Why does this work?
Scala has support for embedded domain specific languages (DSLs)
Scalding includes a couple DSLs for specifying Cascading (and by
extension Hadoop) workflows.
Info about Cascading: http://www.cascading.org/
One of the Scalding DSLs, the Typed one, is designed to be very
close to the standard Scala collections API
It’s not a perfect mapping due to how Cascading and Hadoop work,
but in general it is very easy to write your code locally, change a
couple small bits, and have it run on a Hadoop cluster
Scalding also has a local mode if you want the syntactic sugar
without fussing with Hadoop
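One language feature behind such embedded DSLs is the implicit conversion, which lets a library bolt collection-style methods onto a type it does not own. A toy sketch follows; the `Pipe` and `RichPipe` here are stand-ins written for this example, not the Cascading or Scalding classes, and `flatMapTo` and `countByValue` are invented method names.

```scala
import scala.language.implicitConversions

object TinyDsl {
  // Stand-in for a low-level pipeline type that we cannot modify (hypothetical).
  final case class Pipe(records: List[String])

  // Wrapper that adds collection-style operations on top of Pipe.
  final class RichPipe(val pipe: Pipe) {
    def flatMapTo(f: String => Seq[String]): Pipe = Pipe(pipe.records.flatMap(f))
    def countByValue: Map[String, Int] =
      pipe.records.groupBy(identity).map { case (k, v) => k -> v.size }
  }

  // The implicit conversion is what makes the extra methods appear on Pipe.
  implicit def pipeToRichPipe(pipe: Pipe): RichPipe = new RichPipe(pipe)

  // Reads like collection code, although Pipe itself defines neither method.
  val counts: Map[String, Int] =
    Pipe(List("a b", "b c")).flatMapTo(_.split("\\s+").toSeq).countByValue
}
```

The compiler inserts the conversion wherever a `Pipe` is asked for a method it lacks, so user code keeps the shape of ordinary collection code.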
33. DSLs for everyone!
We’re showing you Scalding in this talk, but there are others that
are similar.
– Scoobi: https://github.com/NICTA/scoobi
– Scrunch: https://github.com/cloudera/crunch/tree/master/scrunch
All three attempt to make code written against Scala collections
work (almost) seamlessly in Hadoop.
More on DSLs: http://www.scala-lang.org/node/1403
Some guts:
34. Fields based DSL
From com.twitter.scalding.Dsl
/**
 * This object has all the implicit functions and values that are used
 * to make the scalding DSL.
 *
 * It's useful to import Dsl._ when you are writing scalding code outside
 * of a Job.
 */
object Dsl extends FieldConversions with TupleConversions
    with GeneratedTupleAdders with java.io.Serializable {
  implicit def pipeToRichPipe(pipe : Pipe) : RichPipe = new RichPipe(pipe)
  implicit def richPipeToPipe(rp : RichPipe) : Pipe = rp.pipe
}
35. Typed DSL
From com.twitter.scalding.TDsl
/** implicits for the type-safe DSL
 * import TDsl._ to get the implicit conversions from Grouping/CoGrouping to Pipe,
 * to get the .toTypedPipe method on standard cascading Pipes.
 * to get automatic conversion of Mappable[T] to TypedPipe[T]
 */
object TDsl extends Serializable with GeneratedTupleAdders {
  implicit def pipeTExtensions(pipe : Pipe) : PipeTExtensions = new PipeTExtensions(pipe)
  implicit def mappableToTypedPipe[T](mappable : Mappable[T])
    (implicit flowDef : FlowDef, mode : Mode, conv : TupleConverter[T]) : TypedPipe[T] = {
    TypedPipe.from(mappable)(flowDef, mode, conv)
  }
}
36. Algebird – It’s like algebra and a bird
We did something fancy in the previous example:
val countedWords = myWordsGrouped.size
expands to
val countedWords = myWordsGrouped.mapValues(x => 1L).sum
which in turn amounts to
val countedWords = myWordsGrouped.mapValues(x => 1L)
  .reduce((l, r) => implicitly[Monoid[Long]].plus(l, r))
Scalding uses Algebird extensively to make your life easier.
Algebird can also be used outside of Scalding with no trouble.
Algebird has your favorite things like monoids, monads, bloom
filters, count-min sketches, hyperloglogs, etc.
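For anyone who has not met monoids before, a hand-rolled sketch of the idea. The `Monoid` trait below is written for this example and is much smaller than Algebird's; it only shows why one generic `sum` can both count things and merge maps of counts.

```scala
object MonoidSketch {
  // A monoid is a type with an associative `plus` and an identity element `zero`.
  // Hand-rolled for illustration; Algebird ships its own, richer Monoid.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Maps form a monoid whenever their values do: merge key by key.
  implicit def mapMonoid[K, V](implicit v: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      val zero: Map[K, V] = Map.empty
      def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
        r.foldLeft(l) { case (acc, (k, x)) =>
          acc.updated(k, v.plus(acc.getOrElse(k, v.zero), x))
        }
    }

  // One generic sum covers plain counting and merging of distributions alike.
  def sum[T](items: Iterable[T])(implicit m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)

  val total: Long = sum(Seq(1L, 2L, 3L))
  val merged: Map[String, Long] =
    sum(Seq(Map("shoes" -> 1L), Map("shoes" -> 1L, "phones" -> 1L)))
}
```

Because `plus` is associative, partial sums can be computed on different machines and combined in any grouping, which is what makes this a good fit for MapReduce.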
37. Counting words with some extra information
Sometimes we want to know some information about the contexts
that words occurred in. At eBay, this is often the category that a
term appeared in.
Let’s count words and calculate the entropy of the category
distribution for each word.
– If you’re unfamiliar with this type of entropy just think of it as a
measure of how concentrated the distribution is.
– If you really like formulas it is: $H = -\sum_i p(x_i) \log p(x_i)$
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
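The entropy calculation on its own, as a small local sketch. It assumes the distribution arrives as raw counts per category id and uses the natural logarithm; the object and method names are ours.

```scala
object CategoryEntropy {
  // Shannon entropy (natural log) of a distribution given as raw counts per category.
  def entropy(counts: Map[Int, Long]): Double = {
    val total = counts.values.sum.toDouble
    // Zero counts are skipped, since log(0) is undefined and they contribute nothing.
    val plogp = counts.values.filter(_ > 0).map { c =>
      val p = c / total
      p * math.log(p)
    }
    -plogp.sum
  }
}
```

A word seen in a single category gets entropy 0, and a word spread evenly over two categories gets log 2, so lower values mean a more concentrated distribution.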
38. More code
case class MyAvroOutput(word: String, count: Long,
  entropy: Double) extends AvroRecord

TypedTsv[(String,Int)]
  .flatMap{ case(line,cat) =>
    line.split("\\s+").map(x => (x, Map(cat->1L))) }
  .group
  .sum
  .map{ case(word, dist) =>
    val total = dist.values.sum.toDouble
    val entropy = (-1)*dist.values.map{ count =>
      (count/total)*math.log(count/total) }.sum
    MyAvroOutput(word, total.toLong, entropy)
  }
  .write(PackedAvroSource[MyAvroOutput](output))
Math is great
41. Titanic II case study
We want to sell life insurance to passengers of Titanic II
All we have is data from Titanic I
We have to be able to explain why we charge the prices we do
(damn regulators!)
http://commons.wikimedia.org
42. Titanic I Data
Cabin class – e.g. 1st, 2nd, 3rd ..
Name – String
Age – Integer
Embark place – String
Destination – String
Room – Integer
Ticket – Integer
Gender – Male or Female
44. Classifier code
object Titanic {
  def main(args: Array[String]) = {
    // parse data
    val reader = new CSVReader(new FileReader(
      "src/main/data/titanic.csv"))
    val passengers = reader.readAll.tail.map(Passenger(_))
    val instances = passengers.map(_.getInstance).toSet
    // build tree
    val treeBuilder = new TreeBuilder
    val tree = treeBuilder.buildTree(instances)
    // print tree
    tree.dump(System.out)
  }
}
47. Motivation
eBay, like any large site, has a massive number of unique queries
every day
Identifying groups of queries based on user behavior might help us
to understand the individual queries better
For queries we are unsure of we can even try and match them into
a cluster that contains queries we know a lot about.
We can use behavioral things like:
– number of searches
– number of clicks
– number of subsequent bids, buys
– number of exits
– etc
48. Let’s use Mahout
Apache Mahout, http://mahout.apache.org/, @ApacheMahout, is a
powerful machine learning and data mining library that works with
Hadoop.
It has a ton of great stuff in it, but many of the drawbacks of using
Java MapReduce apply.
It uses some proprietary data formats (is your data in
VectorWritable SequenceFiles?)
Luckily for us, there are some nice things that work as standalone
pieces.
Coming in release 0.8, there is an excellent single pass k-means
clustering algorithm we can use.
49. Let’s use Mahout, inside Scalding
lazy val clust = new StreamingKMeans(
  new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
  args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])
var count = 0
val sloppyClusters =
  TextLine(args("input"))
    .map{ str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent
    }
    .toPipe('centroids)
    // This won't work with the current build, coming soon though
    .unorderedFoldTo[StreamingKMeans,Centroid]('centroids->'clusters)(clust){(cl,cent) =>
      cl.cluster(cent); cl}
    .toTypedPipe[StreamingKMeans](Dsl.intFields(Seq(0)))
    .flatMap(c => c.iterator.asScala.toIterable)
50. Let’s use Mahout, inside Scalding
val finalClusters = sloppyClusters.groupAll
.mapValueStream{centList =>
lazy val bclusterer = new BallKMeans(new BruteSearch(
new EuclideanDistanceMeasure),
args("numclusters").toInt, 100)
bclusterer.cluster(centList.toList.asJava)
bclusterer.iterator.asScala
}
.values
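To give a feel for what a k-means clusterer does, here is a plain Lloyd's k-means on in-memory vectors. This is a swapped-in textbook algorithm, not Mahout's StreamingKMeans or BallKMeans, and the naive seeding with the first k points is an assumption made to keep the sketch deterministic.

```scala
object LloydKMeans {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(_.sum / points.size).toVector

  // Index of the centroid closest to p.
  def nearest(centroids: Seq[Point], p: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), p))

  // Plain Lloyd iterations, seeded with the first k points (a naive choice).
  def cluster(points: Seq[Point], k: Int, iterations: Int): Seq[Point] = {
    var centroids: Seq[Point] = points.take(k)
    for (_ <- 1 to iterations) {
      val assigned = points.groupBy(p => nearest(centroids, p))
      // A centroid that attracts no points keeps its previous position.
      centroids = centroids.indices.map(i => assigned.get(i).map(mean).getOrElse(centroids(i)))
    }
    centroids
  }
}
```

Each iteration needs a full pass over the data, which is exactly what hurts on Hadoop and why the single-pass streaming step followed by a small in-memory refinement is attractive.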
51. Results
These are primarily eBay head queries. Remember that the
clustering algorithm knows nothing about the text in the query.
Sample groups:
– chanel, tory burch, diamond ring, kathy van zeeland handbags, ...
– ipad 4th generation, samsung galaxy s iii, iphone 4 s, nexus 4, ipad
mini, ...
– kohls coupons, lowes coupons
– jcrew, cole haan, diesel, banana republic, gucci, burberry, brooks
brothers, …
– ferrari, utility trailer, polaris ranger, porsche 911, dump truck, bmw
m3, chainsaw, rv, chevelle, vw bus, dodge charger, ...
– paypal account, ebay.com, apple touch icon precomposed.png, paypal,
undefined, ps3%2520games, michael%2520kors
52. Clustering Takeaway
There are some excellent libraries that exist, and even fit the
functional model
Scala and Scalding will help you work around the rough edges and
integrate them into your data flow, rather than having to create new
data flows
Being able to prototype locally and in the Scala REPL saves
massive amounts of developer time
53. Matrix API case study
Using LinkedIn endorsement data
to rank Scala experts
62. Stuff you have seen today …
There are many ways to develop machine learning programs; none
of them is perfect
Scala, which reflects the years of evolution since Java's
invention, and Scalding, which is the equivalent for vanilla MapReduce,
are a much better alternative
Machine learning is fun and not necessarily complicated
I have been in this room on 3 special occasions:
– Just this last Friday, when some of the most influential people in our industry, like Jeff Weiner and Reid Hoffman, judged our internal mini startup contest called Incubator
– When Bryan Stevenson, the person who got the longest standing ovation at TED, gave a private talk to LinkedIn employees
– And when we celebrated our successful year by everyone getting new iPads
And let me tell you, I have never seen this room so full. So thank you all for coming, and I promise you that this talk won’t be nearly as exciting as those occasions.
There are a lot of ways to develop MR flows. When I came to LinkedIn I saw 3 different patterns …
Simple here is as opposed to complex (see Rich Hickey’s talk “Simple Made Easy”)
Simple is the key message: if you had to take away a single point from the entire talk, this would be it.
----- Meeting Notes (3/11/13 17:16) -----
This is the Facebook user object, well, part of it. Avro schemas can get ridiculously big