This document discusses Scala and big data technologies. It provides an overview of Scala libraries for working with Hadoop and MapReduce, including Scalding which provides a Scala DSL for Cascading. It also covers Spark, a cluster computing framework that operates on distributed datasets in memory for faster performance. Additional Scala projects for data analysis using functional programming approaches on Hadoop are also mentioned.
1. SCALA + BIG DATA
PARIS SCALA MEETUP, 05/29/2013
Sam BESSALAH
2. Outline
Scala in the Hadoop World
Hadoop and Map Reduce Basics
Scalding
A word about other Scala DSLs: Scrunch and Scoobi
Spark and Co.
Spark
Spark Streaming
More projects using Scala for Data Analysis
7. MapReduce
A programming model for expressing distributed computations
at massive scale
An execution framework for organizing and performing those
computations in an efficient and fault-tolerant way
Bundled within the Hadoop framework
8. MapReduce redux ..
Implements two functions at a high level:
Map(k1, v1) → List(k2, v2)
Reduce(k2, List(v2)) → List(k3, v3)
The framework takes care of all the plumbing and the
distribution, sorting, shuffling ...
Values with the same key flow to the same reducer
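To make those signatures concrete, here is a minimal sketch of word count expressed as plain Scala functions following the Map and Reduce shapes above (no Hadoop types, just the model):

// Word count in the MapReduce model, as plain Scala functions.
// map: (docId, line) -> list of (word, 1)
def map(k1: String, v1: String): List[(String, Int)] =
  v1.split("\\s+").toList.map(word => (word, 1))

// reduce: (word, all counts emitted for that word) -> list of (word, total)
def reduce(k2: String, v2: List[Int]): List[(String, Int)] =
  List((k2, v2.sum))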
12. Way too long for a simple word count
This gave birth to new tools like Hive or Pig
Pig: a scripting language for dataflows
text = LOAD 'text' USING TextLoader();
tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
wordcount = FOREACH (GROUP tokens BY word) GENERATE
    group AS word,
    COUNT_STAR($1) AS ct;
13. Cascading
Open source project created by Chris Wensel, now developed at
Concurrent
Written in Java, revolves around the concept of Pipes, or data
flows, eventually transformed into MapReduce jobs
14.
Cascading changes the MR programming model to a generic data
flow oriented programming model
A Flow is composed of a Source, a Sink and a Pipe to connect
them
A Pipe is a set of transformations over the input data
Pipes can be combined to create more complex workflows
Contains a flow optimizer that converts a user data flow into an
optimized data flow, which can in turn be converted into an efficient
MapReduce job
We can think of pipes as distributed collections
17. But ...
- Cascading makes use of FP idioms.
- Functions are wrapped in objects
- Constructors (new) define composition
between pipes
- The MapReduce paradigm itself derives from FP
Why not use functional programming?
18. SCALDING
- A Scala DSL on top of Cascading
- Open source project developed at Twitter
by Avi Bryant (@avibryant),
Oscar Boykin (@posco),
Argyris Zymnis (@argyris)
- https://github.com/twitter/scalding
19. Scalding
- Two APIs:
* Fields API: primary API, using Cascading
Fields; dynamic, with errors at runtime
* Type-safe API: uses Scala types, errors at
compile time. We'll focus on this one
- Both can be bridged using pipe.Typed and
TypedPipe.from
24. Grouping and Mapping
GroupBuilder:
Builder-pattern object that operates over groups of rows in a
pipe.
Helps build several parallel aggregations (counting,
summing, ...) in one pass. Awesome for stream aggregation.
Used for groupBy; adds fields which are reductions of existing
ones (sketch below).
mapReduceMap: map-side aggregation, derived from Cascading,
using combiners instead of reducers.
Gotcha: doesn't work with foldLeft, which is pushed to reducers
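As an illustration, a hedged Fields-API sketch of two aggregations computed in one pass with GroupBuilder; the pipe, the field names ('word, 'score, 'count, 'total) and the exact aggregation signatures are illustrative and vary across Scalding versions:

// Group by 'word and run two reductions over each group in a single pass
val counted = pipe.groupBy('word) { group =>
  group
    .size('count)                   // number of rows per word
    .sum[Double]('score -> 'total)  // sum of the 'score field per word
}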
25. Type-safe API
Two concepts:
TypedPipe[T]
- Wraps a Cascading Pipe object. Instances are distributed on the cluster,
and transformations occur on top of them.
- Similar interface to scala.collection.Iterator[T]
KeyedList[K,V]
- Sharding of key-value objects. Two implementations:
Grouped[K,V]: usual grouping on key K
CoGrouped[K,V,W,Result]: a cogroup over two grouped
pipes, used for joins.
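For a feel of the typed API, a minimal word-count sketch; the input/output arguments and the surrounding Job boilerplate are assumptions, not from the slides:

import com.twitter.scalding._

// Minimal word count with the Scalding typed API
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))   // TypedPipe[String], one element per input line
    .flatMap(_.split("""\s+"""))            // split each line into words
    .groupBy(identity)                      // Grouped[String, String]
    .size                                   // count occurrences of each word
    .write(TypedTsv[(String, Long)](args("output")))
}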
26. Optimized Joins
joinWithTiny: map-side joins (sketch below)
Left-side asymmetric join with a smaller pipe.
Uses Cascading's HashJoin, a non-blocking asymmetric join where
the smaller side fits in memory.
blockJoinWithSmaller: performs a block join, by replicating data.
skewJoinWithSmaller|Larger: deals with skewed pipes
crossWithTiny: cross product with a moderately sized
pipe; can create a huge output.
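For instance, a hedged Fields-API sketch of a map-side join against a small pipe; the pipe and field names are made up for the example:

// 'userId from the large events pipe is matched against 'userId2 from a small
// users pipe that fits in memory on each mapper (Cascading HashJoin underneath)
val joined = eventsPipe.joinWithTiny('userId -> 'userId2, usersPipe)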
29. MATRIX API
Generic Matrix API built using abstract algebra (monoids, rings, ...)
Value operations: mapValues, filterValues, binarizeAs
Vector operations: getRow, reduceRowVectors, ...
mapRows, rowL2Normalize, rowMeanCentering, ...
Usual matrix operations: transpose, product, ...
pipe.toMatrix, pipe.flatMapToMatrix(fields) with a mapping function, ...
31. Scalding is not the only Scala DSL for MR
- Scrunch
Built on top of Crunch, an MR pipelining
library in Java developed at Cloudera.
- Scoobi, built at NICTA
Same idea as Crunch, except fully written in
Scala; uses distributed lists (DList) to mimic
pipelines.
34. Spark
In-Memory Interactive and Real time
Analytics for Large DataSets
Sam Bessalah
@samklr
Adapted Slides from Matei Zaharia, UC Berkeley
35. What is Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency through:
In-memory computing primitives
General computation graphs
Up to 100× faster
Improves usability through:
Rich APIs in Java, Scala, Python
Interactive shell
Often 2-10× less code
36. Key Idea
Work with distributed collections as if they were local
Concept: Resilient Distributed Datasets (RDDs)
- Immutable collections of objects spread across a
cluster
- Built through parallel transformations (map, filter, etc)
- Automatically rebuilt on failure
- Controllable persistence (like caching in RAM)
37. Example: Log Mining
Load error messages from a log into memory,
then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()
[Diagram: the Driver ships tasks to Workers; each Worker reads an HDFS block, builds the base and transformed RDDs, keeps its partition of cachedMsgs in cache, and returns results to the Driver]

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
38. Fault Tolerance
RDDs track lineage information that can be used
to efficiently reconstruct lost partitions
Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))
[Lineage: HDFS File --filter(func = _.contains(...))--> Filtered RDD --map(func = _.split(...))--> Mapped RDD]
39. Spark in Java and Scala
Java API:
JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count();

Scala API:
val lines = spark.textFile(…)
val errors = lines.filter(s => s.contains("ERROR"))
// can also write filter(_.contains("ERROR"))
errors.count
40. Which Language Should I Use?
Standalone programs can be written in any of them, but
the console is only Python & Scala
Python developers: can stay with Python for both
Java developers: consider using Scala for the console
(to learn the API)
Performance: Java / Scala will be faster (statically
typed), but Python can do well for numerical work
with NumPy
41. Scala Cheat Sheet
Variables:
var x: Int = 7
var x = 7       // type inferred
val y = "hi"    // read-only

Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = {
  x*x  // last line returned
}

Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)  // => Array(3, 4, 5)
nums.map(x => x + 2)         // => same
nums.map(_ + 2)              // => same
nums.reduce((x, y) => x + y) // => 6
nums.reduce(_ + _)           // => 6
42. Learning Spark
Easiest way: Spark interpreter (spark-shell or
pyspark)
Special Scala and Python consoles for cluster use
Runs in local mode on 1 thread by default, but can
control with MASTER environment var:
MASTER=local ./spark-shell # local, 1 thread
MASTER=local[2] ./spark-shell # local, 2 threads
MASTER=spark://host:port ./spark-shell # Spark standalone cluster
43. First Stop: SparkContext
Main entry point to Spark functionality
Created for you in Spark shells as variable sc
In standalone programs, you'd make your own
(see later for details)
44. Creating RDDs
# Turn a local collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
45. Basic Transformations
nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x*x) # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0) # => {4}

# Map each element to zero or more others
nums.flatMap(lambda x: range(0, x)) # => {0, 0, 1, 0, 1, 2}
# range(0, x) is a sequence of numbers 0, 1, ..., x-1
46. Basic Actions
nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]

# Return first K elements
nums.take(2) # => [1, 2]

# Count number of elements
nums.count() # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
47. Working with Key-Value Pairs
Spark's "distributed reduce" transformations act on RDDs
of key-value pairs

Python: pair = (a, b)
        pair[0] # => a
        pair[1] # => b

Scala:  val pair = (a, b)
        pair._1 // => a
        pair._2 // => b

Java:   Tuple2 pair = new Tuple2(a, b); // class scala.Tuple2
        pair._1 // => a
        pair._2 // => b
48. Some Key-Value Operations
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

pets.groupByKey()
# => {(cat, Seq(1, 2)), (dog, Seq(1))}

pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side
50. Closure Mishap Example
class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
// Throws NotSerializableException: MyCoolRddApp (or Log),
// because the closure captures `this` in order to read param

How to get around it:
class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param        // references only a local variable instead of this.param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}
55. Data Locality
First run: data not in cache, so use HadoopRDD’s
locality prefs (from HDFS)
Second run: FilteredRDD is in cache, so use its
locations
If something falls out of cache, go back to HDFS
56. Broadcast Variables
When one creates a broadcast variable b with a
value v, v is saved to a file in a shared file
system. The serialized form of b is a path to this
file. When b’s value is queried on a worker
node, Spark first checks whether v is in a local
cache, and reads it from the file system if it isn’t.
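For usage (implementation details aside), a minimal Scala sketch of creating and reading a broadcast variable; the lookup table and the lines RDD are assumptions for the example:

// Broadcast a small lookup table once, then read it inside tasks via .value
val severity = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))

val levels = lines.map { line =>
  val tag = line.split(" ")(0)
  severity.value.getOrElse(tag, 0)  // workers read the broadcast value from their local cache
}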
57. Accumulators
Each accumulator is given a unique ID when it is
created. When the accumulator is saved, its
serialized form contains its ID and the “zero” value
for its type.
On the workers, a separate copy of the accumulator
is created for each thread that runs a task using
thread-local variables, and is reset to zero when a
task begins. After each task runs, the worker
sends a message to the driver program containing
the updates it made to various accumulators.
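A minimal usage sketch (Scala API of that era; lines is an assumed RDD of tab-separated records): workers add to the accumulator inside tasks, and only the driver reads its value:

// Count malformed records seen while parsing, without a separate pass over the data
val badRecords = sc.accumulator(0)

val parsed = lines.flatMap { line =>
  val fields = line.split('\t')
  if (fields.length < 3) { badRecords += 1; None }  // worker-side update, sent back with task results
  else Some(fields)
}

parsed.count()              // an action must run before the updates are visible
println(badRecords.value)   // only the driver reads .value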
58. Scheduling Process
rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)

RDD Objects: build the operator DAG
DAGScheduler: splits the graph into stages of tasks and submits each stage as ready (agnostic to operators!)
TaskScheduler: launches TaskSets via the cluster manager, retries failed or straggling tasks (doesn't know about stages)
Worker: executes tasks on threads, stores and serves blocks via its Block manager
60. Example: FilteredRDD
partitions = same as parent RDD
dependencies = “one-to-one” on parent
compute(partition) = compute parent and filter it
preferredLocations(part) = none (ask parent)
partitioner = none
61. Example: JoinedRDD
partitions = one per reduce task
dependencies = “shuffle” on each parent
compute(partition) = read and join shuffled data
preferredLocations(part) = none
partitioner = HashPartitioner(numTasks)
Spark will now know this data is hashed!
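To connect these two examples, a hedged, simplified sketch of the interface every RDD implementation fills in (conceptual only; the real Spark classes differ in names and details):

// Conceptual sketch of the per-RDD interface the examples above instantiate
trait Partition { def index: Int }
trait Dependency                                       // e.g. one-to-one vs. shuffle
trait Partitioner { def getPartition(key: Any): Int }  // e.g. HashPartitioner(numTasks)

trait SimpleRDD[T] {
  def partitions: Array[Partition]                             // how the data is split
  def dependencies: Seq[Dependency]                            // parents and how they are consumed
  def compute(split: Partition): Iterator[T]                   // how to produce one partition
  def preferredLocations(split: Partition): Seq[String] = Nil  // locality hints
  def partitioner: Option[Partitioner] = None                  // set when data is hash/range partitioned
}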
63. DAG Scheduler
Interface: receives a “target” RDD, a function to
run on each partition, and a listener for results
Roles:
Build stages of Task objects (code + preferred loc.)
Submit them to TaskScheduler as ready
Resubmit failed stages if outputs are lost
64. Scheduler Optimizations
Pipelines narrow ops. within a stage
Picks join algorithms based on partitioning (minimize shuffles)
Reuses previously cached data

[Diagram: an operator DAG (map, union, groupBy, join) over RDDs A-G, split into Stages 1-3; previously computed partitions are skipped]
73. K-Means Algorithm
[Scatter plot: data points on Feature 1 vs. Feature 2]
• Initialize K cluster centers
• Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster's data points.
74-77. K-Means Algorithm (same figure, code shown incrementally)
• Initialize K cluster centers:
centers = data.takeSample(false, K, seed)
• Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster's data points.

78-80. K-Means Algorithm (continued)
• Assign each data point to the cluster with the closest center:
centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
• Assign each cluster center to be the mean of its cluster's data points.
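The slides stop at the assignment step; here is a hedged sketch of how the remaining update step could look in Spark's Scala API (closestPoint, the Array[Double] point representation and the addVectors helper are assumptions, not from the slides):

// hypothetical helper: element-wise sum of two points represented as Array[Double]
def addVectors(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x + y }

// One K-means update step after `closest` has been computed:
// group points by their assigned center and move each center to the mean of its points.
val newCenters = closest
  .groupByKey()                           // (centerIndex, all points assigned to it)
  .mapValues { points =>
    val n = points.size
    points.reduce(addVectors).map(_ / n)  // mean of the cluster's points
  }
  .collectAsMap()                         // small result, collected back to the driver
// repeat: recompute `closest` with the new centers until they stop moving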
90. Why PageRank?
Good example of a more complex algorithm
Multiple stages of map & reduce
Benefits from Spark’s in-memory caching
Multiple iterations over the same data
91. Basic Idea
Give pages ranks (scores) based on links to them
Links from many pages → high rank
Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
92. Algorithm
[Graph figure: four pages, each starting at rank 1.0]
1. Start each page at a rank of 1
2. On each iteration, have page p contribute
rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
93-97. Algorithm (iterations)
The same three steps repeat while the graph figure shows the ranks converging:
after one iteration the ranks are 0.58, 1.0, 1.85, 0.58; after further iterations 0.39, 1.72, 1.31, 0.58; . . .
Final state: 0.46, 1.37, 1.44, 0.73
98. Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
101. What is Spark Streaming?
Framework for large scale stream processing
Scales to 100s of nodes
Can achieve second scale latencies
Integrates with Spark’s batch and interactive processing
Provides a simple batch-like API for implementing complex algorithms
Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
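A hedged sketch of the basic setup (package names and constructor details vary across early Spark versions; the socket source and the 1-second batch interval are assumptions for the example):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a streaming context that chops the live stream into 1-second batches
val ssc = new StreamingContext("local[2]", "StreamingExample", Seconds(1))

// Ingest a text stream from a TCP socket and count words per batch
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()  // start receiving and processing data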
103. Requirements
Scalable to large clusters
Second-scale latencies
Simple programming model
Integrated with batch & interactive processing
104. Stateful Stream Processing
Traditional streaming systems have an
event-driven, record-at-a-time
processing model
Each node has mutable state
For each record, update state & send
new records
State is lost if a node dies!
Making stateful stream processing
fault-tolerant is challenging

[Diagram: input records flow through nodes 1-3, each holding mutable state]
105. Existing Streaming Systems
Storm
Replays a record if it is not processed by a node
Processes each record at least once
May update mutable state twice!
Mutable state can be lost due to failure!
Trident – uses transactions to update state
Processes each record exactly once
Per-state transaction updates are slow
106. Requirements
Scalable to large clusters
Second-scale latencies
Simple programming model
Integrated with batch & interactive processing
Efficient fault-tolerance in stateful computations
107. Discretized Stream Processing
Run a streaming computation as a series
of very small, deterministic batch jobs

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs
and processes them using RDD operations
Finally, the processed results of the RDD
operations are returned in batches
108. Discretized Stream Processing
Run a streaming computation as a series
of very small, deterministic batch jobs

[Same diagram as above]

Batch sizes as low as ½ second, latency ~ 1 second
Potential for combining batch processing
and streaming processing in the same system
109. Example – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of RDDs representing a stream of data

[Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, @ t+1, @ t+2, ...) is stored in memory as an RDD (immutable, distributed)]
110. Example – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap applied to each batch of the tweets DStream produces the hashTags DStream; new RDDs are created for every batch, e.g. [#cat, #dog, …]]
111. Example – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: to push data to external storage

[Diagram: for each batch of the hashTags DStream, flatMap is followed by a save; every batch is saved to HDFS]
112. Java Example
Scala
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })  // Function object to define the transformation
hashTags.saveAsHadoopFiles("hdfs://...")
113. Fault-tolerance
RDDs remember the sequence
of operations that created them from
the original fault-tolerant input
data
Batches of input data are
replicated in memory of multiple
worker nodes, and are therefore
fault-tolerant
Data lost due to worker failure can
be recomputed from the input data

[Diagram: input data replicated in memory; lost partitions of the hashTags RDD are recomputed on other workers from the tweets RDD via flatMap]
114. Key concepts
DStream – sequence of RDDs representing a stream of data
Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
Transformations – modify data from one DStream to another
Standard RDD operations – map, countByValue, reduce, join, …
Stateful operations – window, countByValueAndWindow, … (sketch below)
Output Operations – send data to external entity
saveAsHadoopFiles – saves to HDFS
foreach – do anything with each batch of results
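A hedged sketch of one stateful windowed operation (the window and slide durations are assumptions; hashTags is the DStream from the earlier example, and Seconds comes from the streaming imports):

// Count hashtags seen over the last 60 seconds, recomputed every 10 seconds
val windowedCounts = hashTags.countByValueAndWindow(Seconds(60), Seconds(10))
windowedCounts.print()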
115. Example 2 – Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()

[Diagram: for each batch (@ t, @ t+1, @ t+2, ...), countByValue expands into flatMap, map and reduceByKey steps, turning the tweets DStream into the tagCounts DStream, e.g. [(#cat, 10), (#dog, 25), ...]]
117. Fault-tolerant Stateful Processing
State data is not lost even if a worker node dies
Does not change the value of your result
Exactly-once semantics for all transformations
No double counting!
118. Other Interesting Operations
Maintaining arbitrary state, tracking sessions
Maintain per-user mood as state, and update it with his/her tweets
tweets.updateStateByKey(tweet => updateMood(tweet))
Do arbitrary Spark RDD computation within a DStream
Join incoming tweets with a spam file to filter out bad tweets
tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
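For the stateful case, a hedged sketch of what an updateStateByKey call could look like for a running per-hashtag count (the checkpoint directory and the update function are assumptions; hashTags is the DStream from the earlier examples):

// Stateful operations need a checkpoint directory to store state reliably (path is hypothetical)
ssc.checkpoint("hdfs://.../checkpoints")

// Keep a running count per hashtag across all batches seen so far
val runningCounts = hashTags
  .map(tag => (tag, 1))
  .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
    Some(state.getOrElse(0) + newValues.sum)
  }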
119. Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
Tested with 100 streams of data on 100 EC2 instances with 4 cores each