1. © Cloudera, Inc. All rights reserved.
Tips for Writing ETL Pipelines with Spark
Imran Rashid | Cloudera, Apache Spark PMC
2.
Outline
• Quick Refresher
• Tips for Pipelines
• Spark Performance
• Using the UI
• Understanding Stage Boundaries
• Baby photos
3.
About Me
• Member of the Spark PMC
• User of Spark from v0.5 at Quantifind
• Built ETL pipelines, prototype to production
• Supported Data Scientists
• Now work on Spark full time at Cloudera
4.
RDDs: Resilient Distributed Dataset
• Data is distributed into partitions spread across a cluster
• Each partition is processed independently and in parallel
• Logical view of the data – not materialized
Image from Dean Wampler, Typesafe
5.
Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
6.
Cheap!
• No serialization
• No IO
• Pipelined
Expensive!
• Serialize Data
• Write to disk
• Transfer over
network
• Deserialize Data
7.
Compare to MapReduce Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
9.
Pipelines get complicated
• Pipelines get messy
• Input data is messy
• Things go wrong
• Never fast enough
• Need stability for months to years
• Need forecasting / capacity planning
(Diagram: the pipeline accretes contributors over time – Alice, one year ago; Bob, 6 months ago; Connie, 3 months ago; Derrick, last month; Alice, last week)
10.
Design Goals
• Modularity
• Error Handling
• Understand where and how things fail
11.
Catching Errors (1)
sc.textFile(…).map { line =>
  // blows up with a parse exception
  parse(line)
}

sc.textFile(…).flatMap { line =>
  // now we're safe, right?
  Try(parse(line)).toOption
}

How many errors? 1 record? 100 records? 90% of our data?
12.
Catching Errors (2)
val parseErrors = sc.accumulator(0L)
val parsed = sc.textFile(…).flatMap { line =>
  Try(parse(line)) match {
    case Success(s) => Some(s)
    case Failure(f) =>
      parseErrors += 1
      None
  }
}
// parseErrors is still 0 here – no action has run yet
if (parseErrors.value > 500) fail(…)
// and what if we want to see those errors?
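The pitfall can be seen without a cluster. Here is a plain Scala sketch (an Iterator standing in for the RDD; the names are illustrative) of why the counter reads 0 before any action runs:

```scala
// flatMap over an RDD is lazy, just like over an Iterator: the counter is
// only bumped once something actually consumes the data.
var parseErrors = 0L
val lines = Iterator("good", "bad", "good")
val parsed = lines.flatMap { line =>
  if (line == "good") Some(line)
  else { parseErrors += 1; None }
}
println(parseErrors)              // 0 – nothing has been consumed yet
val materialized = parsed.toList  // the "action"
println(parseErrors)              // 1
```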
13.
Catching Errors (3)
• Accumulators break the RDD abstraction
• You care about when an action has taken place
• Force an action, or pass error handling on
• SparkListener to deal w/ failures
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4#file-accumulatorlistener-scala

case class ParsedWithErrorCounts(parsed: RDD[LogLine], errors: Accumulator[Long])

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrorCounter = sc.accumulator(0L).setName("parseErrors")
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrorCounter += 1
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrorCounter)
}
14.
Catching Errors (4)
• Accumulators can give you "multiple outputs"
• Create a sample of error records
• You can look at them for debugging
• WARNING: accumulators are not scalable

class ReservoirSample[T] {...}

class ReservoirSampleAccumulableParam[T]
  extends AccumulableParam[ReservoirSample[T], T] {...}

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrors = sc.accumulable(new ReservoirSample[String](100))(…)
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrors += line
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrors)
}
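A reservoir sample keeps a uniform random sample of up to `maxSize` elements from a stream of unknown length. A minimal sketch in plain Scala (an assumption for illustration, not the exact class from the gist above):

```scala
import scala.util.Random

class ReservoirSample[T](maxSize: Int, seed: Long = 42L) {
  private val rng = new Random(seed)
  private val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  private var seen = 0L

  def add(item: T): Unit = {
    seen += 1
    if (buf.size < maxSize) {
      buf += item  // reservoir not full yet: always keep
    } else {
      // keep the new item with probability maxSize / seen,
      // evicting a uniformly chosen current resident
      val idx = java.lang.Math.floorMod(rng.nextLong(), seen)
      if (idx < maxSize) buf(idx.toInt) = item
    }
  }

  def sample: Seq[T] = buf.toList
}

val rs = new ReservoirSample[Int](10)
(1 to 1000).foreach(rs.add)
println(rs.sample.size)  // 10
```

A real AccumulableParam would also need an addInPlace that merges two reservoirs coming from different partitions; that part is elided here.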
15.
Catching Errors (5)
• What if, instead, we just filter out each condition?
• Beware deep pipelines
• E.g., RDD.randomSplit
(Diagram: Huge Raw Data → Filter / FlatMap → …parsed, Error 1, Error 2)
16.
Modularity with RDDs
• Who is caching what?
• What resources should each component use?
• What assumptions are made on inputs?
17.
Win By Cheating
• Fastest way to shuffle a lot of data:
• Don’t shuffle
• Second fastest way to shuffle a lot of data:
• Shuffle a small amount of data
• ReduceByKey
• Approximate Algorithms
• Same as MapReduce
• Bloom filters, HyperLogLog, t-digest
• Joins with Narrow Dependencies
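As a flavor of the approximate approach, here is a toy Bloom filter in plain Scala (illustrative only – in practice you would use a library implementation of the structures named above). A Bloom filter can answer "definitely not present" cheaply, so most non-matching records can be dropped before a shuffle or join:

```scala
import scala.util.hashing.MurmurHash3

class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)
  // derive k hash positions by varying the murmur seed
  private def positions(s: String): Seq[Int] =
    (0 until numHashes).map { i =>
      java.lang.Math.floorMod(MurmurHash3.stringHash(s, i), numBits)
    }
  def add(s: String): Unit = positions(s).foreach(bits.set)
  // false positives are possible; false negatives are not
  def mightContain(s: String): Boolean = positions(s).forall(bits.get)
}

val interestingUsers = new ToyBloomFilter(1 << 16, 4)
Seq("user1", "user7").foreach(interestingUsers.add)
println(interestingUsers.mightContain("user1"))  // true
println(interestingUsers.mightContain("user2"))  // false (with high probability)
```

Broadcast such a filter to the executors and filter the big side of a join with it: the records that pass (including the rare false positives) are all that ever reach the shuffle.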
18.
ReduceByKey when Possible
• ReduceByKey allows a map-side combine
• Data is merged together before it's serialized & sent over the network
• GroupByKey transfers all the data
• Higher serialization and network transfer costs

parsed
  .map { line => (line.level, 1) }
  .reduceByKey { (a, b) => a + b }
  .collect()

parsed
  .map { line => (line.level, 1) }
  .groupByKey
  .map { case (word, counts) => (word, counts.sum) }
  .collect()
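The difference can be sketched in plain Scala (local collections standing in for partitions; no cluster involved): reduceByKey combines within each partition first, so only one (key, partialCount) pair per key per partition crosses the shuffle, while groupByKey ships every record.

```scala
val partitions: Seq[Seq[String]] = Seq(
  Seq("INFO", "WARN", "INFO"),
  Seq("INFO", "ERROR", "INFO"))

// reduceByKey-style: combine locally, then merge the small partial maps
val partials: Seq[Map[String, Int]] =
  partitions.map(_.groupBy(identity).map { case (k, v) => (k, v.size) })
val reduced: Map[String, Int] = partials.flatten
  .groupBy(_._1)
  .map { case (k, kvs) => (k, kvs.map(_._2).sum) }

// what would cross the "network": 4 partial pairs vs. all 6 raw records
println(partials.map(_.size).sum)    // 4
println(partitions.map(_.size).sum)  // 6
println(reduced)                     // counts: INFO -> 4, WARN -> 1, ERROR -> 1
```

With skewed keys (lots of INFO lines), the gap between "one partial pair per key per partition" and "every record" gets much larger.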
19.
But I need groupBy
• E.g., incoming transaction logs from users
• 10 TB of historical data
• 50 GB of new data each day
(Diagram: Historical Logs + Day 1 logs + Day 2 logs + Day 3 logs → Grouped Logs)
20.
Using Partitioners for Narrow Joins
• Sort the Historical Logs once
• Each day, sort the small new data
• Join – narrow dependency
• Write data to HDFS
• Day 2 – now what?
  • SPARK-1061
  • Read from HDFS
  • "Remember" the data was written with a partitioner
(Diagram: Wide Join vs. Narrow Join)
21.
Assume Partitioned
• Day 2 – now what?
• SPARK-1061
• Read from HDFS
• "Remember" the data was written with a partitioner

// Day 1
val myPartitioner = …
val historical =
  sc.hadoopFile("…/mergedLogs/2015/05/19", …)
    .partitionBy(myPartitioner)
val newData =
  sc.hadoopFile("…/newData/2015/05/20", …)
    .partitionBy(myPartitioner)
val grouped = historical.cogroup(newData)
grouped.saveAsHadoopFile("…/mergedLogs/2015/05/20")

// Day 2 – new SparkContext
val historical =
  sc.hadoopFile("…/mergedLogs/2015/05/20", …)
    .assumePartitionedBy(myPartitioner)
22.
Recovering from Errors
• I write bugs
• You write bugs
• Spark has bugs
• The bugs might appear after 17 hours in stage 78 of your application
• Spark’s failure recovery might not help you
23.
HDFS: It's not so bad
• DiskCachedRDD
• Before doing any work, check if it exists on disk
• If so, just load it
• If not, create it and write it to disk
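A sketch of that pattern in plain Scala (local files standing in for HDFS, a Seq[String] standing in for the RDD; `diskCached` is an illustrative name, not a Spark API):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

def diskCached(path: Path)(compute: => Seq[String]): Seq[String] = {
  if (Files.exists(path)) {
    // checkpoint already on disk: load it instead of recomputing
    scala.io.Source.fromFile(path.toFile).getLines().toList
  } else {
    // first run: do the work, then persist the result for next time
    val result = compute
    Files.write(path, result.mkString("\n").getBytes(StandardCharsets.UTF_8))
    result
  }
}

val p = Files.createTempDirectory("etl").resolve("stage1.txt")
val first = diskCached(p) { Seq("a", "b", "c") }         // computed and written
val second = diskCached(p) { sys.error("never re-run") } // loaded from disk
println(second)  // List(a, b, c)
```

After a crash in a later stage, re-running the application skips every stage whose output already exists, instead of recomputing 17 hours of work.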
24.
Partitions, Partitions, Partitions …
• Partitions should be small
  • Max partition size is 2GB*
  • Small partitions help deal w/ stragglers
  • Small partitions avoid overhead – take a closer look at internals …
• Partitions should be big
  • "For ML applications, the best setting to set the number of partitions to match the number of cores to reduce shuffle size." – Xiangrui Meng on user@
  • Why? Take a closer look at internals …
25.
Parameterize Partition Numbers
• Many transformations take a second parameter
• reduceByKey(…, nPartitions)
• sc.textFile(…, nPartitions)
• Both sides of shuffle matter!
• Shuffle read (aka “reduce”)
• Shuffle write (aka “map”) – controlled by previous stage
• As datasets change, you might need to change the numbers
• Make this a parameter to your application
• Yes, you may need to expose a LOT of parameters
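One lightweight way to do this (plain Scala; flag names and defaults are illustrative) is to thread the partition counts through a small config object parsed from the command line, then feed them to calls like sc.textFile(path, cfg.readPartitions) and reduceByKey(_ + _, cfg.reducePartitions):

```scala
case class EtlConfig(readPartitions: Int, reducePartitions: Int)

def parseConfig(args: Array[String]): EtlConfig = {
  // pair up "--flag value" arguments; unknown flags are simply ignored here
  val opts = args.grouped(2).collect { case Array(k, v) => (k, v) }.toMap
  EtlConfig(
    readPartitions = opts.getOrElse("--read-partitions", "512").toInt,
    reducePartitions = opts.getOrElse("--reduce-partitions", "256").toInt)
}

val cfg = parseConfig(Array("--reduce-partitions", "1024"))
println(cfg)  // EtlConfig(512,1024)
```

As the dataset grows, you retune by changing a command-line flag instead of rebuilding the job.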
27.
Some Demos
• Collect a lot of data
• Slow tasks
• DAG visualization
• RDD names
29.
What data and where is it going?
• Narrow Dependencies (aka "OneToOneDependency")
  • Cheap
• Wide Dependencies (aka shuffles)
  • How much is shuffled?
  • Is it skewed?
• Driver bottleneck
30.
Driver can be a bottleneck
Credit: Sandy Ryza, Cloudera
31.
Driver can be a bottleneck
rdd.collect()
  GOOD: Exploratory data analysis; merging a small set of results.
  BAD: Sequentially scanning the entire data set on the driver – no parallelism, OOM on the driver.

rdd.reduce()
  GOOD: Summarizing the results from a small dataset.
  BAD: Big data structures, from lots of partitions.

sc.accumulator()
  GOOD: Small data types, e.g., counters.
  BAD: Big data structures, from lots of partitions – e.g., a set of a million "most interesting" user ids from each partition.
33.
Stages are not MapReduce Steps!
(Diagram: on the left, a chain of MapReduce steps, each Map → Shuffle → Reduce. On the right, the Spark equivalent: Map, Filter, FlatMap, and the map-side combine of ReduceByKey all pipeline within a stage, with a Shuffle only at each ReduceByKey / GroupByKey boundary, ending in a Collect.)
34.
I still get confused
(discussion in a code review, testing a large sortByKey)
WP: … then we wait for completion of stage 3 …
ME: Hang on, stage 3? Why are there 3 stages? SortByKey does one extra pass to find the range of the keys, but that's only two stages.
WP: The other stage is data generation.
ME: That can't be right. Data generation is pipelined – it's just part of the first stage.
…
ME: Duh – the final sort is two stages: shuffle write, then shuffle read.

(Diagram: Stage 1 – sample the InputRDD to find the range of the keys; Stage 2 – ShuffleMap for the sort; Stage 3 – ShuffleRead for the sort. NB: the InputRDD is computed twice!)
35.
Tip grab bag
• Minimize data volume
  • Compact formats: Avro, Parquet
• Kryo serialization
  • Require class registration in development, but not in production
• Look at data skew, key cardinality
• Tune your cluster
• Use the UI to tune your job
• Set names on all cached RDDs
36.
More Resources
• Very active and friendly community
• http://spark.apache.org/community.html
• Dean Wampler’s self-paced spark workshop
• https://github.com/deanwampler/spark-workshop
• Tips for Better Spark Jobs
• http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
• Tuning & Debugging Spark (with another explanation of internals)
• http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
• Tuning Spark On Yarn
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/