Effective Testing for Spark Programs
or avoiding “I didn’t think that could happen”
Now mostly “works”*
*See developer for details. Does not imply warranty. :p
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Software Engineer
● Currently at Alpine; previously Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Spark Videos http://bit.ly/holdenSparkVideos
What is going to be covered:
● What I think I might know about you
● A bit about why you should test your programs
● Doing traditional unit testing for Spark programs
○ Along with special considerations for SQL/Hive & Streaming
● Using counters & other job acceptance tests w/ Spark
● Cute & scary pictures
○ I promise at least one panda and one cat
● “Future Work”
○ Some of this future work might even get done!
Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Familiar with Apache Spark
○ If not, buy one of my books or watch Paco’s awesome video
● Familiar with one of Scala, Java, or Python
○ If you know R well I’d love to chat though
● Want to make better software
○ (or models, or w/e)
So why should you test?
● Makes you a better person
● Save $s
○ May help you avoid losing your employer all of their money
■ Or “users” if we were in the Bay Area
○ AWS is expensive
● Waiting for our jobs to fail is a pretty long dev cycle
● This is really just to guilt trip you & give you flashbacks to your QA internships
Why don’t we test?
● It’s hard
○ Faking data, setting up integration tests, urgh w/e
● Our tests can get too slow
● It takes a lot of time
○ and people always want everything done yesterday
○ or I just want to go home see my partner
○ etc.
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
An artisanal Spark unit test
@transient private var _sc: SparkContext = _
def sc: SparkContext = _sc // accessor used by the tests & teardown

override def beforeAll() {
  _sc = new SparkContext("local[4]", "test") // master + app name
  super.beforeAll()
}

override def afterAll() {
  if (_sc != null) {
    _sc.stop()
  }
  System.clearProperty("spark.driver.port") // rebind issue
  _sc = null
  super.afterAll()
}
Photo by morinesque
And on to the actual test...
test("really simple transformation") {
val input = List("hi", "hi holden", "bye")
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
assert(tokenize(sc.parallelize(input)).collect().toList === expected)
}
def tokenize(f: RDD[String]) = {
f.map(_.split(" ").toList)
}
Photo by morinesque
Wait, where were the batteries?
Photo by Jim Bauer
Let’s get batteries!
● Spark unit testing
○ spark-testing-base - https://github.com/holdenk/spark-testing-base
○ sscheck - https://github.com/juanrh/sscheck
● Integration testing
○ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
● Performance
○ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
● Spark job validation
○ spark-validator - https://github.com/holdenk/spark-validator
Photo by Mike Mozart
A simple unit test re-visited (Scala)
class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}
A simple unit test (Java)
public class SampleJavaRDDTest
    extends SharedJavaSparkContext implements Serializable {
  @Test public void verifyMapTest() {
    List<Integer> input = Arrays.asList(1, 2);
    JavaRDD<Integer> result = jsc().parallelize(input).map(
      new Function<Integer, Integer>() {
        public Integer call(Integer x) { return x * x; }
      });
    assertEquals(input.size(), result.count());
  }
}
A simple unit test (Python)
class SimpleTest(SparkTestingBaseTestCase):
    """A simple test."""

    def test_basic(self):
        """Test a simple collect."""
        input = ["hello world"]
        rdd = self.sc.parallelize(input)
        result = rdd.collect()
        assert result == input
Making fake data
● If you have production data you can sample, you are lucky!
○ If possible, try to save it in the same format
● sc.parallelize is pretty good for small tests
○ Note that we can specify the number of partitions (see the sketch below)
● Coming up with good test data can take a long time
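For instance, a minimal sketch (the test data here is made up) showing an explicit partition count, so even a small test exercises multi-partition behaviour:

val input = List("hi", "hi holden", "bye")
val rdd = sc.parallelize(input, 3) // three partitions rather than the default
assert(rdd.partitions.size == 3)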
Lori Rielly
QuickCheck / ScalaCheck
● QuickCheck generates test data under a set of constraints
● The Scala version is ScalaCheck - supported by both of the unit testing libraries for Spark
● sscheck
○ Awesome people*, supports generating DStreams too!
● spark-testing-base
○ Also Awesome people*, generates more pathological RDDs
*I assume
Photo by tara hunt
With sscheck
def forallRDDGenOfNtoM = {
  val (minWords, maxWords) = (50, 100)
  Prop.forAll(RDDGen.ofNtoM(minWords, maxWords, arbitrary[String])) { rdd: RDD[String] =>
    rdd.map(_.length()).sum must be_>=(0.0)
  }
}
With spark-testing-base
test("map should not change number of elements") {
forAll(RDDGenerator.genRDD[String](sc)){
rdd => rdd.map(_.length).count() == rdd.count()
}
}
Testing streaming….
Photo by Steve Jurvetson
// Setup our Stream:
class TestInputStream[T: ClassTag](
    @transient var sc: SparkContext,
    ssc_ : StreamingContext,
    input: Seq[Seq[T]],
    numPartitions: Int)
  extends FriendlyInputDStream[T](ssc_) {

  def start() {}

  def stop() {}

  def compute(validTime: Time): Option[RDD[T]] = {
    logInfo("Computing RDD for time " + validTime)
    val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
    val selectedInput = if (index < input.size) input(index) else Seq[T]()
    // lets us test cases where RDDs are not created
    if (selectedInput == null) {
      return None
    }
    val rdd = sc.makeRDD(selectedInput, numPartitions)
    logInfo("Created RDD " + rdd.id + " with " + selectedInput)
    Some(rdd)
  }
}
Artisanal Stream Testing Code
trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll
    with Logging with SharedSparkContext {

  // Name of the framework for Spark context
  def framework: String = this.getClass.getSimpleName

  // Master for Spark context
  def master: String = "local[4]"

  // Batch duration
  def batchDuration: Duration = Seconds(1)

  // Directory where the checkpoint data will be saved
  lazy val checkpointDir = {
    val dir = Utils.createTempDir()
    logDebug(s"checkpointDir: $dir")
    dir.toString
  }

  // Default after function for any streaming test suite. Override this
  // if you want to add your stuff to "after" (i.e., don't call after { } )
  override def afterAll() {
    System.clearProperty("spark.streaming.clock")
    super.afterAll()
  }
Photo by Steve Jurvetson
and continued….
/**
 * Create an input stream for the provided input sequence. This is done using
 * TestInputStream as queueStreams are not checkpointable.
 */
def createTestInputStream[T: ClassTag](
    sc: SparkContext,
    ssc_ : TestStreamingContext,
    input: Seq[Seq[T]]): TestInputStream[T] = {
  new TestInputStream(sc, ssc_, input, numInputPartitions)
}

// Default before function for any streaming test suite. Override this
// if you want to add your stuff to "before" (i.e., don't call before { } )
override def beforeAll() {
  if (useManualClock) {
    logInfo("Using manual clock")
    // We can specify our own clock
    conf.set("spark.streaming.clock",
      "org.apache.spark.streaming.util.TestManualClock")
  } else {
    logInfo("Using real clock")
    conf.set("spark.streaming.clock",
      "org.apache.spark.streaming.util.SystemClock")
  }
  super.beforeAll()
}
/**
 * Run a block of code with the given StreamingContext and automatically
 * stop the context when the block completes or when an exception is thrown.
 */
def withOutputAndStreamingContext[R](
    outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
    (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
  val outputStream = outputStreamSSC._1
  val ssc = outputStreamSSC._2
  try {
    block(outputStream, ssc)
  } finally {
    try {
      ssc.stop(stopSparkContext = false)
    } catch {
      case e: Exception =>
        logError("Error stopping StreamingContext", e)
    }
  }
}
}
and now for the clock
/*
 * Allows us access to a manual clock. Note that the manual clock changed
 * between 1.1.1 and 1.3.
 */
class TestManualClock(var time: Long) extends Clock {
  def this() = this(0L)

  def getTime(): Long = getTimeMillis()     // Compat
  def currentTime(): Long = getTimeMillis() // Compat

  def getTimeMillis(): Long =
    synchronized {
      time
    }

  def setTime(timeToSet: Long): Unit =
    synchronized {
      time = timeToSet
      notifyAll()
    }

  def advance(timeToAdd: Long): Unit =
    synchronized {
      time += timeToAdd
      notifyAll()
    }

  def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat

  /**
   * @param targetTime block until the clock time is set or advanced to at
   *                   least this time
   * @return current time reported by the clock when waiting finishes
   */
  def waitTillTime(targetTime: Long): Long =
    synchronized {
      while (time < targetTime) {
        wait(100)
      }
      getTimeMillis()
    }
}
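A quick usage sketch of the clock above (the times are illustrative):

val clock = new TestManualClock() // starts at time 0
clock.advance(1000)               // move streaming time forward one second
assert(clock.getTimeMillis() == 1000)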
Testing streaming the happy panda way
● Creating test data is hard
○ ssc.queueStream works - unless you need checkpoints (1.4.1+)
● Collecting the data locally is hard
○ foreachRDD & a var (a rough sketch follows)
● Figuring out when your test is “done” is hard
Let’s abstract all that away into testOperation
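Here’s roughly what the manual version looks like (the batch contents and the single-batch wait are made up for illustration):

import scala.collection.mutable.{ArrayBuffer, Queue}

val batches = Queue(sc.makeRDD(Seq("hi", "bye"))) // one made-up input batch
val stream = ssc.queueStream(batches) // fine, until you need checkpoints
val results = ArrayBuffer[Array[String]]()
stream.foreachRDD { rdd =>
  results += rdd.collect() // ship each micro-batch back to the driver
}
ssc.start()
// The fragile part: poll until "enough" output has shown up
while (results.size < 1) {
  Thread.sleep(50)
}
ssc.stop(stopSparkContext = false)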
We can hide all of that:
test("really simple transformation") {
val input = List(List("hi"), List("hi holden"), List("bye"))
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
testOperation[String, String](input, tokenize _, expected, useSet = true)
}
Photo by An eye for my mind
What about DataFrames?
● We can do the same as we did for RDDs
● Inside of Spark validation looks like:
def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row])
● Sadly it’s not in a published package :(
def equalDataFrames(expected: DataFrame, result: DataFrame) {
def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double) {
We can make it easier!*
test("dataframe should be equal to its self") {
val sqlCtx = sqlContext
import sqlCtx.implicits._// Yah I know this is ugly
val input = sc.parallelize(inputList).toDF
equalDataFrames(input, input)
}
*This may or may not be easier.
Photo by allison
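For floating-point columns, a sketch of the approximate variant (the column names, values, and tolerance here are made up):

test("scores should match expected within tolerance") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._
  val expected = sc.parallelize(List(("panda", 1.0))).toDF("name", "score")
  val result = sc.parallelize(List(("panda", 1.0000001))).toDF("name", "score")
  approxEqualDataFrames(expected, result, 1E-5)
}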
Let’s talk about local mode
● It’s way better than you would expect*
● It does its best to try and catch serialization errors
● It’s still not the same as running on a “real” cluster
Photo by: Bev Sykes
Running on a real* cluster
● Start one with your shell scripts & change the master (see the sketch after this list)
○ Really easy way to plug into existing integration testing
● spark-docker - hack in our own tests
● YarnMiniCluster
○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
○ In Spark Testing Base extend SharedMiniCluster
■ Not recommended until after SPARK-10812 is merged & (maybe 1.6 ?)
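A minimal sketch of pointing the same test code at an external master (the master URL is a placeholder for whatever your scripts started):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://test-cluster-master:7077") // placeholder URL
  .setAppName("integration-test")
val sc = new SparkContext(conf)
// now run the same assertions you use in local mode against this context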
Photo by Richard Masoner
On to validation
Why should we validate our jobs?
● Our code will most likely fail
○ Sometimes data sources fail in new & exciting ways (see Mongo)
○ That jerk on that other floor changed the meaning of a field :(
○ Our tests won’t catch all of the corner cases that the real world finds
● We should try and minimize the impact
○ Avoid making potentially embarrassing recommendations
○ Save having to be woken up at 3am to do a roll-back
○ Specifying a few simple invariants isn’t that hard
Photo of GSM intercept by Matt E
So how do we validate our jobs?
● Spark has its own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc. (see the sketch below)
● We can write rules for if the values are expected
○ Simple rules (X > J)
○ Historic rules (X > Avg(J))
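For example, a hedged sketch of counting invalid records with an accumulator (rawInput and parseRecord are hypothetical stand-ins for your pipeline):

val invalidRecords = sc.accumulator(0, "invalidRecords")
val parsed = rawInput.flatMap { line =>
  parseRecord(line) match { // hypothetical parser returning an Option
    case Some(record) => Some(record)
    case None =>
      invalidRecords += 1 // count it, but keep the job running
      None
  }
}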
Photo by:
Paul Schadler
Simple historic validation
Photo by Dvortygirl
val vc = new ValidationConf(jobHistoryPath, "1", true,
  List[ValidationRule](new AvgRule("acc", 0.001, Some(200))))
val v = Validation(sc, vc)
// Some job logic
// Register an accumulator (optional)
val acc = sc.accumulator(0)
v.registerAccumulator(acc, "acc")
// More job logic goes here
if (v.validate(jobId)) {
  // Success logic goes here
} else sadness()
With a Spark internal counter...
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000))))
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Do work here....
assert(v.validate(5) === true)
Photo by Dvortygirl
Related talks & blog posts
● Testing Spark Best Practices (Spark Summit 2014)
● Every Day I’m Shuffling (Strata 2015) & slides
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base
Related packages
● spark-testing-base: https://github.com/holdenk/spark-testing-base
● sscheck: https://github.com/juanrh/sscheck
● spark-validator: https://github.com/holdenk/spark-validator *ALPHA*
● spark-perf - https://github.com/databricks/spark-perf
● spark-integration-tests - https://github.com/databricks/spark-integration-tests
● scalacheck - https://www.scalacheck.org/
“Future Work”
● Integrating into Apache Spark
○ Using their style rules to simplify future transition
● Better ScalaCheck integration (with the help of the sscheck people)
● Some reasonable prefab rules for Job validation
● Testing details in my next Spark book
● Whatever* you all want
○ Testing with Spark survey: http://bit.ly/holdenTestingSpark
Semi-likely:
● Integration testing
*That I feel like doing, or you feel like making a pull request for.
Photo by bullet101
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you want to fill out the survey: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau