Validating
Big Data & ML Pipelines
With Apache Spark & Beam
And Avoiding the Awk
Now mostly "works"*
Melinda
Seckington
Some links (slides & recordings will be at):
http://bit.ly/2RQQqPi
CatLoversShow
Holden:
โ— My name is Holden Karau
โ— Prefered pronouns are she/her
โ— Developer Advocate at Google
โ— Apache Spark PMC, Beam contributor
โ— previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
โ— co-author of Learning Spark & High Performance Spark
โ— Twitter: @holdenkarau
โ— Slide share http://www.slideshare.net/hkarau
โ— Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
โ— Spark Talk Videos http://bit.ly/holdenSparkVideos
โ— Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
What is going to be covered:
โ— A super brief look at property testing
โ— What validation is & why you should do it for your data pipelines
โ— How to make simple validation rules & our current limitations
โ— ML Validation - Guessing if our black box is โ€œcorrectโ€
โ— Cute & scary pictures
โ—‹ I promise at least one cat
Andrew
Who I think you wonderful humans are?
โ— Nice* people
โ— Like silly pictures
โ— Possibly Familiar with one of Scala, Java, or Python?
โ— Possibly Familiar with one of Spark, BEAM, or a similar system (but also ok if
not)
โ— Want to make better software
โ—‹ (or models, or w/e)
โ— Or just want to make software good enough to not have to keep your resume
up to date
So why should you test?
โ— Makes you a better person
โ— Avoid making your users angry
โ— Save $s
โ—‹ Having an ML job fail in hour 26 to restart everything can be expensive...
โ— Waiting for our jobs to fail is a pretty long dev cycle
โ— Honestly youโ€™re probably not watching this unless you agree
So why should you validate?
โ— tl;dr - Your tests probably arenโ€™t perfect
โ— You want to know when you're aboard the failboat
โ— Our code will most likely fail at some point
โ—‹ Sometimes data sources fail in new & exciting ways (see โ€œCall me Maybeโ€)
โ—‹ That jerk on that other floor changed the meaning of a field :(
โ—‹ Our tests wonโ€™t catch all of the corner cases that the real world finds
โ— We should try and minimize the impact
โ—‹ Avoid making potentially embarrassing recommendations
โ—‹ Save having to be woken up at 3am to do a roll-back
โ—‹ Specifying a few simple invariants isnโ€™t all that hard
โ—‹ Repeating Holdenโ€™s mistakes is still not fun
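Specifying invariants really can be tiny. Here is a hedged, framework-free Python sketch (the function and counter names are invented for illustration, not a real API): an invariant is just a predicate over the counters your job already produces.

```python
# Illustrative sketch: invariants as predicates over job counters.
# (check_invariants and these counter names are made up, not a real API.)
def check_invariants(counters, invariants):
    """Return the names of the invariants that failed."""
    return [name for name, pred in invariants.items() if not pred(counters)]

counters = {"records_in": 1000, "records_out": 950, "invalid": 50}
invariants = {
    "no data loss": lambda c: c["records_out"] + c["invalid"] == c["records_in"],
    "mostly valid": lambda c: c["invalid"] < 0.1 * c["records_in"],
}

failed = check_invariants(counters, invariants)  # [] - this run looks fine
```

If `failed` is non-empty, you block the publish step instead of waking someone at 3am after the fact.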
So why should you test & validate:
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
So why should you test & validate - cont
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
Why donโ€™t we test?
โ— Itโ€™s hard
โ—‹ Faking data, setting up integration tests
โ— Our tests can get too slow
โ—‹ Packaging and building scala is already sad
โ— It takes a lot of time
โ—‹ and people always want everything done yesterday
โ—‹ or I just want to go home see my partner
โ—‹ Etc.
โ— Distributed systems is particularly hard
Why donโ€™t we test? (continued)
Why donโ€™t we validate?
โ— We already tested our code
โ—‹ Riiiight?
โ— What could go wrong?
Also extra hard in distributed systems
โ— Distributed metrics are hard
โ— not much built in (not very consistent)
โ— not always deterministic
โ— Complicated production systems
What happens when we donโ€™t
โ— Personal stories go here
โ—‹ I have no comment about where these stories are from
This talk is being recorded so weโ€™ll leave it at:
โ— Negatively impacted the brand in difficult to quantify ways with words with
multiple meanings
โ— Breaking a feature that cost a few million dollars
โ— Almost recommended illegal content (caught by a lucky manual)
โ— Every search result was a coffee shop
itsbruce
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Where do folks get the data for pipeline tests?
โ— Most people generate data by hand
โ— If you have production data you can
sample you are lucky!
โ—‹ If possible you can try and save in the same
format
โ— If our data is a bunch of Vectors or
Doubles Sparkโ€™s got tools :)
โ— Coming up with good test data can
take a long time
โ— Important to test different distributions,
input files, empty partitions etc.
Lori Rielly
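To make the "different distributions & empty partitions" point concrete, here is a plain-Python sketch (the function name is invented - this is a stand-in for what spark-testing-base's generators do to RDDs, not its API) that skews records across partitions so some come out huge and some empty:

```python
import random

def pathological_partitions(n_records, n_partitions, seed=42):
    """Spread records unevenly across partitions: some huge, some empty.

    Illustrative stand-in for pathological RDD generation - the skewed
    exponential choice piles most records into a few partitions, which is
    exactly the corner case that breaks naive per-partition logic.
    """
    rng = random.Random(seed)
    partitions = [[] for _ in range(n_partitions)]
    for i in range(n_records):
        # Most draws land near 0, so high-index partitions stay empty.
        idx = min(int(rng.expovariate(1.0)), n_partitions - 1)
        partitions[idx].append(i)
    return partitions

parts = pathological_partitions(1000, 10)
```

Feed each layout through your per-partition logic locally before you ever touch a cluster.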
Property generating libs: QuickCheck / ScalaCheck
โ— QuickCheck (haskell) generates tests data under a set of constraints
โ— Scala version is ScalaCheck - supported by the two unit testing libraries for
Spark
โ— Sscheck (scala check for spark)
โ—‹ Awesome people*, supports generating DStreams too!
โ— spark-testing-base
โ—‹ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
*I assume
PROtara hunt
With spark-testing-base & a million entries
test("map should not change number of elements") {
  implicit val generatorDrivenConfig =
    PropertyCheckConfig(minSize = 0, maxSize = 1000000)
  val property = forAll(RDDGenerator.genRDD[String](sc)){
    rdd => importantBusinessLogicFunction(rdd).count() == rdd.count()
  }
  check(property)
}
But that can get a bit slow for all of our tests
โ— Not all of your tests should need a cluster (or even a cluster simulator)
โ— If you are ok with not using lambdas everywhere you can factor out that logic
and test with traditional tools
โ— Or if you want to keep those lambdas - or verify the transformations logic
without the overhead of running a local distributed systems you can try a
library like kontextfrei
โ—‹ Donโ€™t rely on this alone (but can work well with something like scalacheck)
Let's focus on validation some more:
*Can be used during integration tests to further validate integration results
So how do we validate our jobs?
โ— The idea is, at some point, you made software which worked.
โ— Maybe you manually tested and sampled your results
โ— Hopefully you did a lot of other checks too
โ— But we canโ€™t do that every time, our pipelines are no longer write-once
run-once they are often write-once, run forever, and debug-forever.
Photo by:
Paul Schadler
How many people have something like this?
val data = ...
val parsed = data.flatMap(x =>
  try {
    Some(parse(x))
  } catch {
    case _: Exception => None // Whatever, it's JSON
  }
)
Lilithis
But we need some data...
val data = ...
data.cache()
val validData = data.filter(isValid)
val badData = data.filter(x => !isValid(x))
if (validData.count() < badData.count()) {
  // Ruh Roh! Special business error handling goes here
}
...
Pager photo by Vitachao CC-SA 3
Well that's less fun :(
● Our optimizer can't just magically chain everything together anymore
● My flatMap.map.map is fnur :(
● Now I'm blocking on a thing in the driver
Sn.Ho
Counters* to the rescue**!
● Both BEAM & Spark have their own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
○ In the UI; can also register a listener from the spark-validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option
● We can _pretend_ we still have nice functional code
*Counters are your friends, but the kind of friends who steal your lunch money
** In a similar way to how regular expressions can solve problems….
Miguel Olaya
So what does that look like?
val parsed = data.flatMap(x => try {
    val result = Some(parse(x))
    happyCounter.add(1)
    result
  } catch {
    case _: Exception =>
      sadCounter.add(1)
      None // Whatever, it's JSON
  }
)
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here
Pager photo by Vitachao CC-SA 3
Phoebe Baker
Ok but what about those *s
โ— Both BEAM & Spark have their it own counters
โ—‹ Per-stage bytes r/w, shuffle r/w, record r/w. execution time, etc.
โ—‹ In UI can also register a listener from spark validator project
โ— We can add counters for things we care about
โ—‹ invalid records, users with no recommendations, etc.
โ—‹ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting
option
โ— We can _pretend_ we still have nice functional code
Miguel Olaya
General Rules for making Validation rules
โ— According to a sad survey most people check execution time & record count
โ— spark-validator is still in early stages but interesting proof of concept
โ— Sometimes your rules will miss-fire and youโ€™ll need to manually approve a job
โ— Remember those property tests? Could be Validation rules
โ— Historical data
โ— Domain specific solutions
Photo by:
Paul Schadler
Turning property tests to validation rules*
โ— Yes in theory theyโ€™re already โ€œtestedโ€ but...
โ— Common function to check accumulator value between validation & tests
โ— The real-world is can be fuzzier
Photo by:
Paul Schadler
Input Schema Validation
โ— Handling the โ€œwrongโ€ type of cat
โ— Many many different approaches
โ—‹ filter/flatMap stages
โ—‹ Working in Scala/Java? .as[T]
โ—‹ Manually specify your schema after doing inference the first time :p
โ— Unless your working on mnist.csv there is a good chance your validation is
going to be fuzzy (reject some records accept others)
โ— How do we know if weโ€™ve rejected too much?
Bradley Gordon
As a relative rule:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map{ x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation Follow
Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead", Some(3000000),
      Some(10000000)))
)
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Business logic goes here
assert(v.validate(5) === true)
Photo by Dvortygirl
Counters in BEAM: (1 of 2)
private final Counter matchedWords =
  Metrics.counter(FilterTextFn.class, "matchedWords");
private final Counter unmatchedWords =
  Metrics.counter(FilterTextFn.class, "unmatchedWords");
// Your special business logic goes here (aka shell out to Fortran or Cobol)
Luke Jones
Counters in BEAM: (2 of 2)
Long matchedWordsValue = metrics.metrics().queryMetrics(
new MetricsFilter.Builder()
.addNameFilter("matchedWords")).counters().next().committed();
Long unmatchedWordsValue = metrics.metrics().queryMetrics(
new MetricsFilter.Builder()
.addNameFilter("unmatchedWords")).counters().next().committed();
assertThat("unmatchWords less than matched words",
unmatchedWordsValue,
lessThan(matchedWordsValue));
Luke Jones
TFDV: Magic*
โ— Counters, schema inference, anomaly detection, oh my!
# Compute statistics over a new set of data
new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
# Compare how new data conforms to the schema
anomalies = tfdv.validate_statistics(new_stats, schema)
# Display anomalies inline
tfdv.display_anomalies(anomalies)
Details:
https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0
% of data change
โ— Not just invalid records, if a fieldโ€™s value changes everywhere it could still be
โ€œvalidโ€ but have a different meaning
โ—‹ Remember that example about almost recommending illegal content?
โ— Join and see number of rows different on each side
โ— Expensive operation, but if your data changes slowly / at a constant ish rate
โ—‹ Sometimes done as a separate parallel job
โ— Can also be used on output if applicable
โ—‹ You do have a table/file/as applicable to roll back to right?
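The row-count flavor of this check is cheap to sketch. A hedged stdlib-Python example (the names and the 10% threshold are illustrative - in practice the counts would come from the join or a separate parallel job, as above): compare the current run against the last one and refuse to publish if the data moved too fast.

```python
def relative_change(previous_count, current_count):
    """Fraction by which the row count moved since the last run."""
    if previous_count == 0:
        raise ValueError("no baseline to compare against")
    return abs(current_count - previous_count) / previous_count

def within_tolerance(previous_count, current_count, tolerance=0.1):
    """True if the data grew/shrank slowly enough to trust this run."""
    return relative_change(previous_count, current_count) <= tolerance

ok = within_tolerance(1_000_000, 1_050_000)   # 5% growth - fine
bad = within_tolerance(1_000_000, 2_000_000)  # doubled overnight - investigate
```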
Not just data changes: Software too
โ— Things change! Yay! Often for the better.
โ—‹ Especially with handling edge cases like NA fields
โ—‹ Donโ€™t expect the results to change - side-by-side run + diff
โ— Have an ML model?
โ—‹ Welcome to new params - or old params with different default values.
โ—‹ Weโ€™ll talk more about that later
โ— Excellent PyData London talk about how this can impact
ML models
โ—‹ Done with sklearn shows vast differences in CVE results only changing
the model number
Francesco
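The "side-by-side run + diff" idea, sketched in stdlib Python (the helper is invented for illustration - a real diff of big outputs would run as a join in the pipeline): run old and new code over the same input and diff the keyed results.

```python
def side_by_side_diff(old_results, new_results, key=lambda r: r[0]):
    """Compare keyed results from two runs of the 'same' job.

    Returns keys present in only one run, plus keys whose values differ -
    if you expected no behavior change, all three sets should be empty.
    """
    old = {key(r): r for r in old_results}
    new = {key(r): r for r in new_results}
    only_old = old.keys() - new.keys()
    only_new = new.keys() - old.keys()
    changed = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    return only_old, only_new, changed

only_old, only_new, changed = side_by_side_diff(
    [("cat", 3), ("dog", 1)],
    [("cat", 3), ("dog", 2), ("fish", 1)],
)
```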
Onto ML (or Beyond ETL :p)
โ— Some of the same principals work (yay!)
โ—‹ Schemas, invalid records, etc.
โ— Some new things to check
โ—‹ CV performance, Feature normalization ranges
โ— Some things donโ€™t really work
โ—‹ Output size probably isnโ€™t that great a metric anymore
โ—‹ Eyeballing the results for override is a lot harder
contraption
Traditional theory (Models)
โ— Human decides it's time to โ€œupdate their modelsโ€
โ— Human goes through a model update run-book
โ— Human does other work while their โ€œbig-dataโ€ job runs
โ— Human deploys X% new models
โ— Looks at graphs
โ— Presses deploy
Andrew
Traditional practice (Models)
โ— Human is cornered by stakeholders and forced to update models
โ— Spends a few hours trying to remember where the guide is
โ— Gives up and kind of wings it
โ— Comes back to a trained model
โ— Human deploys X% models
โ— Human reads reddit/hacker news/etc.
โ— Presses deploy
Bruno Caimi
New possible practice (sometimes)
โ— Computer kicks off job (probably at an hour boundary because *shrug*) to
update model
โ— Workflow tool notices new model is available
โ— Computer deploys X% models
โ— Software looks at monitoring graphs, uses statistical test to see if itโ€™s bad
โ— Robot rolls it back & pager goes off
โ— Human Presses overrides and deploys anyways
Henrique Pinto
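The "uses a statistical test to see if it's bad" step can be as boring as a two-proportion z-test. A sketch with invented names (real monitoring stacks are fancier, but the bones look like this): compare the new model's error rate against the old one's over the X% traffic slice.

```python
import math

def error_rate_regressed(old_errors, old_total, new_errors, new_total,
                         z_crit=2.58):
    """One-sided two-proportion z-test: is the new model's error rate
    significantly worse than the old one's? z_crit ~ 2.58 is roughly a
    99% confidence cutoff - tune it to how often you want to be paged."""
    p_old = old_errors / old_total
    p_new = new_errors / new_total
    p_pool = (old_errors + new_errors) / (old_total + new_total)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / old_total + 1 / new_total))
    if se == 0:
        return False
    return (p_new - p_old) / se > z_crit

# 0.5% -> 1.2% errors over 10k requests each: roll it back.
rollback = error_rate_regressed(50, 10_000, 120, 10_000)
```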
Extra considerations for ML jobs:
โ— Harder to look at output size and say if its good
โ— We can look at the cross-validation performance
โ— Fixed test set performance
โ— Number of iterations / convergence rate
โ— Number of features selected / number of features
changed in selection
โ— (If applicable) delta in model weights or tree size or ...
Jennifer C.
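For the "delta in model weights" check, a hedged sketch (names and the threshold are illustrative): compare the new weight vector against the previous one and flag retrains whose weights jumped suspiciously far in a single step.

```python
def weight_delta(old_weights, new_weights):
    """L2 distance between two weight vectors of the same shape."""
    if len(old_weights) != len(new_weights):
        raise ValueError("models are not comparable")
    return sum((a - b) ** 2 for a, b in zip(old_weights, new_weights)) ** 0.5

def model_looks_sane(old_weights, new_weights, max_delta=1.0):
    """Flag retrains whose weights moved farther than we expect per run."""
    return weight_delta(old_weights, new_weights) <= max_delta

ok = model_looks_sane([0.1, 0.2, 0.3], [0.12, 0.19, 0.31])
```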
Cross-validation
because saving a test set is effort
โ— Trains on X% of the data and tests on Y%
โ—‹ Multiple times switching the samples
โ— org.apache.spark.ml.tuning has the tools for auto fitting
using CB
โ—‹ If your going to use this for auto-tuning please please save a test set
โ—‹ Otherwise your models will look awesome and perform like a ford
pinto (or whatever a crappy car is here. Maybe a renault reliant?)
Jonathan Kotta
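What CV is doing under the hood, as a tiny pure-Python sketch (Spark's org.apache.spark.ml.tuning handles this for you - this just shows the train/test switching):

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each of the k folds takes a turn as the test set while the rest train.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
```

And per the slide above: even with CV, keep a held-out test set for the final number.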
False sense of security:
โ— A/B test please even if CV says many many $s
โ— Rank based things can have training bias with previous
orders
โ— Non-displayed options: unlikely to be chosen
โ— Sometimes can find previous formulaic corrections
โ— Sometimes we can โ€œexperimentallyโ€ determine
โ— Other times we just hope itโ€™s better than nothing
โ— Try and make sure your ML isnโ€™t evil or re-encoding
human biases but stronger
The state of serving is generally a mess
โ— If itโ€™s not ML models its can be better
โ—‹ Reports for everyone!
โ—‹ Or database updates for everyone!
โ— Big challenge: when something goes wrong - how do I
fix it?
โ—‹ Something will go wrong eventually - do you have an old snap shot
you can roll back to quickly?
โ— One project which aims to improve this for ML is
KubeFlow
โ—‹ Goal is unifying training & serving experiences
โ—‹ Despite the name targeting more than just TensorFlow
โ—‹ Doesnโ€™t work with Spark yet, but itโ€™s on my TODO list.
Updating your model
โ— The real world changes
โ— Online learning (streaming) is super cool, but hard to
version
โ—‹ Common kappa-like arch and then revert to checkpoint
โ—‹ Slowly degrading models, oh my!
โ— Iterative batches: automatically train on new data,
deploy model, and A/B test
โ— But A/B testing isnโ€™t enough -- bad data can result in
wrong or even illegal results (ask me after a bud light
lime)
Jennifer C.
Some ending notes
โ— Your validation rules donโ€™t have to be perfect
โ—‹ But they should be good enough they alert infrequently
โ— You should have a way for the human operator to
override.
โ— Just like tests, try and make your validation rules
specific and actionable
โ—‹ # of input rows changed is not a great message - table XYZ grew
unexpectedly to Y%
โ— While you can use (some of) your tests as a basis for
your rules, your rules need tests too
โ—‹ e.g. add junk records/pure noise and see if it rejects
James Petts
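The "add junk records and see if it rejects" rule-test, sketched in Python (the record shape and the is_valid rule are invented for illustration): generate plausible records and pure noise, then assert the rule accepts the former and rejects the latter.

```python
import random

def plausible_record(rng):
    """A record shaped like what the pipeline normally sees (illustrative)."""
    return {"user_id": rng.randrange(1, 10_000), "score": rng.random()}

def junk_record(rng):
    """Pure noise: wrong types, missing fields, out-of-range values."""
    return rng.choice([
        {"user_id": -1, "score": 2.5},
        {"score": 0.5},                          # missing user_id
        {"user_id": "not-an-int", "score": None},
    ])

def is_valid(record):
    """The validation rule under test (illustrative)."""
    return (isinstance(record.get("user_id"), int)
            and record["user_id"] > 0
            and isinstance(record.get("score"), float)
            and 0.0 <= record["score"] <= 1.0)

rng = random.Random(0)
accepted = sum(1 for _ in range(100) if is_valid(plausible_record(rng)))
rejected = sum(1 for _ in range(100) if not is_valid(junk_record(rng)))
```

If the rule lets junk through (or rejects good records), you find out in CI rather than from the pager.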
Related talks & blog posts
โ— Testing Spark Best Practices (Spark Summit 2014)
โ— Every Day Iโ€™m Shuffling (Strata 2015) & slides
โ— Spark and Spark Streaming Unit Testing
โ— Making Spark Unit Testing With Spark Testing Base
โ— Testing strategy for Apache Spark jobs
โ— The BEAM programming guide
Interested in OSS (especially Spark)?
โ— Check out my Twitch & Youtube for livestreams - http://twitch.tv/holdenkarau
& https://www.youtube.com/user/holdenkarau
Becky Lai
Related packages
โ— spark-testing-base: https://github.com/holdenk/spark-testing-base
โ— sscheck: https://github.com/juanrh/sscheck
โ— spark-validator: https://github.com/holdenk/spark-validator *Proof of
concept, do not actually use*
โ— spark-perf - https://github.com/databricks/spark-perf
โ— spark-integration-tests - https://github.com/databricks/spark-integration-tests
โ— scalacheck - https://www.scalacheck.org/
Becky Lai
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today; not a lot on testing and almost nothing on validation, but that should not stop you from buying several copies (if you have an expense account).
Cats love it!
Amazon sells it: http://bit.ly/hkHighPerfSpark :D
Sign up for the mailing list @
http://www.distributedcomputing4kids.com
And some upcoming talks:
โ— November
โ—‹ Big Data Spain again (tomorrow @ 16:10)
โ—‹ Scale By The Bay - San Francisco
โ— December
โ—‹ ScalaX - London
โ— January
โ—‹ Data Day Texas
โ— February
โ—‹ TBD
โ— March
โ—‹ Strata San Francisco
Cat wave photo by Quinn Dombrowski
k thnx bye! (or questions…)
If you want to fill out the survey: http://bit.ly/holdenTestingSpark
I will update the results & give the talk again the next time Spark adds a major feature.
Give feedback on this presentation: http://bit.ly/holdenTalkFeedback
Have questions? - sli.do: SL18 - Union Grand EF
I'm sadly heading out to Spark Summit right after this, but e-mail me: holden@pigscanfly.ca
And including spark-testing-base up to spark 2.3.1
sbt:
"com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "test"
Maven:
<dependency>
<groupId>com.holdenkarau</groupId>
<artifactId>spark-testing-base_2.11</artifactId>
<version>${spark.version}_0.10.0</version>
<scope>test</scope>
</dependency>
Vladimir Pustovit
Other options for generating data:
โ— mapPartitions + Random + custom code
โ— RandomRDDs in mllib
โ—‹ Uniform, Normal, Possion, Exponential, Gamma, logNormal & Vector versions
โ—‹ Different type: implement the RandomDataGenerator interface
โ— Random
RandomRDDs
val zipRDD = RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows)
  .map(_.toInt.toString)
val valuesRDD = RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols)
  .repartition(zipRDD.partitions.size)
val keyRDD = sc.parallelize(1L.to(rows), zipRDD.getNumPartitions)
keyRDD.zipPartitions(zipRDD, valuesRDD){
  (i1, i2, i3) =>
    new Iterator[(Long, String, Vector)] {
      ...
Testing libraries:
โ— Spark unit testing
โ—‹ spark-testing-base - https://github.com/holdenk/spark-testing-base
โ—‹ sscheck - https://github.com/juanrh/sscheck
โ— Simplified unit testing (โ€œbusiness logic onlyโ€)
โ—‹ kontextfrei - https://github.com/dwestheide/kontextfrei *
โ— Integration testing
โ—‹ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
โ— Performance
โ—‹ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
โ— Spark job validation
โ—‹ spark-validator - https://github.com/holdenk/spark-validator *
Photo by Mike Mozart
*Early stage, work-in-progress, or proof of concept
Letโ€™s talk about local mode
โ— Itโ€™s way better than you would expect*
โ— It does its best to try and catch serialization errors
โ— Itโ€™s still not the same as running on a โ€œrealโ€ cluster
โ— Especially since if we were just local mode, parallelize and collect might be
fine
Photo by: Bev Sykes
Options beyond local mode:
โ— Just point at your existing cluster (set master)
โ— Start one with your shell scripts & change the master
โ—‹ Really easy way to plug into existing integration testing
โ— spark-docker - hack in our own tests
โ— YarnMiniCluster
โ—‹ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/
BaseYarnClusterSuite.scala
โ—‹ In Spark Testing Base extend SharedMiniCluster
โ–  Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+)
Photo by Richard Masoner
Integration testing - docker is awesome
โ— Spark-docker, kafka-docker, etc.
โ—‹ Not always super up to date sadly - if you are last stable release A-OK, if you build from
master - sad pandas
โ— Or checkout JuJu Charms (from Canonical) - https://jujucharms.com/
โ—‹ Makes it easy to deploy a bunch of docker containers together & configured in a reasonable
way.
Setting up integration on Yarn/Mesos
โ— So lucky!
โ— You can write your tests in the same way as before - just read from your test
data sources
โ— Missing a data source?
โ—‹ Can you sample it or fake it using the techniques from before?
โ—‹ If so - do that and save the result to your integration enviroment
โ—‹ If notโ€ฆ well good luck
โ— Need streaming integration?
โ—‹ You will probably need a second Spark (or other) job to generate the test data
โ€œBusiness logicโ€ only test w/kontextfrei
import com.danielwestheide.kontextfrei.DCollectionOps
trait UsersByPopularityProperties[DColl[_]] extends
BaseSpec[DColl] {
import DCollectionOps.Imports._
property("Each user appears only once") {
forAll { starredEvents: List[RepoStarred] =>
val result =
logic.usersByPopularity(unit(starredEvents)).collect().toList
result.distinct mustEqual result
}
}
โ€ฆ (continued in example/src/test/scala/com/danielwestheide/kontextfrei/example/)
Generating Data with Spark
import org.apache.spark.mllib.random.RandomRDDs
...
RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows)
RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols)

More Related Content

What's hot

Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Holden Karau
ย 
PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupHolden Karau
ย 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Holden Karau
ย 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkHolden Karau
ย 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...Holden Karau
ย 
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Holden Karau
ย 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
ย 
Validating Big Data Jobsโ€”Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobsโ€”Stopping Failures Before Production on Apache Spark... Validating Big Data Jobsโ€”Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobsโ€”Stopping Failures Before Production on Apache Spark...Databricks
ย 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYCHolden Karau
ย 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Holden Karau
ย 
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)Holden Karau
ย 
Devoxx France: Fault tolerant microservices on the JVM with Cassandra
Devoxx France: Fault tolerant microservices on the JVM with CassandraDevoxx France: Fault tolerant microservices on the JVM with Cassandra
Devoxx France: Fault tolerant microservices on the JVM with CassandraChristopher Batey
ย 
HTTP2 in action - Piet Van Dongen - Codemotion Amsterdam 2017
HTTP2 in action - Piet Van Dongen - Codemotion Amsterdam 2017HTTP2 in action - Piet Van Dongen - Codemotion Amsterdam 2017
HTTP2 in action - Piet Van Dongen - Codemotion Amsterdam 2017Codemotion
ย 
Managing Memory in Swift (Yes, that's a thing)
Managing Memory in Swift (Yes, that's a thing)Managing Memory in Swift (Yes, that's a thing)
Managing Memory in Swift (Yes, that's a thing)Carl Brown
ย 
Node.js: CAMTA Presentation
Node.js: CAMTA PresentationNode.js: CAMTA Presentation
Node.js: CAMTA PresentationRob Tweed
ย 
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldKonrad Malawski
ย 
Os Whitaker
Os WhitakerOs Whitaker
Os Whitakeroscon2007
ย 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Demi Ben-Ari
ย 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsKonrad Malawski
ย 
Erik Wendel - Beyond JavaScript Frameworks: Writing Reliable Web Apps With El...
Erik Wendel - Beyond JavaScript Frameworks: Writing Reliable Web Apps With El...Erik Wendel - Beyond JavaScript Frameworks: Writing Reliable Web Apps With El...
Erik Wendel - Beyond JavaScript Frameworks: Writing Reliable Web Apps With El...Codemotion
ย 

What's hot (20)

Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018
ย 
PySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March MeetupPySpark on Kubernetes @ Python Barcelona March Meetup
PySpark on Kubernetes @ Python Barcelona March Meetup
ย 
Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?Using Spark ML on Spark Errors - What do the clusters tell us?
Using Spark ML on Spark Errors - What do the clusters tell us?
ย 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
ย 
Powering tensor flow with big data using apache beam, flink, and spark cern...
Powering tensor flow with big data using apache beam, flink, and spark   cern...Powering tensor flow with big data using apache beam, flink, and spark   cern...
Powering tensor flow with big data using apache beam, flink, and spark cern...
ย 
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018Debugging Spark:  Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
ย 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
ย 
Validating Big Data Jobsโ€”Stopping Failures Before Production on Apache Spark...
 Validating Big Data Jobsโ€”Stopping Failures Before Production on Apache Spark... Validating Big Data Jobsโ€”Stopping Failures Before Production on Apache Spark...
Validating Big Data Jobsโ€”Stopping Failures Before Production on Apache Spark...
ย 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional   w/ Apache Spark @ Scala Days NYCKeeping the fun in functional   w/ Apache Spark @ Scala Days NYC
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
ย 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
ย 
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
ย 
Devoxx France: Fault tolerant microservices on the JVM with Cassandra
Devoxx France: Fault tolerant microservices on the JVM with CassandraDevoxx France: Fault tolerant microservices on the JVM with Cassandra
Devoxx France: Fault tolerant microservices on the JVM with Cassandra
ย 
HTTP2 in action - Piet Van Dongen - Codemotion Amsterdam 2017
HTTP2 in action - Piet Van Dongen - Codemotion Amsterdam 2017HTTP2 in action - Piet Van Dongen - Codemotion Amsterdam 2017
HTTP2 in action - Piet Van Dongen - Codemotion Amsterdam 2017
ย 
Managing Memory in Swift (Yes, that's a thing)
Managing Memory in Swift (Yes, that's a thing)Managing Memory in Swift (Yes, that's a thing)
Managing Memory in Swift (Yes, that's a thing)
ย 
Node.js: CAMTA Presentation
Node.js: CAMTA PresentationNode.js: CAMTA Presentation
Node.js: CAMTA Presentation
ย 
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorld
ย 
Os Whitaker
Os WhitakerOs Whitaker
Os Whitaker
ย 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
ย 
Not Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabsNot Only Streams for Akademia JLabs
Not Only Streams for Akademia JLabs
ย 
Erik Wendel - Beyond JavaScript Frameworks: Writing Reliable Web Apps With El...
Erik Wendel - Beyond JavaScript Frameworks: Writing Reliable Web Apps With El...Erik Wendel - Beyond JavaScript Frameworks: Writing Reliable Web Apps With El...
Erik Wendel - Beyond JavaScript Frameworks: Writing Reliable Web Apps With El...
ย 



Validating Big Data Pipelines - Big Data Spain 2018

  • 1. Validating Big Data & ML Pipelines With Apache Spark & Beam And Avoiding the Awk Now mostly "works"* Melinda Seckington
  • 2. Some links (slides & recordings will be at): http://bit.ly/2RQQqPi CatLoversShow
  • 3. Holden: ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, Beam contributor ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos ● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
  • 4.
  • 5. What is going to be covered: ● A super brief look at property testing ● What validation is & why you should do it for your data pipelines ● How to make simple validation rules & our current limitations ● ML Validation - Guessing if our black box is "correct" ● Cute & scary pictures ○ I promise at least one cat Andrew
  • 6. Who I think you wonderful humans are? ● Nice* people ● Like silly pictures ● Possibly Familiar with one of Scala, Java, or Python? ● Possibly Familiar with one of Spark, BEAM, or a similar system (but also ok if not) ● Want to make better software ○ (or models, or w/e) ● Or just want to make software good enough to not have to keep your resume up to date
  • 7. So why should you test? ● Makes you a better person ● Avoid making your users angry ● Save $s ○ Having an ML job fail in hour 26 to restart everything can be expensive... ● Waiting for our jobs to fail is a pretty long dev cycle ● Honestly you're probably not watching this unless you agree
  • 8. So why should you validate? ● tl;dr - Your tests probably aren't perfect ● You want to know when you're aboard the failboat ● Our code will most likely fail at some point ○ Sometimes data sources fail in new & exciting ways (see "Call me Maybe") ○ That jerk on that other floor changed the meaning of a field :( ○ Our tests won't catch all of the corner cases that the real world finds ● We should try and minimize the impact ○ Avoid making potentially embarrassing recommendations ○ Save having to be woken up at 3am to do a roll-back ○ Specifying a few simple invariants isn't all that hard ○ Repeating Holden's mistakes is still not fun
  • 9. So why should you test & validate: Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
  • 10. So why should you test & validate - cont Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
  • 11. Why don't we test? ● It's hard ○ Faking data, setting up integration tests ● Our tests can get too slow ○ Packaging and building scala is already sad ● It takes a lot of time ○ and people always want everything done yesterday ○ or I just want to go home see my partner ○ Etc. ● Distributed systems are particularly hard
  • 12. Why don't we test? (continued)
  • 13. Why don't we validate? ● We already tested our code ○ Riiiight? ● What could go wrong? Also extra hard in distributed systems ● Distributed metrics are hard ● not much built in (not very consistent) ● not always deterministic ● Complicated production systems
  • 14. What happens when we don't ● Personal stories go here ○ I have no comment about where these stories are from This talk is being recorded so we'll leave it at: ● Negatively impacted the brand in difficult to quantify ways with words with multiple meanings ● Breaking a feature that cost a few million dollars ● Almost recommended illegal content (caught by a lucky manual) ● Every search result was a coffee shop itsbruce
  • 15. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
  • 16. Where do folks get the data for pipeline tests? ● Most people generate data by hand ● If you have production data you can sample you are lucky! ○ If possible you can try and save in the same format ● If our data is a bunch of Vectors or Doubles Spark's got tools :) ● Coming up with good test data can take a long time ● Important to test different distributions, input files, empty partitions etc. Lori Rielly
  • 17. Property generating libs: QuickCheck / ScalaCheck ● QuickCheck (haskell) generates test data under a set of constraints ● Scala version is ScalaCheck - supported by the two unit testing libraries for Spark ● Sscheck (scala check for spark) ○ Awesome people*, supports generating DStreams too! ● spark-testing-base ○ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs *I assume PROtara hunt
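The property-based idea from this slide can be sketched without any framework: generate many random inputs (starting at size zero, which mirrors spark-testing-base's pathological empty-partition RDDs) and assert an invariant over a plain function. `important_business_logic` here is a hypothetical stand-in for a pipeline stage, not anything from the libraries above.

```python
import random

def important_business_logic(records):
    # Hypothetical stand-in for a pipeline stage: a 1-to-1 map
    # should never change the number of elements.
    return [r.upper() for r in records]

def check_count_preserved(trials=100, max_size=50, seed=42):
    rng = random.Random(seed)
    for _ in range(trials):
        # Sizes start at 0 so the "empty input" corner case is covered too.
        size = rng.randint(0, max_size)
        records = ["".join(rng.choices("abc xyz", k=5)) for _ in range(size)]
        assert len(important_business_logic(records)) == len(records)
    return True
```

Running `check_count_preserved()` exercises the invariant over a hundred random inputs; ScalaCheck's `forAll` on the next slide does the same thing with real RDDs.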
  • 18. With spark-testing-base & a million entries test("map should not change number of elements") { implicit val generatorDrivenConfig = PropertyCheckConfig(minSize = 0, maxSize = 1000000) val property = forAll(RDDGenerator.genRDD[String](sc)){ rdd => importantBusinessLogicFunction(rdd).count() == rdd.count() } check(property) }
  • 19. But that can get a bit slow for all of our tests ● Not all of your tests should need a cluster (or even a cluster simulator) ● If you are ok with not using lambdas everywhere you can factor out that logic and test with traditional tools ● Or if you want to keep those lambdas - or verify the transformations logic without the overhead of running a local distributed system you can try a library like kontextfrei ○ Don't rely on this alone (but can work well with something like scalacheck)
  • 20. Let's focus on validation some more: *Can be used during integration tests to further validate integration results
  • 21. So how do we validate our jobs? ● The idea is, at some point, you made software which worked. ● Maybe you manually tested and sampled your results ● Hopefully you did a lot of other checks too ● But we can't do that every time; our pipelines are no longer write-once run-once, they are often write-once, run forever, and debug-forever. Photo by: Paul Schadler
  • 22. How many people have something like this? val data = ... val parsed = data.flatMap(x => try { Some(parse(x)) } catch { case _ => None // Whatever, it's JSON }) Lilithis
  • 23. But we need some data... val data = ... data.cache() val validData = data.filter(isValid) val badData = data.filter(! isValid(_)) if (validData.count() < badData.count()) { // Ruh Roh! Special business error handling goes here } ... Pager photo by Vitachao CC-SA 3
  • 24. Well that's less fun :( ● Our optimizer can't just magically chain everything together anymore ● My flatMap.map.map is fnur :( ● Now I'm blocking on a thing in the driver Sn.Ho
  • 25. Counters* to the rescue**! ● Both BEAM & Spark have their own counters ○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc. ○ In UI can also register a listener from spark validator project ● We can add counters for things we care about ○ invalid records, users with no recommendations, etc. ○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● We can _pretend_ we still have nice functional code *Counters are your friends, but the kind of friends who steal your lunch money ** In a similar way to how regular expressions can solve problems.... Miguel Olaya
  • 26. So what does that look like? val parsed = data.flatMap(x => try { val result = parse(x); happyCounter.add(1); Some(result) } catch { case _ => sadCounter.add(1); None // Whatever, it's JSON }) // Special business data logic (aka wordcount) // Much much later* business error logic goes here Phoebe Baker
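A plain-Python analogue of the counter pattern on this slide, assuming a toy `parse` (JSON decoding here) rather than your real job, and a minimal `Counter` class standing in for a Spark accumulator or Beam metrics counter. Parsing and counting stay inline; the error-handling decision is deferred until after the "action":

```python
import json

class Counter:
    """Minimal stand-in for a Spark accumulator / Beam metrics counter."""
    def __init__(self):
        self.value = 0
    def add(self, n):
        self.value += n

happy_counter, sad_counter = Counter(), Counter()

def parse_or_count(line):
    # flatMap-style: return a list of 0 or 1 records per input line.
    try:
        result = json.loads(line)
        happy_counter.add(1)
        return [result]
    except ValueError:
        sad_counter.add(1)
        return []  # Whatever, it's JSON

data = ['{"user": 1}', 'not json', '{"user": 2}']
parsed = [rec for line in data for rec in parse_or_count(line)]
# After the "action", the much-much-later error logic inspects the counters.
```

After the run, `happy_counter.value` and `sad_counter.value` feed whatever validation rule you write, without blocking mid-pipeline the way the filter-and-count version does.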
  • 27. Ok but what about those *s ● Both BEAM & Spark have their own counters ○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc. ○ In UI can also register a listener from spark validator project ● We can add counters for things we care about ○ invalid records, users with no recommendations, etc. ○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● We can _pretend_ we still have nice functional code Miguel Olaya
  • 28. General Rules for making Validation rules ● According to a sad survey most people check execution time & record count ● spark-validator is still in early stages but interesting proof of concept ● Sometimes your rules will misfire and you'll need to manually approve a job ● Remember those property tests? Could be Validation rules ● Historical data ● Domain specific solutions Photo by: Paul Schadler
  • 29. Turning property tests to validation rules* ● Yes in theory they're already "tested" but... ● Common function to check accumulator value between validation & tests ● The real world can be fuzzier Photo by: Paul Schadler
  • 30. Input Schema Validation ● Handling the "wrong" type of cat ● Many many different approaches ○ filter/flatMap stages ○ Working in Scala/Java? .as[T] ○ Manually specify your schema after doing inference the first time :p ● Unless you're working on mnist.csv there is a good chance your validation is going to be fuzzy (reject some records, accept others) ● How do we know if we've rejected too much? Bradley Gordon
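Fuzzy schema validation from this slide, sketched in plain Python with a made-up schema (field names and types are illustrative only): keep the records that fit, reject the rest, and report the rejection rate so a later rule can decide whether "too much" was thrown away.

```python
# A hypothetical record schema: required field name -> expected type.
EXPECTED_SCHEMA = {"user_id": int, "query": str}

def matches_schema(record, schema=EXPECTED_SCHEMA):
    return all(
        field in record and isinstance(record[field], expected_type)
        for field, expected_type in schema.items()
    )

def split_by_schema(records):
    """Fuzzy validation: keep what fits, reject the rest, report the rate."""
    good = [r for r in records if matches_schema(r)]
    bad = [r for r in records if not matches_schema(r)]
    reject_rate = len(bad) / max(len(records), 1)
    return good, bad, reject_rate

records = [
    {"user_id": 1, "query": "cats"},
    {"user_id": "oops", "query": "dogs"},   # the "wrong" type of cat
    {"user_id": 3, "query": "more cats"},
]
good, bad, rate = split_by_schema(records)
```

The `rate` is exactly the kind of number the relative rule on the next slide thresholds on.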
  • 31. As a relative rule: val (ok, bad) = (sc.accumulator(0), sc.accumulator(0)) val records = input.map{ x => if (isValid(x)) ok += 1 else bad += 1 // Actual parse logic here } // An action (e.g. count, save, etc.) if (bad.value > 0.1 * ok.value) { throw new Exception("bad data - do not use results") // Optional cleanup } // Mark as safe P.S: If you are interested in this check out spark-validator (still early stages). Found Animals Foundation Follow
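The relative rule on this slide, plus the absolute recordsRead bounds shown on the next one, can be sketched framework-free in the spirit of spark-validator's ValidationConf. Every name and threshold here is a hypothetical illustration, not the library's API:

```python
def relative_rule(ok, bad, max_bad_fraction=0.1):
    # Fail the run when bad records exceed a fraction of the good ones.
    return bad <= max_bad_fraction * ok

def absolute_rule(value, minimum=None, maximum=None):
    # e.g. recordsRead between 3,000,000 and 10,000,000.
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True

def validate(counters, rules):
    """counters: name -> value; rules: name -> predicate over that value."""
    return all(rule(counters[name]) for name, rule in rules.items())

counters = {"ok": 950, "bad": 50, "recordsRead": 5_000_000}
rules = {
    "bad": lambda v: relative_rule(counters["ok"], v),
    "recordsRead": lambda v: absolute_rule(v, 3_000_000, 10_000_000),
}
safe = validate(counters, rules)  # only mark the output safe when True
```

Keeping each rule a plain predicate is what lets the same function serve both property tests and production validation, as slide 29 suggests.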
  • 32. Validating records read matches our expectations: val vc = new ValidationConf(tempPath, "1", true, List[ValidationRule]( new AbsoluteSparkCounterValidationRule("recordsRead", Some(3000000), Some(10000000))) ) val sqlCtx = new SQLContext(sc) val v = Validation(sc, sqlCtx, vc) //Business logic goes here assert(v.validate(5) === true) } Photo by Dvortygirl
  • 33. Counters in BEAM: (1 of 2) private final Counter matchedWords = Metrics.counter(FilterTextFn.class, "matchedWords"); private final Counter unmatchedWords = Metrics.counter(FilterTextFn.class, "unmatchedWords"); // Your special business logic goes here (aka shell out to Fortan or Cobol) Luke Jones
  • 34. Counters in BEAM: (2 of 2) Long matchedWordsValue = metrics.metrics().queryMetrics( new MetricsFilter.Builder() .addNameFilter("matchedWords")).counters().next().committed(); Long unmatchedWordsValue = metrics.metrics().queryMetrics( new MetricsFilter.Builder() .addNameFilter("unmatchedWords")).counters().next().committed(); assertThat("unmatchWords less than matched words", unmatchedWordsValue, lessThan(matchedWordsValue)); Luke Jones
  • 35. TFDV: Magic* ● Counters, schema inference, anomaly detection, oh my! # Compute statistics over a new set of data new_stats = tfdv.generate_statistics_from_csv(NEW_DATA) # Compare how new data conforms to the schema anomalies = tfdv.validate_statistics(new_stats, schema) # Display anomalies inline tfdv.display_anomalies(anomalies) Details: https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0
  • 36. % of data change ● Not just invalid records: if a field's value changes everywhere it could still be "valid" but have a different meaning ○ Remember that example about almost recommending illegal content? ● Join and see the number of rows different on each side ● Expensive operation, but feasible if your data changes slowly / at a constant-ish rate ○ Sometimes done as a separate parallel job ● Can also be used on output if applicable ○ You do have a table/file/as applicable to roll back to, right?
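The join-and-diff idea could be sketched like this: key both snapshots, join on the key, and count rows that differ, appeared, or disappeared. The function name and record shape are illustrative assumptions.

```python
# Sketch of a "% of data change" check between two snapshots.
def fraction_changed(old_rows, new_rows, key=lambda r: r[0]):
    """Fraction of keys whose row differs between snapshots,
    counting rows that only exist on one side as changed."""
    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    all_keys = set(old) | set(new)
    if not all_keys:
        return 0.0
    differing = sum(1 for k in all_keys if old.get(k) != new.get(k))
    return differing / len(all_keys)
```

A validation rule would then alert when this fraction exceeds the change rate you normally see between runs.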
  • 37. Not just data changes: Software too ● Things change! Yay! Often for the better. ○ Especially with handling edge cases like NA fields ○ Don't expect the results to change - side-by-side run + diff ● Have an ML model? ○ Welcome to new params - or old params with different default values. ○ We'll talk more about that later ● Excellent PyData London talk about how this can impact ML models ○ Done with sklearn, shows vast differences in CV results only changing the version number Francesco
  • 38. Onto ML (or Beyond ETL :p) ● Some of the same principles work (yay!) ○ Schemas, invalid records, etc. ● Some new things to check ○ CV performance, feature normalization ranges ● Some things don't really work ○ Output size probably isn't that great a metric anymore ○ Eyeballing the results for override is a lot harder contraption
  • 39. Traditional theory (Models) ● Human decides it's time to "update their models" ● Human goes through a model update run-book ● Human does other work while their "big-data" job runs ● Human deploys X% new models ● Looks at graphs ● Presses deploy Andrew
  • 40. Traditional practice (Models) ● Human is cornered by stakeholders and forced to update models ● Spends a few hours trying to remember where the guide is ● Gives up and kind of wings it ● Comes back to a trained model ● Human deploys X% models ● Human reads reddit/hacker news/etc. ● Presses deploy Bruno Caimi
  • 41. New possible practice (sometimes) ● Computer kicks off job (probably at an hour boundary because *shrug*) to update model ● Workflow tool notices new model is available ● Computer deploys X% models ● Software looks at monitoring graphs, uses statistical test to see if it's bad ● Robot rolls it back & pager goes off ● Human presses override and deploys anyway Henrique Pinto
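The "uses statistical test to see if it's bad" step could be sketched as a rough two-proportion z-test on error rates between the old and new model's traffic slices. The function name and the 2.0 threshold are illustrative assumptions, not a recommendation.

```python
import math

# Sketch of an automated rollback decision: is the new model's error rate
# significantly higher than the old model's, given the sample sizes?
def should_roll_back(err_old, n_old, err_new, n_new, z_threshold=2.0):
    p_old, p_new = err_old / n_old, err_new / n_new
    pooled = (err_old + err_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        return False  # No variance to test against
    return (p_new - p_old) / se > z_threshold
```

With this shape, the robot rolls back on a clear regression and leaves noise-level differences alone (until the human overrides anyway).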
  • 42. Extra considerations for ML jobs: ● Harder to look at output size and say if it's good ● We can look at the cross-validation performance ● Fixed test set performance ● Number of iterations / convergence rate ● Number of features selected / number of features changed in selection ● (If applicable) delta in model weights or tree size or ... Jennifer C.
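The "delta in model weights" check could be sketched as comparing the newly trained weight vector against the previous run's. The helper names and threshold are illustrative assumptions.

```python
import math

# Sketch of a weight-drift validation rule between model versions.
def weight_delta(old_weights, new_weights):
    """L2 distance between two weight vectors of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(old_weights, new_weights)))

def weights_drifted(old_weights, new_weights, max_delta):
    """True if the new model moved further than we'd expect for one retrain."""
    return weight_delta(old_weights, new_weights) > max_delta
```

As with record counts, the useful threshold comes from historical run-to-run deltas, not from theory.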
  • 43. Cross-validation because saving a test set is effort ● Trains on X% of the data and tests on Y% ○ Multiple times, switching the samples ● org.apache.spark.ml.tuning has the tools for auto fitting using CV ○ If you're going to use this for auto-tuning please please save a test set ○ Otherwise your models will look awesome and perform like a Ford Pinto (or whatever a crappy car is here. Maybe a Renault Reliant?) Jonathan Kotta
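The train-on-X%/test-on-Y% loop can be sketched by hand (the Spark version lives in org.apache.spark.ml.tuning). `fit` and `score` are caller-supplied stand-ins here, and - per the slide - keep a separate test set out of this loop entirely if you auto-tune on the scores.

```python
import random

# Hand-rolled k-fold cross-validation sketch: shuffle, split into k folds,
# train on k-1 folds and score on the held-out fold, k times.
def kfold_scores(data, k, fit, score, seed=42):
    rng = random.Random(seed)
    rows = list(data)
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(score(fit(train), held_out))
    return scores
```

A validation rule can then compare the mean (and spread) of these scores against previous runs.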
  • 44. False sense of security: ● A/B test please, even if CV says many many $s ● Rank based things can have training bias from previous orders ● Non-displayed options: unlikely to be chosen ● Sometimes we can find previous formulaic corrections ● Sometimes we can "experimentally" determine ● Other times we just hope it's better than nothing ● Try and make sure your ML isn't evil or re-encoding human biases but stronger
  • 45. The state of serving is generally a mess ● If it's not ML models it can be better ○ Reports for everyone! ○ Or database updates for everyone! ● Big challenge: when something goes wrong - how do I fix it? ○ Something will go wrong eventually - do you have an old snapshot you can roll back to quickly? ● One project which aims to improve this for ML is KubeFlow ○ Goal is unifying training & serving experiences ○ Despite the name, targeting more than just TensorFlow ○ Doesn't work with Spark yet, but it's on my TODO list.
  • 46. Updating your model ● The real world changes ● Online learning (streaming) is super cool, but hard to version ○ Common kappa-like arch and then revert to checkpoint ○ Slowly degrading models, oh my! ● Iterative batches: automatically train on new data, deploy model, and A/B test ● But A/B testing isn't enough -- bad data can result in wrong or even illegal results (ask me after a bud light lime) Jennifer C.
  • 47. Some ending notes ● Your validation rules don't have to be perfect ○ But they should be good enough that they alert infrequently ● You should have a way for the human operator to override. ● Just like tests, try and make your validation rules specific and actionable ○ "# of input rows changed" is not a great message - "table XYZ grew unexpectedly to Y%" is ● While you can use (some of) your tests as a basis for your rules, your rules need tests too ○ e.g. add junk records/pure noise and see if it rejects James Petts
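The "add junk records and see if it rejects" idea could be sketched like this. `validate_batch`, `make_noise`, and the record shape are hypothetical names for illustration.

```python
import random

# Sketch of testing a validation rule itself: the rule should pass clean
# data and fire once pure-noise records are mixed in.
def make_noise(n, seed=0):
    """Generate records that no sane pipeline should accept."""
    rng = random.Random(seed)
    return [("".join(rng.choice("!@#$%") for _ in range(8)), -rng.random())
            for _ in range(n)]

def rule_rejects_noise(validate_batch, good_records, n_noise=100):
    return (validate_batch(good_records)
            and not validate_batch(good_records + make_noise(n_noise)))
```

If `rule_rejects_noise` returns False, either the rule is too lax (noise passed) or too strict (clean data failed) - both worth knowing before production.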
  • 48. Related talks & blog posts ● Testing Spark Best Practices (Spark Summit 2014) ● Every Day I'm Shuffling (Strata 2015) & slides ● Spark and Spark Streaming Unit Testing ● Making Spark Unit Testing With Spark Testing Base ● Testing strategy for Apache Spark jobs ● The BEAM programming guide Interested in OSS (especially Spark)? ● Check out my Twitch & Youtube for livestreams - http://twitch.tv/holdenkarau & https://www.youtube.com/user/holdenkarau Becky Lai
  • 49. Related packages ● spark-testing-base: https://github.com/holdenk/spark-testing-base ● sscheck: https://github.com/juanrh/sscheck ● spark-validator: https://github.com/holdenk/spark-validator *Proof of concept, do not actually use* ● spark-perf - https://github.com/databricks/spark-perf ● spark-integration-tests - https://github.com/databricks/spark-integration-tests ● scalacheck - https://www.scalacheck.org/ Becky Lai
  • 50. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action High Performance Spark Learning PySpark
  • 51. High Performance Spark! Available today, not a lot on testing and almost nothing on validation, but that should not stop you from buying several copies (if you have an expense account). Cats love it! Amazon sells it: http://bit.ly/hkHighPerfSpark :D
  • 52. Sign up for the mailing list @ http://www.distributedcomputing4kids.com
  • 53. And some upcoming talks: ● November ○ Big Data Spain again (tomorrow @ 16:10) ○ Scale By The Bay - San Francisco ● December ○ ScalaX - London ● January ○ Data Day Texas ● February ○ TBD ● March ○ Strata San Francisco
  • 54. Cat wave photo by Quinn Dombrowski k thnx bye! (or questions…) If you want to fill out the survey: http://bit.ly/holdenTestingSpark I will use the updated results & give the talk again the next time Spark adds a major feature. Give feedback on this presentation http://bit.ly/holdenTalkFeedback Have questions? - sli.do: SL18 - Union Grand EF I'm sadly heading out to Spark Summit right after this but e-mail me: holden@pigscanfly.ca
  • 55. And including spark-testing-base up to spark 2.3.1 sbt: "com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "test" Maven: <dependency> <groupId>com.holdenkarau</groupId> <artifactId>spark-testing-base_2.11</artifactId> <version>${spark.version}_0.10.0</version> <scope>test</scope> </dependency> Vladimir Pustovit
  • 56. Other options for generating data: ● mapPartitions + Random + custom code ● RandomRDDs in mllib ○ Uniform, Normal, Poisson, Exponential, Gamma, logNormal & Vector versions ○ Different type: implement the RandomDataGenerator interface ● Random
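The same idea can be sketched with the stdlib alone: exponentially distributed "zip codes" and normally distributed values, mirroring the RandomRDDs recipe without a cluster. The function name and distribution parameters are illustrative assumptions.

```python
import random

# Generate (zip_code, value) test rows: exponential-ish integer zip codes
# (mean ~1000) and standard-normal values.
def generate_rows(n, seed=7):
    rng = random.Random(seed)
    return [(str(int(rng.expovariate(1 / 1000.0))), rng.gauss(0.0, 1.0))
            for _ in range(n)]
```

Seeding the generator keeps the "random" test data reproducible between runs, which matters once a validation rule fires and you need to replay the job.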
  • 57. RandomRDDs val zipRDD = RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows).map(_.toInt.toString) val valuesRDD = RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols).repartition(zipRDD.partitions.size) val keyRDD = sc.parallelize(1L.to(rows), zipRDD.getNumPartitions) keyRDD.zipPartitions(zipRDD, valuesRDD){ (i1, i2, i3) => new Iterator[(Long, String, Vector)] { ...
  • 58. Testing libraries: ● Spark unit testing ○ spark-testing-base - https://github.com/holdenk/spark-testing-base ○ sscheck - https://github.com/juanrh/sscheck ● Simplified unit testing ("business logic only") ○ kontextfrei - https://github.com/dwestheide/kontextfrei * ● Integration testing ○ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests ● Performance ○ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf ● Spark job validation ○ spark-validator - https://github.com/holdenk/spark-validator * Photo by Mike Mozart *Early stage or work-in-progress, or proof of concept
  • 59. Let's talk about local mode ● It's way better than you would expect* ● It does its best to try and catch serialization errors ● It's still not the same as running on a "real" cluster ● Especially since in local mode, parallelize and collect might be fine Photo by: Bev Sykes
  • 60. Options beyond local mode: ● Just point at your existing cluster (set master) ● Start one with your shell scripts & change the master ○ Really easy way to plug into existing integration testing ● spark-docker - hack in our own tests ● YarnMiniCluster ○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ○ In Spark Testing Base extend SharedMiniCluster ■ Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+) Photo by Richard Masoner
  • 61. Integration testing - docker is awesome ● Spark-docker, kafka-docker, etc. ○ Not always super up to date sadly - if you are on the last stable release A-OK, if you build from master - sad pandas ● Or check out JuJu Charms (from Canonical) - https://jujucharms.com/ ○ Makes it easy to deploy a bunch of docker containers together, configured in a reasonable way.
  • 62. Setting up integration on Yarn/Mesos ● So lucky! ● You can write your tests in the same way as before - just read from your test data sources ● Missing a data source? ○ Can you sample it or fake it using the techniques from before? ○ If so - do that and save the result to your integration environment ○ If not… well, good luck ● Need streaming integration? ○ You will probably need a second Spark (or other) job to generate the test data
  • 63. "Business logic" only test w/kontextfrei import com.danielwestheide.kontextfrei.DCollectionOps trait UsersByPopularityProperties[DColl[_]] extends BaseSpec[DColl] { import DCollectionOps.Imports._ property("Each user appears only once") { forAll { starredEvents: List[RepoStarred] => val result = logic.usersByPopularity(unit(starredEvents)).collect().toList result.distinct mustEqual result } } … (continued in example/src/test/scala/com/danielwestheide/kontextfrei/example/)
  • 64. Generating Data with Spark import org.apache.spark.mllib.random.RandomRDDs ... RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows) RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols)