This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability, both for batch and streaming processing pipelines. We will primarily focus on testing for the purpose of development productivity and product iteration speed, but briefly also cover data quality testing.
2. www.mapflat.com
Who's talking
● Swedish Institute of Computer Science (test & debug tools)
● Sun Microsystems (large machine verification)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup) (data integrations)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling, productivity)
● Schibsted Products & Tech (data processing & modelling)
● Mapflat (independent data engineering consultant)
3. www.mapflat.com
Agenda
● Data applications from a test perspective
● Testing batch processing products
● Testing stream processing products
● Data quality testing
Main focus is functional, regression testing
Prerequisites: Backend dev testing, basic data experience, reading Scala
5. www.mapflat.com
Workflow orchestrator
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
○ resources are available
● Backfill for previous failures
● DSL describes DAG
○ Includes ingress & egress
Luigi / Airflow
6. www.mapflat.com
Stream pipeline anatomy
● Unified log - bus of all business events
● Pub/sub with history
○ Kafka
○ AWS Kinesis, Google Pub/Sub, Azure Event Hub
● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovery from link failures
● Replay on job bug fix
[Diagram: apps (Ads, Search, Feed) publish events to streams; jobs consume streams and publish results to new streams; downstream jobs feed the data lake and business intelligence.]
8. www.mapflat.com
Online failure vs Offline failure
Online: 10000s of customers, imprecise feedback
Need low probability => proactive prevention => low ROI
Offline: 10s of employees, precise feedback
Ok with medium probability => reactive repair => high ROI
Risk = probability * impact
9. www.mapflat.com
Value of testing
For data-centric (batch) applications, in this order:
● Productivity
○ Move fast without breaking things
● Fast experimentation
○ 10% good ideas, 90% neutral or bad
● Data quality
○ Challenging, and more important than technical quality
● Technical quality
○ Technical failure =>
■ Operations hassle.
■ Stale data. Often acceptable.
■ No customers were harmed. Usually.
Significant harm to external customers is rare enough to be handled reactively - driven by data on bug frequency.
11. www.mapflat.com
Data pipeline properties
● Output = function(input, code)
○ No external factors => deterministic
○ Easy to craft input, even in large tests
○ Perfect for test!
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
12. www.mapflat.com
Potential test scopes
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Choose stable interfaces
Each scope has a cost
14. www.mapflat.com
Scopes to avoid
● Unit/Component
○ Few stable interfaces
○ Not necessary
○ Avoid mocks, DI rituals
● Full system, including client
○ Client automation fragile
“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
15. www.mapflat.com
Testing single batch job
Standard Scalatest harness:
1. Generate input (file://test_input/)
2. Run job in local mode
3. Verify output (file://test_output/)
Runs well in CI / from IDE
16. www.mapflat.com
Testing batch pipelines - two options
A: Standard Scalatest harness
1. Generate input (file://test_input/)
2. Run a custom test job that chains the sequence of jobs
3. Verify output (file://test_output/)
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance
B: Customised workflow manager setup
+ Tests workflow logic
+ More authentic
- Workflow manager setup for testability
- Difficult to debug
- Dataset handling with Python
● Both can be extended with ingress (Kafka) and egress (DBs)
17. www.mapflat.com
Test input data is code
● Tied to production code
● Should live in version control
● Duplication to eliminate
● Generation gives more power
○ Randomness can be desirable
○ Larger scale
○ Statistical distribution
● Maintainable
> git add src/resources/test-data.json

userInputFile.overwrite(
  Json.toJson(
    userTemplate.copy(
      age = 27,
      product = "premium",
      country = "NO"))
    .toString)
Prod data: beware of privacy!
18. www.mapflat.com
Putting batch tests together
class CleanUserTest extends FlatSpec {
val inFile = (tmpDir / "test_user.json")
val outFile = (tmpDir / "test_output.json")
def writeInput(users: Seq[User]) = inFile.overwrite(users.map(u => Json.toJson(u).toString).mkString("\n"))
def readOutput() = outFile.contentAsString.split("\n").map(line => Json.fromJson[User](Json.parse(line)).get).toSeq
def runJob(input: Seq[User]): Seq[User] = {
writeInput(input)
val args = Seq(
"--user", inFile.path.toString,
"--output", outFile.path.toString)
CleanUserJob.main(args) // Works for some processing frameworks, e.g. Spark
readOutput()
}
"Clean users" should "remain untouched" in {
val output = runJob(Seq(TestInput.userTemplate))
assert(output === Seq(TestInput.userTemplate))
}
}
19. www.mapflat.com
Test oracle
class CleanUserTest extends FlatSpec {
// ...
"Lower case country" should "translate to upper case" in {
val input = Seq(userTemplate.copy(country = "se"))
val output = runJob(input)
assert(output.size === 1)
assert(output.head.country === "SE")
// Equivalent check with a Monocle lens:
val lens = GenLens[User](_.country)
assert(lens.get(output.head) === "SE")
}
}
● Avoid full record comparisons
○ Except for a few tests
● Examine only fields relevant to test
○ Or tests break for unrelated changes
● Lenses are your friend
○ JSON: JsonPath (Java), Play (Scala), ...
○ Case classes: Monocle, Shapeless,
Scalaz, Sauron, Quicklens
20. www.mapflat.com
class CleanUserTest extends FlatSpec {
def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
writeInput(input)
val args = Seq("--user", inFile.path.toString,
"--output", outFile.path.toString)
CleanUserJob.main(args)
// Ugly way to expose counters
val counters = Map(
"invalid" -> CleanUserJob.invalidCount,
"upper-cased" -> CleanUserJob.upperCased)
(readOutput(), counters)
}
"Lower case country" should "translate to upper case" in {
val input = Seq(userTemplate.copy(country = "se"))
val (output, counters) = runJob(input)
assert(output.size === 1)
assert(output.head.country === "SE")
assert(counters("upper-cased") === 1)
assert((counters - "upper-cased").forall(_._2 == 0))
}
}
Inspecting counters
● Counters (accumulators in Spark) are
critical for monitoring and quality
● Test that expected counters are bumped
○ But no other counters
21. www.mapflat.com
class CleanUserTest extends FlatSpec {
def validateInvariants(
input: Seq[User],
output: Seq[User],
counters: Map[String, Int]) = {
output.foreach(recordInvariant)
// Dataset invariants
assert(input.size === output.size)
assert(input.size >= counters("upper-cased"))
}
def recordInvariant(u: User) =
assert(u.country.size === 2)
def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
// Same as before
...
validateInvariants(input, output, counters)
(output, counters)
}
// Test case is the same
}
Invariants
● Some things are true
○ For every record
○ For every job invocation
● Not necessarily in production
○ Reuse invariant predicates as quality
probes
22. www.mapflat.com
Streaming SUT, example harness
● Scalatest + Spark Streaming jobs
● Kafka (topics) and a DB run as fixture containers in Docker
● Test input is pushed to Kafka; the test oracle polls the DB via a polling service
● Driven from the IDE / Gradle, with IDE, CI, and debug integration
● Assembled as a JVM monolith for test
23. www.mapflat.com
Test lifecycle
1. Initiate from IDE / build system
2. Start fixture containers
3. Await fixture ready
4. Allocate test case resources
5. Start jobs
6. Push input data
7. While (!done && !timeout) {
pollDatabase()
sleep(1ms)
}
8. While (moreTests) { Goto 4 }
9. Tear down fixture
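The poll-until-done loop in step 7 can be sketched as a small helper. This is an illustrative sketch, not from the deck: `pollDatabase`, the row type, and the timeout values are all assumptions.

```scala
// Sketch of the poll-until-done loop (step 7), assuming the test
// oracle can repeatedly query the output database for rows written
// so far. All names here are hypothetical.
object PollLoop {
  def awaitRows(
      pollDatabase: () => Seq[String], // rows observed so far (assumed)
      expected: Set[String],
      timeoutMs: Long = 30000,
      intervalMs: Long = 10): Seq[String] = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var rows = pollDatabase()
    // Loop until all expected rows are visible or the timeout expires
    while (!expected.subsetOf(rows.toSet) &&
           System.currentTimeMillis() < deadline) {
      Thread.sleep(intervalMs)
      rows = pollDatabase()
    }
    rows
  }
}
```

A test case would call `awaitRows` after pushing input (step 6) and then run its assertions on the returned rows, keeping the sleep interval short so the suite stays fast.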
24. www.mapflat.com
Testing for absence
1. Send test input
2. Send dummy input
3. Await effect of dummy input
4. Verify test output absence
Assumes in-order semantics
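The four steps above can be sketched as follows. This is a hedged sketch under the stated in-order assumption; `send`, `poll`, and the record type are hypothetical stand-ins for the harness's Kafka producer and database oracle.

```scala
// Sketch of testing for absence: push a dummy record after the test
// input; once the dummy's effect is visible, in-order processing
// implies the test input was also processed, so its absence in the
// output is meaningful. All names are illustrative.
object AbsenceTest {
  def outputAbsent(
      send: String => Unit,       // hypothetical Kafka producer
      poll: () => Seq[String],    // hypothetical DB oracle
      testInput: String,
      dummy: String,
      timeoutMs: Long = 30000): Boolean = {
    send(testInput)               // 1. Send test input
    send(dummy)                   // 2. Send dummy input
    val deadline = System.currentTimeMillis() + timeoutMs
    // 3. Await effect of dummy input
    while (!poll().contains(dummy) &&
           System.currentTimeMillis() < deadline)
      Thread.sleep(10)
    // 4. Verify test output absence
    !poll().contains(testInput)
  }
}
```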
25. www.mapflat.com
Testing in the process
1. Integration tests for happy paths.
2. Likely odd cases, e.g.
○ Empty inputs
○ Missing matching records, e.g. in joins
○ Odd encodings
3. Dark corners
○ Motivated for externally generated input
● Minimal test cases triggering desired paths
● On production failure:
○ Debug in production
○ Create minimal test case
○ Add to regression suite
● Cannot trigger bug in test?
○ Consider new scope? Significantly
different from old scopes.
○ Staging pipelines?
○ Automatic code inspection?
26. www.mapflat.com
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
27. www.mapflat.com
Quality metrics
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
● Workflow quality predicate for
consumption
○ Depends on consumer use case
[Diagram: Hadoop/Spark counters and a dedicated quality assessment job produce a tiny quality metadataset, stored in a DB.]
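A workflow quality predicate over such a metadataset could look like the sketch below. The metric fields and thresholds are hypothetical; as the slide notes, the right thresholds depend on the consumer use case.

```scala
// Hypothetical quality predicate evaluated by the workflow before a
// consumer job is allowed to read the dataset. Field names and
// threshold defaults are assumptions for illustration.
case class QualityMetrics(records: Long, invalidFraction: Double)

object QualityGate {
  def fitForConsumption(
      m: QualityMetrics,
      minRecords: Long = 1000,          // guard against empty/truncated output
      maxInvalidFraction: Double = 0.01 // tolerate up to 1% odd records
  ): Boolean =
    m.records >= minRecords && m.invalidFraction <= maxInvalidFraction
}
```

Keeping the predicate as a pure function makes it reusable both as a workflow precondition and as an invariant in the regression tests.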
28. www.mapflat.com
Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for consuming dataset?
● Push aggregates to DB
○ Standard ops: monitor, alert
29. www.mapflat.com
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers
● Document “golden path”
30. www.mapflat.com
Test code = not yet another program
● Shared between data engineers & QA engineers
○ Best result with mutual respect and collaboration
● Readability >> abstractions
○ Create (Scala, Python) DSLs
● Some software bad practices are benign
○ Duplication
○ Inconsistencies
○ Lazy error handling
31. www.mapflat.com
Honey traps
The cloud
● PaaS components do not work locally
○ Cloud providers should provide fake
implementations
○ Exceptions: Kubernetes, Cloud SQL,
Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
○ Distribute access tokens, etc
○ Pay $ or $$$. Much better with per-second
billing.
Vendor batch test frameworks
● Spark, Scalding, Crunch variants
● Seam == internal data structure
○ Omits I/O - a common bug source
● Vendor lock-in - when switching batch framework:
○ Need tests for protection
○ Test rewrite is unnecessary burden
32. www.mapflat.com
Top anti-patterns
1. Test as afterthought or in production. Data processing applications are well suited for testing!
2. Developer testing requires cluster
3. Static test input in version control
4. Exact expected output test oracle
5. Unit testing volatile interfaces
6. Using mocks & dependency injection
7. Tool-specific test framework - vendor lock-in
8. Using wall clock time
9. Embedded fixture components, e.g. in-memory Kafka/Cassandra/RDBMS
10. Performance testing (low ROI for offline)
11. Data quality not measured, not monitored
33. www.mapflat.com
Thank you. Questions?
Credits:
Øyvind Løkling, Schibsted Media Group
Images:
Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental
Science (ian.umces.edu/imagelibrary/).