Talk at Philly ETE Apr 28 2017
We will talk about Spotify’s story of migrating our big data infrastructure to Google Cloud. Over the past year or so we moved away from maintaining our own 2,500+ node Hadoop cluster to managed services in the cloud. We replaced two key components in our data processing stack, Hive and Scalding, with BigQuery and Scio, and are now able to iterate much faster. We will focus on the technical aspects of Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and how it changed the way we process data.
5. Hadoop at Spotify
● On-premise → Amazon EMR → On-premise
● ~2,500 nodes, largest in EU
● 100PB+ Disk, 100TB+ RAM
● 60TB+ per day log ingestion
● 20K+ jobs per day
6. Data Processing
● Luigi, Python M/R, circa 2011
● Scalding, Spark, circa 2013
● 200+ Scala users, 1K+ unique jobs
● Storm for real-time
● Hive for ad-hoc analysis
8. Hive
● Row-oriented Avro/TSV → full file scans
● Translates to M/R → disk IO between steps
● Slow and resource hog
● Immature Parquet support
● Weak UDF and programmatic API
10. Typical Workloads
Query Type                              Hive/Hadoop   BigQuery
KPI by specific ad-hoc parameter        ~1,200s       ~10-20s
FB audience list for social targeting   ~4,000s       ~15-30s
Top tracks by age, gender & market      ~18,000s      ~500s
12. ● Scalding for most pipelines
○ Key metrics, Discover Weekly, A/B testing analysis, ...
○ Stable and proven at Twitter, Stripe, Etsy, eBay, ...
○ M/R disk IO overhead, Tez not ready
○ No streaming (Summingbird?)
13. ● Spark
○ Interactive ML, user behavior modelling & prediction...
○ Batch, streaming, notebook, SQL and ML
○ Separate APIs for batch and streaming
○ Many operation modes, hard to tune and scale
14. ● Storm
○ Ads targeting, new user recommendations
○ Separate cluster and ops
○ Low level API; missing windowing and join primitives
15. ● Cluster management
● Multi-tenancy, resource utilization and contention
● 3 sets of separate APIs
● Missing Google Cloud connectors
17. The Evolution of Apache Beam
[Diagram: Google internal systems — MapReduce, Colossus, BigTable, Dremel, Flume, Megastore, Spanner, PubSub, Millwheel — evolving into Apache Beam and Google Cloud Dataflow]
18. What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines -- starting with Java
3. Runners for existing distributed processing backends
○ Apache Flink (thanks to data Artisans)
○ Apache Spark (thanks to Cloudera and PayPal)
○ Google Cloud Dataflow (fully managed service)
○ Local runner for testing
19. The Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
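The "Where in event time" question can be pictured with a plain-Scala sketch (standing in for Beam's fixed windows; all names here are illustrative, not the Beam API): assign timestamped events to fixed windows and sum per window.

```scala
// Illustrative sketch: events carry an event-time timestamp, and each is
// assigned to the fixed window containing it; results are computed per window.
case class Event(timestampMs: Long, value: Int)

def fixedWindows(events: Seq[Event], sizeMs: Long): Map[Long, Int] =
  events
    .groupBy(e => e.timestampMs / sizeMs * sizeMs) // window start time
    .map { case (start, es) => start -> es.map(_.value).sum }

val events = Seq(Event(1000, 1), Event(61000, 2), Event(62000, 3))
val sums = fixedWindows(events, 60000L)
// one-minute windows: the first event lands in window 0,
// the other two in window 60000
```

In Beam terms, "When in processing time" and "How refinements relate" then decide when each window's sum is emitted and whether later firings replace or accumulate earlier ones.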
20. Customizing What / Where / When / How
1. Classic batch
2. Windowed batch
3. Streaming
4. Streaming + accumulation
21. The Apache Beam Vision
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: Beam Java, Beam Python and other language SDKs feed the Beam pipeline-construction model; Fn runners execute pipelines on Apache Flink, Apache Spark or Cloud Dataflow]
22. Data model
Spark
● RDD for batch, DStream for streaming
● Two sets of APIs
● Explicit caching semantics
Beam / Cloud Dataflow
● PCollection for batch and streaming
● One unified API
● Windowed and timestamped values
23. Execution
Spark
● One driver, n executors
● Dynamic execution from driver
● Transforms and actions
Beam / Cloud Dataflow
● No master
● Static execution planning
● Transforms only, no actions
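The "transforms only, no actions" point can be pictured with a toy pipeline builder in plain Scala (purely illustrative, not Beam's API): transforms are only recorded, and nothing executes until the whole plan is run, mirroring static execution planning.

```scala
// Illustrative mini-pipeline: map/filter build up a plan; no data moves
// until run() is called, so the execution plan is fully known up front.
final case class MiniPipeline[T](run: () => Seq[T]) {
  def map[U](f: T => U): MiniPipeline[U]       = MiniPipeline(() => run().map(f))
  def filter(p: T => Boolean): MiniPipeline[T] = MiniPipeline(() => run().filter(p))
}
object MiniPipeline {
  def from[T](xs: Seq[T]): MiniPipeline[T] = MiniPipeline(() => xs)
}

val p = MiniPipeline.from(1 to 10).map(_ * 2).filter(_ % 3 == 0)
// Nothing has executed yet; p is just a static plan.
val result = p.run()
```

Spark's `collect()`-style actions, by contrast, pull intermediate results back to a live driver, which is why Spark needs one and Beam does not.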
28. Beam and Scala
● High level DSL
● Familiarity with Scalding, Spark and Flink
● Functional programming natural fit for data
● Numerical libraries - Breeze, Algebird
● Macros for code generation
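Algebird's appeal for data pipelines is that an associative combine operation lets a runner aggregate partial results in any order and in parallel. A minimal plain-Scala sketch with a hand-rolled Semigroup (not Algebird's actual API) that computes max and min in a single pass:

```scala
// Illustrative Semigroup: an associative "plus" over pairs, combining
// (runningMax, runningMin) — the shape of aggregations Algebird provides.
trait Semigroup[T] { def plus(a: T, b: T): T }

implicit val maxMin: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
  def plus(a: (Int, Int), b: (Int, Int)) = (a._1 max b._1, a._2 min b._2)
}

def sumAll[T](xs: Seq[T])(implicit sg: Semigroup[T]): T = xs.reduce(sg.plus)

// Feed each value in as (v, v); one pass yields both max and min.
val stats = sumAll(Seq((3, 3), (7, 7), (5, 5)))
```

Because `plus` is associative, the same code is correct whether the reduce happens on one worker or is split across many.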
36. Type safe BigQuery
Macro generated case classes, schemas and converters
@BigQuery.fromQuery("SELECT id, name FROM [users] WHERE ...")
class User // look mom no code!
sc.typedBigQuery[User]().map(u => (u.id, u.name))
@BigQuery.toTable
case class Score(id: String, score: Double)
data.map(kv => Score(kv._1, kv._2)).saveAsTypedBigQuery("table")
37. BigQuery + Scio
● Best of both worlds
● BigQuery for slicing and dicing huge datasets
● Scala for custom logic
● Seamless integration
38. REPL
$ scio-repl
Welcome to
[Scio ASCII-art logo] version 0.3.0-beta3
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
Using 'scio-test' as your BigQuery project.
BigQuery client available as 'bq'
Scio context available as 'sc'
scio> _
Available in github.com/spotify/homebrew-public
39. Future based orchestration
// import scala.concurrent.Await
// import scala.concurrent.duration.Duration
// Job 1
val f: Future[Tap[String]] = data1.saveAsTextFile("output")
sc1.close() // submit job
val t: Tap[String] = Await.result(f, Duration.Inf)
t.value.foreach(println) // Iterator[String]
// Job 2
val sc2 = ScioContext(options)
val data2: SCollection[String] = t.open(sc2)
44. Adoption
● At Spotify
○ 200+ users, 400+ production pipelines (up from ~70 six months ago)
○ Most of them new to Scala and Scio
○ Both batch and streaming jobs
● Externally
○ ~10 companies, several fairly large ones
45. Collaboration
● Open source model
○ Discussion on Slack & Gitter
○ Mailing lists
○ Issue tracking on public Github
○ Community driven feature development
e.g. type safe BigQuery, Bigtable, Datastore, Protobuf
48. Release Radar
● 50 n1-standard-1 workers
● 1 core 3.75GB RAM
● 130GB in - Avro & Bigtable
● 130GB out x 2 - Bigtable in US+EU
● 110M Bigtable mutations
● 120 LOC
49. Fan Insights
● Listener stats
[artist|track] ×
[context|geography|demography] ×
[day|week|month]
● BigQuery, GCS, Datastore
● TBs daily
● 150+ Java jobs → ~10 Scio jobs
50. Master Metadata
● n1-standard-1 workers
● 1 core 3.75GB RAM
● Autoscaling 2-35 workers
● 26 Avro sources - artist, album, track, disc, cover art, ...
● 120GB out, 70M records
● 200 LOC vs original Java 600 LOC
51.
52. BigDiffy
● Pairwise field-level statistical diff
● Diff 2 SCollection[T] given keyFn: T => String
● T: Avro, BigQuery, Protobuf
● Leaf field Δ - numeric, string, vector
● Δ statistics - min, max, μ, σ, etc.
● Non-deterministic fields
○ ignore field
○ treat "repeated" field as unordered list
Part of github.com/spotify/ratatool
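A leaf-field delta can be sketched in plain Scala; the `Delta` shape and `diffy` function below are illustrative, not ratatool's actual types. Records are flattened to named leaf fields, and only fields whose values differ produce a delta.

```scala
// Illustrative leaf-field diff: records as Map[fieldName -> numericValue];
// emit a Delta only for fields present on both sides with different values.
case class Delta(field: String, left: Double, right: Double) {
  def delta: Double = math.abs(left - right)
}

def diffy(l: Map[String, Double], r: Map[String, Double]): Seq[Delta] =
  (l.keySet ++ r.keySet).toSeq.sorted.flatMap { k =>
    (l.get(k), r.get(k)) match {
      case (Some(a), Some(b)) if a != b => Some(Delta(k, a, b))
      case _                            => None
    }
  }

val ds = diffy(Map("score" -> 1.0, "rank" -> 2.0),
               Map("score" -> 1.5, "rank" -> 2.0))
// only the changed leaf field ("score") produces a delta
```

Aggregating `delta` values per field across all keys then yields the min/max/μ/σ statistics described above.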
53. Dataset Diff
● Diff stats
○ Global: # of SAME, DIFF, MISSING LHS/RHS
○ Key: key → SAME, DIFF, MISSING LHS/RHS
○ Field: field → min, max, μ, σ, etc.
● Use cases
○ Validating pipeline migration
○ Sanity checking ML models
54. Pairwise field-level deltas
val lKeyed = lhs.keyBy(keyFn)
val rKeyed = rhs.keyBy(keyFn)
val deltas = (lKeyed outerJoin rKeyed).map { case (k, (lOpt, rOpt)) =>
(lOpt, rOpt) match {
case (Some(l), Some(r)) =>
val ds = diffy(l, r) // Seq[Delta]
val dt = if (ds.isEmpty) SAME else DIFFERENT
(k, (ds, dt))
case (_, _) =>
val dt = if (lOpt.isDefined) MISSING_RHS else MISSING_LHS
(k, (Nil, dt))
}
}
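The same join-and-classify logic can be exercised on plain Scala Maps as a self-contained sketch (the SAME/DIFFERENT/MISSING names mirror the slide; this is not the ratatool code):

```scala
// Illustrative classification of a full outer join of two keyed datasets.
sealed trait DiffType
case object SAME        extends DiffType
case object DIFFERENT   extends DiffType
case object MISSING_LHS extends DiffType
case object MISSING_RHS extends DiffType

def classify[K, V](lhs: Map[K, V], rhs: Map[K, V]): Map[K, DiffType] =
  (lhs.keySet ++ rhs.keySet).map { k =>
    k -> ((lhs.get(k), rhs.get(k)) match {
      case (Some(l), Some(r)) => if (l == r) SAME else DIFFERENT
      case (Some(_), None)    => MISSING_RHS
      case (None, Some(_))    => MISSING_LHS
    })
  }.toMap

val res = classify(Map("a" -> 1, "b" -> 2), Map("a" -> 1, "b" -> 3, "c" -> 4))
// "a" matches, "b" differs, "c" exists only on the right-hand side
```

Counting the classifications per key gives the global SAME/DIFF/MISSING stats from slide 53.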