Data processing at Spotify using Scio

Data processing @Spotify
using Scio
Julien Tournay | Scala Matsuri 2019

about:jto
− 1.5 years at Spotify (🇸🇪 Stockholm)
− Data engineer at #flatmap (Data Infrastructure)

about:spotify
− Audio streaming subscription service
− 100M+ Subscribers
− 217M+ Monthly Active Users
− 50M+ Songs
− 3B+ Playlists
− 79 Markets
https://newsroom.spotify.com/company-info/
Spotify is an audio streaming service with 100 million users.

scio: use cases
‒ Discover Weekly
‒ Release Radar
‒ Daily Mixes
‒ Your 2018 Wrapped
‒ Fan insights
‒ ...
Scio use cases: personalized playlists updated every week, new releases, recommended tracks, and so on.

scalability
− Volume of data
− Number of datasets
− Number of data-engineers / data-scientists
Scalability along several dimensions: data volume, number of datasets, number of data engineers, and so on.

challenges
‒ Scheduling
‒ Orchestration
‒ Processing (Batch & streaming)
‒ Testing
‒ Storage
‒ 💰 (runtime & storage)
The immediate challenges include scheduling, testing, storage, and budget.

challenges
‒ Scheduling
‒ Orchestration
‒ Lineage
‒ Permissions
‒ Processing (Batch & streaming)
‒ Encryption (GDPR)
‒ Monitoring
‒ Data quality
‒ Testing
‒ Latency
‒ Incident handling
‒ Storage
‒ Discovery
‒ Lifecycle
‒ Atomicity
‒ 💰💰💰 (runtime, storage, incidents, lateness, skewness, productivity, ...)
Costs grow along with the challenges, such as encryption (GDPR compliance) and monitoring.

DATA PROCESSING is HARD & EXPENSIVE

data-processing @spotify
☑ It works
☑ It works reliably
☐ It works reliably and efficiently
☐ It works reliably, efficiently and easily
We are currently at the "works reliably" stage; the goals are "efficiently" and "easily".

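about:scio
− A Scala API for data processing
− Based on Apache Beam
− Unified batch and streaming
− Open source (Apache v2.0)
− Runs on Dataflow, Spark, Flink, ...
https://github.com/spotify/scio

import com.spotify.scio._

// Word count: split each line on whitespace, then count each distinct word.
// `sourcePath` and `target` are assumed to be defined by the caller.
val sc = ScioContext()
sc.textFile(sourcePath)
  .flatMap { _
    .split("\\s+")
    .filter(_.nonEmpty)
  }
  .countByValue
  .saveAsTextFile(target)
sc.close()

A Scala API for data processing, based on Beam. It unifies batch and stream processing.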
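
For intuition, the word count on this slide can be mirrored on an in-memory Scala collection. This is only a local sketch and not Scio code: `WordCountSketch` and `wordCount` are hypothetical names, and `groupBy` stands in for the role that `countByValue` plays in the distributed pipeline.

```scala
object WordCountSketch {
  // Split every line on runs of whitespace, drop empty tokens,
  // then count how many times each distinct word occurs.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(line => line.split("\\s+").filter(_.nonEmpty).toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }
}
```

Note the escaped regular expression `"\\s+"`: an unescaped `"s+"` would split on the letter "s" instead of on whitespace.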

Scio
− Good documentation
− Predictability
− Performant
− Productivity
− Type safety
Well documented. High predictability, runtime performance, and productivity. Type safe.

Scala @spotify
− Scala for Data engineering
− Scala Center advisory board member
− Open-source :)
https://github.com/spotify/
Scala initiatives at Spotify: data engineering, the Scala Center, open source, and more.

Scio @Spotify
− 350+ users
(Data engineers, Data Scientists, Backend engineers)
− 1000s of unique production jobs
− Batch and streaming
More than 350 internal Scio users and thousands of jobs in production, both batch and streaming.

Efficiency: serialization

Scio 0.7.0
− Released Jan, 18
− New static Coders
− Redesigned BQ client
− Refactored IO
− Better modularity
− Bugfixes
− etc.
Scio 0.7.0 added static Coders, redesigned the BigQuery client, and refactored IO.

Coders in 0.7
− From Kryo to type-safe coders
− Predictability
− Testability
− Performance
− Granularity
− Automatic derivation at compile time
− Fallback to Kryo
Moving to type-safe coders improved predictability. Coders are derived automatically at compile time.

0.6 vs 0.7
0.6
def map[U: ClassTag](f: T => U): SCollection[U]
Unsafe, Kryo based, runtime black magic
0.7
def map[U: Coder](f: T => U): SCollection[U]
Safer, simpler, compile time black magic
(mostly) automated migration using scalafix
In 0.6, Kryo was runtime black magic. 0.7 uses safer, compile-time black magic.
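
To illustrate what a `Coder` context bound buys at compile time, here is a minimal type class sketch in plain Scala. It is not Scio's `Coder`: the trait, its token-list encoding, and the names `CoderSketch`, `roundTrip`, and `mapEncoded` are all assumptions made for illustration. The point is only that the compiler assembles the instance for a nested tuple from smaller instances, and that a type without an instance fails to compile instead of failing at runtime.

```scala
object CoderSketch {
  // Toy coder (hypothetical): encodes a value as a list of string tokens and
  // decodes from the front of a token list, returning the unread remainder.
  trait Coder[T] {
    def encode(value: T): List[String]
    def decode(tokens: List[String]): (T, List[String])
  }

  object Coder {
    implicit val intCoder: Coder[Int] = new Coder[Int] {
      def encode(value: Int): List[String] = List(value.toString)
      def decode(tokens: List[String]): (Int, List[String]) =
        (tokens.head.toInt, tokens.tail)
    }

    implicit val stringCoder: Coder[String] = new Coder[String] {
      def encode(value: String): List[String] = List(value)
      def decode(tokens: List[String]): (String, List[String]) =
        (tokens.head, tokens.tail)
    }

    // Derived instance: a pair coder is built from the coders of its parts.
    // The compiler resolves this recursively, at compile time.
    implicit def pairCoder[A, B](implicit ca: Coder[A], cb: Coder[B]): Coder[(A, B)] =
      new Coder[(A, B)] {
        def encode(value: (A, B)): List[String] =
          ca.encode(value._1) ++ cb.encode(value._2)
        def decode(tokens: List[String]): ((A, B), List[String]) = {
          val (a, afterA) = ca.decode(tokens)
          val (b, afterB) = cb.decode(afterA)
          ((a, b), afterB)
        }
      }
  }

  // Encode then decode a value with whatever coder the compiler finds.
  def roundTrip[T](value: T)(implicit coder: Coder[T]): T =
    coder.decode(coder.encode(value))._1

  // Same shape as the 0.7 `map` signature: a Coder[U] must exist for the result type.
  def mapEncoded[T, U: Coder](values: Seq[T])(f: T => U): Seq[List[String]] =
    values.map(v => implicitly[Coder[U]].encode(f(v)))
}
```

With a `ClassTag` bound, by contrast, an instance exists for any class, so nothing about serializability is checked until the job runs.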

Anonymisation job
‒ Encrypt all personal data
‒ Each user has unique keys
‒ Runs hourly
https://labs.spotify.com/2018/09/18/scalable-user-privacy/
The anonymisation job encrypts personal data.

Anonymisation optimisation
‒ Replace Kryo with custom coders for Avro’s GenericRecord
‒ Kryo was really inefficient
‒ Only possible in Scio >= 0.7
‒ Scio now has a compile time warning for GenericRecord
Optimising the anonymisation job: the inefficient Kryo was replaced with custom coders, and a compile-time warning is now emitted. Both are results of the 0.7 work.

Anonymisation job cost (▼ 60% on the largest event)

Anonymisation job runtime (in minutes, largest event)

(YMMV but) DO NOT OVERLOOK SERIALIZATION

Efficiency: joins

joins
− Really common use case
− 😎 Large x Small
− 😅 Large x Medium
− 😱 Large x Large(-ish) → 💰💰💰💰💰💰 (Shuffle)
A familiar story: joining huge datasets with each other is expensive.

SMB join
− Sort Merge Bucket join
− Store data bucketed (sharding by key)
− Sort the content of each bucket by key

Bucketing
B (number of buckets) = 3
id (input): 8, 7, 6, 0, 3, 1, 7, 2, 6, 1, 4, 3, 3, 4, 5
id (bucket, sorted): 0, 3, 3, 3, 6, 6
id (bucket, sorted): 1, 1, 4, 4, 7, 7
id (bucket, sorted): 2, 5, 8
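
The numbers in the bucketing figure are consistent with assigning each id to bucket `id % B` and then sorting each bucket. A minimal sketch of that step in plain Scala follows; the modulo assignment is an assumption read off the figure (a real implementation would typically hash the key), and `BucketingSketch` and `bucketize` are hypothetical names.

```scala
object BucketingSketch {
  // Assign each id to bucket `id mod numBuckets`, then sort every bucket by id.
  // Buckets are returned in order 0 until numBuckets; an unused bucket is empty.
  def bucketize(ids: Seq[Int], numBuckets: Int): Vector[Vector[Int]] = {
    val byBucket: Map[Int, Seq[Int]] =
      ids.groupBy(id => Math.floorMod(id, numBuckets))
    Vector.tabulate(numBuckets) { bucket =>
      byBucket.getOrElse(bucket, Seq.empty[Int]).sorted.toVector
    }
  }
}
```

Calling `BucketingSketch.bucketize(Seq(8, 7, 6, 0, 3, 1, 7, 2, 6, 1, 4, 3, 3, 4, 5), 3)` reproduces the three sorted buckets from the figure.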

Joining
L id: 6, 4, 0, 2, 7, 6
R id: 8, 3, 1, 6, 2, 5
Merge join, one pair of buckets at a time:
L: 0, 6, 6 | R: 3, 6
L: 4, 7 | R: 1
L: 2 | R: 2, 5, 8
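
Once both sides are bucketed the same way and each bucket is sorted by key, the join needs no shuffle: matching buckets are merged pairwise in a single linear pass. The sketch below shows that merge step in plain Scala. It assumes both datasets use the same number of buckets and the same bucketing function, and the names `MergeJoinSketch`, `mergeJoin`, and `joinBuckets` are hypothetical; it is not the implementation from the Beam PR.

```scala
object MergeJoinSketch {
  // Inner join of two sequences that are already sorted by key.
  // Advances two cursors; on equal keys, emits the cross product of both runs.
  def mergeJoin[A, B](left: Vector[(Int, A)], right: Vector[(Int, B)]): Vector[(Int, A, B)] = {
    val out = Vector.newBuilder[(Int, A, B)]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val lk = left(i)._1
      val rk = right(j)._1
      if (lk < rk) i += 1
      else if (lk > rk) j += 1
      else {
        // Find the end of the run of rows sharing this key on each side.
        var iEnd = i
        while (iEnd < left.length && left(iEnd)._1 == lk) iEnd += 1
        var jEnd = j
        while (jEnd < right.length && right(jEnd)._1 == lk) jEnd += 1
        for (a <- i until iEnd; b <- j until jEnd)
          out += ((lk, left(a)._2, right(b)._2))
        i = iEnd
        j = jEnd
      }
    }
    out.result()
  }

  // Join bucket k of the left side with bucket k of the right side only.
  def joinBuckets[A, B](
      left: Vector[Vector[(Int, A)]],
      right: Vector[Vector[(Int, B)]]
  ): Vector[(Int, A, B)] =
    left.zip(right).flatMap { case (lb, rb) => mergeJoin(lb, rb) }
}
```

With the ids from the slide, only id 6 (twice on the left, once on the right) and id 2 have matches, so the join yields three rows.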

SMB join
− Shuffle once, join everywhere → Amortized cost
− PR in Apache Beam (https://github.com/apache/beam/pull/8486)
  Goal: handle gotchas automatically
  • Store and check bucketing metadata
  • Handle skewness
  • Support joining datasets with a different number of buckets
− Bonus: Storage is more efficient (better compression)
Shuffle once, then join as many times as needed. The PR to Beam is in progress.

Ease of use: BeamSQL

Scio 0.8
− SchemaCoder (structure aware coders)
− BeamSQL
− Automatic type conversion
− Better coder support for Java classes
− Simpler job completion API (remove futures / ExecutionContext)...
− Bugfixes
− etc.
Scio 0.8 adds SchemaCoder, BeamSQL, and more.

BeamSQL
val coll: SCollection[User] = ???
val r: SCollection[(String, List[String])] =
  sql"""
    SELECT username, emails
    FROM ${coll}
  """.as[(String, List[String])]
‒ Is the query valid SQL?
‒ Are `username` and `emails` valid fields in `User`?
‒ What’s the type of `username`?
‒ What’s the type of `emails`?
‒ Can the result be converted to the expected type?
Is the query valid SQL? What does String refer to?

BeamSQL
  val r =
-   sql"""
+   tsql"""
      SELECT username, emails
      FROM ${coll}
    """.as[(String, List[String])]

BeamSQL
tsql"""
  SELECT username, emails
  FOM ${coll}
""".as[(String, List[String])]
ParseException:
Encountered "PCOLLECTION" at line 4, column 8.
Was expecting one of:
  <EOF>
  "ORDER" …
  "OFFSET" …
  "FETCH" …
  "FROM" …
  "," …
Query:
Select username, emails fom PCOLLECTION
Typos are detected at compile time.

BeamSQL
val coll: SCollection[User] = ???
tsql"""
  SELECT name, emails
  FROM ${coll}
""".as[(String, List[String])]
SqlValidatorException:
Column 'name' not found in any table
PCOLLECTION schema:
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ username                    │ STRING   │ NO       │
│ emails                      │ STRING[] │ NO       │
└─────────────────────────────┴──────────┴──────────┘
Expected schema:
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ _1                          │ STRING   │ NO       │
│ _2                          │ STRING[] │ NO       │
└─────────────────────────────┴──────────┴──────────┘

BeamSQL
tsql"""
  SELECT username, emails
  FROM ${coll}
""".as[(Int, List[String])]
Inferred schema for query is not compatible with the expected schema.
Query result schema (inferred):
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ username                    │ STRING   │ NO       │
│ emails                      │ STRING[] │ NO       │
└─────────────────────────────┴──────────┴──────────┘
Expected schema:
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ _1                          │ INT32    │ NO       │
│ _2                          │ STRING[] │ NO       │
└─────────────────────────────┴──────────┴──────────┘

Automatic type conversion
val in: SCollection[A] = ???
val r: SCollection[B] =
  in.to[B](To.safe)
− Convert between classes without boilerplate
− Support Java beans
− Support Scala case classes
− Support Avro SpecificRecord
Automatic type conversion without boilerplate, for Java beans, Scala case classes, and Avro SpecificRecord.

Type conversion
val in: SCollection[A] = ???
val r: SCollection[B] =
  in.to[B](To.safe)
Schemas are not compatible:
A schema:
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ i                           │ INT32    │ NO       │
│ s                           │ STRING   │ NO       │
│ e                           │ ROW      │ NO       │
│ e.xs                        │ INT64[]  │ NO       │
│ e.q                         │ STRING   │ NO       │
└─────────────────────────────┴──────────┴──────────┘
B schema:
┌─────────────────────────────┬──────────┬──────────┐
│ NAME                        │ TYPE     │ NULLABLE │
├─────────────────────────────┼──────────┼──────────┤
│ q                           │ STRING   │ NO       │
│ xs                          │ INT64[]  │ NO       │
└─────────────────────────────┴──────────┴──────────┘
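
The error on this slide can be read as a structural check between two schemas. The sketch below models that idea in plain Scala under a deliberately simple assumption: a conversion from A to B is accepted only if every field of B appears in A with the same name, type, and nullability. This rule, and the names `SchemaCheckSketch`, `Field`, `missingFields`, and `compatible`, are hypothetical and are not taken from Scio's implementation.

```scala
object SchemaCheckSketch {
  // One row of a schema table: field name, type name, and nullability.
  final case class Field(name: String, tpe: String, nullable: Boolean)

  // Fields required by the target schema that the source schema cannot provide.
  def missingFields(from: Seq[Field], to: Seq[Field]): Seq[Field] =
    to.filterNot(field => from.contains(field))

  // Assumed rule: compatible when the target asks for nothing the source lacks.
  def compatible(from: Seq[Field], to: Seq[Field]): Boolean =
    missingFields(from, to).isEmpty

  // The two schemas from the slide, with nested fields flattened to dotted names.
  val schemaA: Seq[Field] = Seq(
    Field("i", "INT32", nullable = false),
    Field("s", "STRING", nullable = false),
    Field("e", "ROW", nullable = false),
    Field("e.xs", "INT64[]", nullable = false),
    Field("e.q", "STRING", nullable = false)
  )
  val schemaB: Seq[Field] = Seq(
    Field("q", "STRING", nullable = false),
    Field("xs", "INT64[]", nullable = false)
  )
}
```

Under this rule `schemaA` is not compatible with `schemaB`: `q` and `xs` exist in A only nested under `e`, not as top-level fields, which matches the rejection shown on the slide.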

github.com/spotify/scio
spotify.github.io/scio
@skaalf
gitter.im/spotify/scio
spotifyjobs.com
