Spline: Data Lineage For Spark Structured Streaming

Marek Novotny, ABSA
Vaclav Kosar, ABSA
#SAISExp18

About Us
• ABSA is a Pan-African financial services provider
– With Apache Spark at the core of its data engineering
• We try to fill gaps in the Hadoop ecosystem
• Contributions to Apache Spark
• Spark-related open-source projects (github.com/AbsaOSS)
– ABRiS – Avro SerDe for structured APIs (#SAISDev5)
– Cobrix – COBOL data source
– Atum – Completeness and accuracy library
– Spline – Data lineage tracking and visualization tool (#EUent3)

• How is data calculated?
• What is the schema and format of streamed data?

Data Flow
[Diagram: data flow connecting Job 1, Job 2, and Job 3 with Topic A, Topic B, Topic D, Topic Z, and Path /…/abc]

Transformations
[Diagram: Job 3 details shown within the data flow, with the labels Topic D, //path/, Join, colA + colB, and Topic Z]

Schemas and Formats
[Diagram: the data flow annotated with Schema A, Schema B, Schema C, Schema D, and Schema Z]

Spline
• To make Spark BCBS (Clarity) compliant
• To communicate with business people
• Online documentation of
– Job dependencies
– Spark SQL job details
– Attributes occurring in the logic
[Screenshots: Dependencies, Details, and Attributes views]

Lineage Tracking of Batch Jobs
• Dataset-oriented
• Leverages execution plans
• Structured APIs only
– SQL
– Dataframes
– Datasets
• UDFs and lambdas are treated as black boxes
[Diagram: a Job producing Dataset A, with Lineage A]
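To illustrate what treating a UDF as a black box can mean, here is a minimal sketch in plain Python. It does not use Spark or Spline; the `Expr` class and the `normalize` UDF name are hypothetical. The assumption is that a plan records which expressions are passed to a UDF even though the UDF's internal logic stays opaque.

```python
from dataclasses import dataclass, field


@dataclass
class Expr:
    """Node of a hypothetical, simplified expression tree (not a Spark class)."""
    kind: str                      # "column", "add", or "udf"
    name: str = ""                 # column name or UDF name
    children: list = field(default_factory=list)


def input_columns(expr):
    """Collect the input columns an expression depends on."""
    if expr.kind == "column":
        return {expr.name}
    cols = set()
    for child in expr.children:
        cols |= input_columns(child)
    return cols


def describe(expr):
    """Render the logic; a UDF body cannot be inspected, so it is shown as opaque."""
    if expr.kind == "column":
        return expr.name
    if expr.kind == "udf":
        return f"{expr.name}(<black box>)"
    return " + ".join(describe(child) for child in expr.children)


# A built-in expression is fully transparent.
total = Expr("add", children=[Expr("column", "colA"), Expr("column", "colB")])
# A UDF (hypothetical name) exposes only its inputs.
udf = Expr("udf", "normalize", [Expr("column", "colC")])

total_text = describe(total)
udf_text = describe(udf)
```

In this sketch the attribute dependencies are still tracked for the UDF, but the description of its logic is not.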

Lineage Tracking of Streaming Jobs
• Structured Streaming only
• Source-oriented (topic)
• Evolves over time
[Diagram: App, Topic A, and Lineage T1, Lineage T2, Lineage T3 along a time axis]

Structured Streaming Support
• StreamingQueryManager
– Information about start
– Can provide execution plans
– Information about progress
• MicroBatch
[Diagram: a Spark structured streaming job (Session, Transformations, Query) on top of the Spark libraries, and the Spline Streaming Listener, with the labels Start, Progress, "Give me exec. plans", Execution Plans, Event details, Lineage Model, and Spline UI]

Interval View
• Displays data flow in a fixed interval
[Diagram: timeline from Start to End with progress events of Job W1, Job R, and Job W2; the selected Interval shows Job W1, Job R, and Job W2 with S1, S2, and S3]
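The idea of the Interval View can be sketched in a few lines of plain Python. This is a minimal illustration, not Spline's implementation: the `ProgressEvent` class, the timestamps, and the assignment of S1, S2, and S3 to jobs are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProgressEvent:
    """Hypothetical, simplified progress record of one streaming job."""
    job: str
    timestamp: datetime
    reads: tuple      # names of sources read (may be left empty here)
    writes: str       # name of the sink written


def interval_view(events, start, end):
    """Return data-flow edges of all jobs that reported progress in [start, end]."""
    edges = set()
    for event in events:
        if start <= event.timestamp <= end:
            for source in event.reads:
                edges.add((source, event.job))
            edges.add((event.job, event.writes))
    return edges


day = (2018, 9, 24)
events = [
    # Upstream inputs of the writer jobs are omitted for brevity.
    ProgressEvent("Job W1", datetime(*day, 10, 5), (), "S2"),
    ProgressEvent("Job R", datetime(*day, 10, 20), ("S2", "S3"), "S1"),
    ProgressEvent("Job W2", datetime(*day, 10, 40), (), "S3"),
]

edges = interval_view(events, datetime(*day, 10, 0), datetime(*day, 10, 30))
```

With this interval, Job W2 is left out because its only progress event falls outside the selected time range. The selection relies purely on event timestamps.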

Demo – Use Case
What is the temperature per hour in Prague?
[Diagram: Station 1, Station 2, …, Station N and Prague]

Demo – Use Case Output
[Chart: Temperature [°C] (axis 0 to 35) by hour (0 to 23) for 2018-09-24, with Start and End markers]

Demo – Select Interval View

Demo – Select Interval

Demo – Select Sink

Demo – Find Highlighted Sink

Demo – Review The Lineage

Demo – Change The Interval

Demo – Observe New Lineage

Demo – Select A Job

Demo – Drill Down

Demo – Review Job Details

Demo – Select An Operation

Demo – See Operation Attributes

Interval View Limitations
[Diagram: timeline from Start to End with progress events of Job W1, Job R, and Job W2; the selected Interval shows Job R with S1, S2, and S3; time labels 10:21, 10:25, 10:30, 10:35, 10:45, 10:51]

[Diagram: the same timeline and time labels; the Interval View for the selected Interval shows Job W1 and Job R with S1 and S2]

• Edge case (delayed read, early write)
– Job W1 should be linked
– Job W2 should not be linked
[Diagram: the same timeline and time labels, contrasting two results labelled Lineage and Interval View: one with Job W1, Job R, S1, and S2, the other with Job W2, Job R, S1, and S3]

Beyond The Interval View
• Instead of timestamps, use addresses of rows
• Structured Streaming has addresses (offsets) on each source, but not on sinks
• Most sinks are also sources and thus could return offsets
[Diagram: a Progress Event of a Job reading Source 2 (offsets 3 - 5) and Source 3 (offsets 12 - 14) and writing to a Sink whose offsets are unknown (?)]
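The asymmetry between sources and sinks is visible in the JSON form of a Structured Streaming progress event. The sketch below parses a trimmed, hand-written event in plain Python. The field names (`sources`, `description`, `startOffset`, `endOffset`, `sink`) follow Spark's documented progress format as far as I know; the topic name and offset values are made up for illustration.

```python
import json

# Trimmed, hand-written progress event; values are illustrative only.
progress_json = """
{
  "name": "job-r",
  "batchId": 7,
  "sources": [
    {
      "description": "KafkaSource[Subscribe[topicA]]",
      "startOffset": {"topicA": {"0": 3}},
      "endOffset": {"topicA": {"0": 5}},
      "numInputRows": 2
    }
  ],
  "sink": {"description": "MemorySink"}
}
"""


def source_offset_ranges(progress):
    """Map each source description to its (startOffset, endOffset) pair."""
    return {
        source["description"]: (source["startOffset"], source["endOffset"])
        for source in progress["sources"]
    }


progress = json.loads(progress_json)
ranges = source_offset_ranges(progress)

# The sink entry identifies the sink but carries no offsets.
sink_has_offsets = "startOffset" in progress["sink"] or "endOffset" in progress["sink"]
```

Each source reports the offset range consumed in the micro-batch, while the sink entry only identifies the sink, so the rows written cannot be addressed from the event alone.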

Offset-Based Linking
[Diagram sequence: starting from the selected S1, the progress events of Job R, Job W2, Job W1, and Job X are matched step by step using offset ranges (3 - 5, 12 - 14, 9 - 19, 22 - 27) on S1, S2, and S3; the resulting lineage shows Job W1, Job R, and Job W2 with S1, S2, and S3]
• Jobs are linked when progress offsets overlap
• The timestamp of an offset doesn't matter
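The linking rule, that jobs are linked when their progress offsets overlap, can be sketched as a range-overlap check in plain Python. This is a hypothetical illustration: offset ranges are assumed to be inclusive, the written ranges assume sinks could report offsets (which requires the proposed changes to Spark), and the assignment of ranges to jobs is invented for the example.

```python
def overlaps(a, b):
    """True when two inclusive offset ranges (start, end) share at least one offset."""
    return a[0] <= b[1] and b[0] <= a[1]


def linked_writers(reads, writes):
    """Return writer jobs whose written offsets overlap what the reader consumed.

    reads:  {store: (start, end)} taken from the reader's progress event
    writes: list of (job, store, (start, end)) taken from writer progress events
    """
    return sorted(
        job
        for job, store, written in writes
        if store in reads and overlaps(reads[store], written)
    )


# Offsets consumed by the reader job (hypothetical assignment).
job_r_reads = {"S2": (3, 5), "S3": (12, 14)}

# Offsets written by other jobs (hypothetical; sinks do not report these today).
writes = [
    ("Job W1", "S2", (3, 5)),
    ("Job W2", "S3", (9, 19)),
    ("Job X", "S3", (22, 27)),
]

linked = linked_writers(job_r_reads, writes)
```

No timestamp appears anywhere in the check: a writer is linked whenever it produced rows the reader actually consumed, regardless of when either progress event was emitted.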

Conclusion
• Spline: data lineage tracking tool
• New support for Structured Streaming
• Demo POC: Interval View
• Proposed generalization: offset-based linking

Future Plans
• Release Interval View in Spline
• After changes to Spark:
– Offset-based linking for micro-batch streaming
– Continuous streaming support
• Support for dataset checkpoints

Questions
• Now is a good time
• Or feel free to contact us
– Marek Novotny
• mn.mikke@gmail.com
– Vaclav Kosar
• admin@vaclavkosar.com
• github.com/AbsaOSS/spline
