Marek Novotny, ABSA
Vaclav Kosar, ABSA
Spline: Data Lineage for
Spark Structured Streaming
#SAISExp18
About Us
• ABSA is a Pan-African financial services provider
– With Apache Spark at the core of its data engineering
• We try to fill gaps in the Hadoop ecosystem
• Contributions to Apache Spark
• Spark-related open-source projects (github.com/AbsaOSS)
– ABRiS – Avro SerDe for structured APIs (#SAISDev5)
– Cobrix – COBOL data source
– Atum – Completeness and accuracy library
– Spline – Data lineage tracking and visualization tool (#EUent3)
• How is data calculated?
• What is the schema and format of streamed data?
Data Flow
[Diagram: Job 1, Job 2 and Job 3 connected through Topic A, Topic B, Topic D, Topic Z and Path /…/abc]
Transformations
[Diagram: the data flow of Job 1, Job 2 and Job 3 with a "Job 3 Details" panel listing Topic D, //path/, Join, colA + colB and Topic Z]
Schemas and Formats
[Diagram: the data flow of Job 1, Job 2 and Job 3 annotated with Schema A, Schema B, Schema C, Schema D and Schema Z]
Spline
• To make Spark BCBS (Clarity) compliant
• To communicate with business people
• Online documentation of
– Job dependencies
– Spark SQL job details
– Attributes occurring in the logic
[Views shown: Dependencies, Details, Attributes]
Lineage Tracking of Batch Jobs
• Dataset-oriented
• Leverages execution plans
• Structured APIs only
– SQL
– DataFrames
– Datasets
• UDFs and lambdas are treated as black boxes
[Diagram: Job, Dataset A, Lineage A]
Lineage Tracking of Streaming Jobs
• Structured Streaming only
• Source-oriented (topic)
• Evolves over time
[Diagram: App and Topic A on a Time axis with Lineage T1, Lineage T2 and Lineage T3]
Structured Streaming Support
• StreamingQueryManager
– Information about start
– Can provide execution plans
– Information about progress
• MicroBatch
[Diagram labels: Spark structured streaming job (Spark libraries, Transformations, Session, Query); StreamingQueryManager; Start and Progress events; Spline Streaming Listener ("Give me exec. plans"); Execution Plans; Event details; Lineage Model; Spline UI]
Interval View
• Displays data flow in a fixed interval
[Diagram: Time axis from Start to End with progress events of Job W1, Job R and Job W2; the selected Interval yields a view of Job W1, Job R, Job W2, S1, S2 and S3]
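The Interval View idea can be sketched in a few lines: keep only the progress events whose timestamp falls inside the chosen interval and draw an edge for every source read and sink written. This is a minimal illustration, not Spline's implementation. The `ProgressEvent` class, the `at` helper, the event times and the topology (W1 writes S2, W2 writes S3, R reads S2 and S3 and writes S1) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProgressEvent:
    job: str             # job that reported the micro-batch
    sources: tuple       # streams read during the micro-batch
    sink: str            # stream written during the micro-batch
    timestamp: datetime  # when the progress event was emitted


def interval_view(events, start, end):
    """Return data-flow edges for progress events with start <= timestamp <= end."""
    edges = set()
    for event in events:
        if start <= event.timestamp <= end:
            edges.update((source, event.job) for source in event.sources)
            edges.add((event.job, event.sink))
    return edges


def at(hhmm):
    # Hypothetical helper: a time of day on 2018-09-24, the date of the demo chart.
    return datetime.strptime("2018-09-24 " + hhmm, "%Y-%m-%d %H:%M")


# Assumed topology: W1 writes S2, W2 writes S3, R reads S2 and S3 and writes S1.
events = [
    ProgressEvent("Job W1", (), "S2", at("10:05")),
    ProgressEvent("Job R", ("S2", "S3"), "S1", at("10:10")),
    ProgressEvent("Job W2", (), "S3", at("10:40")),
]

view = interval_view(events, at("10:00"), at("10:15"))
print(sorted(view))
```

With these made-up times, Job W2 reports progress outside the interval, so it does not appear in the view even though Job R reads S3.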
Demo – Use Case
What is the temperature per hour in Prague?
[Diagram: Station 1, Station 2, …, Station N; Prague]
Demo – Use Case Output
[Chart: Temperature [°C] (0 to 35) by Hours (0 to 23) for 2018-09-24, with Start and End markers]
Demo – Select Interval View
Demo – Select Interval
Demo – Select Sink
Demo – Find Highlighted Sink
Demo – Review The Lineage
Demo – Change The Interval
Demo – Observe New Lineage
Demo – Select A Job
Demo – Drill Down
Demo – Review Job Details
Demo – Select An Operation
Demo – See Operation Attributes
Interval View Limitations
• Edge case (delayed read, early write)
– Job W1 should be linked
– Job W2 should not be linked
[Diagram labels: Time axis from Start to End with progress events of Job W1, Job R and Job W2; time ranges 10:21 to 10:25, 10:30 to 10:35 and 10:45 to 10:51; the selected Interval; two graphs labelled Lineage and Interval View, one with Job W1, Job R, S1, S2 and one with Job W2, Job R, S1, S3]
Beyond The Interval View
• Use row addresses instead of timestamps
• Structured Streaming (SS) has addresses (offsets) on each source, but not on sinks
• Most sinks are also sources and thus could return offsets
[Diagram: Progress Event of a Job with Source 2 (offsets 3 - 5), Source 3 (offsets 12 - 14) and a Sink (offsets: ?)]
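To make the source/sink asymmetry concrete, the sketch below reads per-source offset ranges out of a progress event serialized as JSON. The sample is hand-written to resemble the JSON form of Spark's `StreamingQueryProgress` for Kafka sources (a `sources` array with `startOffset` and `endOffset`, and a `sink` object). The topic names, offsets, sink description and the `source_offset_ranges` helper are assumptions for illustration, and the exact fields may differ between Spark versions.

```python
import json

# Hand-written sample shaped like Spark's StreamingQueryProgress JSON for Kafka
# sources; the topic names and the sink description are made up.
SAMPLE_PROGRESS = """
{
  "batchId": 7,
  "sources": [
    {"description": "KafkaV2[Subscribe[source2]]",
     "startOffset": {"source2": {"0": 3}},
     "endOffset": {"source2": {"0": 5}}},
    {"description": "KafkaV2[Subscribe[source3]]",
     "startOffset": {"source3": {"0": 12}},
     "endOffset": {"source3": {"0": 14}}}
  ],
  "sink": {"description": "hypothetical-sink"}
}
"""


def source_offset_ranges(progress_json):
    """Map (topic, partition) to the (start, end) offsets reported for each source."""
    progress = json.loads(progress_json)
    ranges = {}
    for source in progress["sources"]:
        start = source.get("startOffset") or {}  # tolerate a missing or null start
        end = source.get("endOffset") or {}
        for topic, partitions in end.items():
            for partition, end_offset in partitions.items():
                start_offset = start.get(topic, {}).get(partition)
                ranges[(topic, int(partition))] = (start_offset, end_offset)
    return ranges


ranges = source_offset_ranges(SAMPLE_PROGRESS)
sink = json.loads(SAMPLE_PROGRESS)["sink"]
print(ranges)
print(sink)
```

In this sample the sink entry carries only a description, which mirrors the "Offsets: ?" box on the slide: there is nothing on the sink side to match against.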
Offset-Based Linking
• Jobs are linked when progress offsets overlap
• The timestamp of an offset does not matter
[Diagram, built up over several slides: offset axes for S1, S2 and S3; starting from the selected S1, the progress events of Job R, Job W2, Job W1 and Job X are matched by offset ranges (3 - 5, 12 - 14, 9 - 19, 22 - 27), giving a lineage of Job W1, Job R, Job W2, S1, S2 and S3]
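The linking rule stated above ("jobs are linked when progress offsets overlap") reduces to an interval-overlap test per stream. The sketch below is one possible reading of it, not the authors' implementation. The offset ranges come from the slides, but the assignment of ranges to jobs and streams, the inclusive-range convention and the function names are assumptions. It also presumes sinks report offsets, which the talk lists as a change still needed in Spark.

```python
def overlaps(a, b):
    """True when two inclusive (low, high) offset ranges share an offset."""
    return a[0] <= b[1] and b[0] <= a[1]


def linked_writers(reads, writes):
    """Return jobs whose written offsets overlap the reader's offsets on the same stream.

    reads:  {stream: (low, high)} taken from the reader's progress event
    writes: [(job, stream, (low, high)), ...] taken from writers' progress events;
            sink offsets are hypothetical here (see the assumptions above).
    """
    return sorted({
        job
        for job, stream, written in writes
        if stream in reads and overlaps(reads[stream], written)
    })


# Offset ranges from the slides; which job wrote which range is an assumption.
reads_of_job_r = {"S2": (3, 5), "S3": (12, 14)}
writes = [
    ("Job W2", "S3", (9, 19)),   # overlaps 12 - 14 on S3
    ("Job W1", "S2", (22, 27)),  # does not overlap 3 - 5 on S2
]

print(linked_writers(reads_of_job_r, writes))
```

Note that no timestamp appears anywhere in the comparison, so a delayed read or an early write does not change the result.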
Conclusion
• Spline: data lineage tracking tool
• New support for Structured Streaming
• Demo POC: Interval View
• Proposed generalization: offset-based linking
Future Plans
• Release Interval View in Spline
• After changes to Spark:
– Offset-based linking for micro-batch streaming
– Continuous streaming support
• Support for dataset checkpoints
Questions
• Now is a good time
• Or feel free to contact us
– Marek Novotny
• mn.mikke@gmail.com
– Vaclav Kosar
• admin@vaclavkosar.com
• github.com/AbsaOSS/spline
