The document discusses Precima's analytics processes and pipeline. It describes moving from on-premises systems built on SAS and shell scripting to AWS-based tooling such as S3 and Redshift, orchestrated with Control-M and Luigi. It outlines considerations for pipeline design and reviews both past and current systems. The future vision involves using Databricks for data pipelines and Snowflake for queries, allowing decoupled, scalable compute and storage.
4. Pipeline Design Considerations
A production pipeline should be easy to test, debug and monitor, with clear solutions for replaying, rerunning and interrupting tasks or dataflows.
Several teams are involved in the product pipeline (e.g., security, development and support); however, there is a clear chain of responsibility and a protocol for when things go wrong.
Pipeline designs are reviewed against business/stakeholder use cases, and our pipelines are designed to be highly configurable and scalable.
5. 2 Years Ago – Legacy Stack in a Data Center
cron
10. Control-M Scheduler
Coordinate dependencies between disparate servers and platforms
Central dashboard of execution status
We have gone from a handful of servers to hundreds
Understand what runs when and for how long
Comparison of jobs to historical runtimes
11. Spot Fleet
Run hundreds of independent jobs concurrently
Each job gets its own server
Compute cost is about $0.10/hr
Shared storage
Servers automatically shut down when jobs complete
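The Spot Fleet pattern above — one throwaway server per job, terminated on completion — can be sketched with boto3. This is a minimal, assumed configuration: the AMI, IAM fleet role ARN, and instance type below are placeholders, not Precima's actual values.

```python
# Sketch of a one-server-per-job Spot Fleet request. All identifiers here
# (role ARN, AMI, instance type) are illustrative placeholders.
def build_spot_fleet_config(num_jobs, max_price="0.10"):
    """Build a Spot Fleet request config: one instance per job."""
    return {
        "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",  # placeholder
        "TargetCapacity": num_jobs,              # one server per independent job
        "SpotPrice": max_price,                  # ~$0.10/hr, as cited above
        "TerminateInstancesWithExpiration": True,  # servers go away with the fleet
        "LaunchSpecifications": [{
            "ImageId": "ami-00000000",           # placeholder AMI with the job runtime
            "InstanceType": "m5.large",
        }],
    }

# With AWS credentials configured, this would be submitted via boto3:
# import boto3
# ec2 = boto3.client("ec2")
# ec2.request_spot_fleet(SpotFleetRequestConfig=build_spot_fleet_config(200))

config = build_spot_fleet_config(200)
```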
12. Redshift
Pros
Flexibility over the data center
Very quick to onboard new clients
Can provide very fast query times over large datasets
Cons
Concurrency issues – leader node
Inconsistent job runtimes based on overall workloads
Need to scale for largest expected workload
Storage coupled to compute
Not quick to scale
AWS Only
14. Databricks and Snowflake
Databricks for Data Pipelines and Data Science
Snowflake for high performance data warehouse queries
Benefits
Decouple compute from storage
Jobs don’t interfere with each other
Virtually unlimited compute scaling
Virtually unlimited low cost storage
Spot pricing for nodes
Time Travel features allow for repeatable fast dry runs on live or nearly live data
Notebook interface including Python, SQL, Scala, R and Markdown for comments
Multi-cloud support
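The Time Travel benefit above means a dry run can query a table exactly as it stood at an earlier moment, making reruns repeatable against live data. A minimal sketch, assuming an illustrative table name, of building such a query for Snowflake's AT(OFFSET => ...) clause:

```python
# Sketch: build a Snowflake Time Travel query that reads a table as it was
# `seconds_ago` seconds in the past. The table name is illustrative.
def time_travel_query(table, seconds_ago):
    """Return a SELECT pinned to a past point via Snowflake's AT(OFFSET) clause."""
    return f"SELECT * FROM {table} AT(OFFSET => -{seconds_ago})"

sql = time_travel_query("sales_fact", 3600)  # the table as of one hour ago
# With the Snowflake connector this would run as: cursor.execute(sql)
```

Because the offset pins the data, rerunning the same dry run yields the same inputs even while the underlying table keeps loading.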
15. Vision for the Future
ETL with Databricks Spark jobs built using Object Oriented Python
Take advantage of inheritance and configuration
Quickly map new data feeds to our standard data model for our Precima products
Built-in validation and conversion for data fields
DRY – Don’t repeat yourself
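The object-oriented approach above can be sketched as follows: a base class carries the shared mapping, validation and conversion machinery once (DRY), and each new data feed is onboarded as a subclass that only declares its column configuration. Class and field names here are illustrative, not Precima's actual model.

```python
# Sketch of configuration-driven feed mapping via inheritance.
# All feed/field names are illustrative.
class BaseFeedMapper:
    # map of source column -> (standard model field, converter/validator)
    field_map = {}

    def map_row(self, row):
        """Convert one source row into the standard data model."""
        out = {}
        for src, (dest, convert) in self.field_map.items():
            out[dest] = convert(row[src])  # built-in validation and conversion
        return out

class ClientAFeed(BaseFeedMapper):
    # A new feed needs only configuration, not new pipeline code.
    field_map = {
        "txn_amt": ("amount", float),
        "store":   ("store_id", str.strip),
    }

row = ClientAFeed().map_row({"txn_amt": "12.50", "store": " 042 "})
# row == {"amount": 12.5, "store_id": "042"}
```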
Data Science pipeline using Databricks notebook workflow
Notebook Workflows allow users to include one notebook within another. Users can
concatenate various notebooks that represent key ETL steps, Spark analysis steps, or ad-hoc
exploration. However, this approach lacks the ability to build more complex data pipelines.
Airflow provides tight integration with Databricks. Luigi also provides an interface to
accommodate Apache Spark jobs.
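The Airflow integration mentioned above works by submitting each notebook as a Databricks run; the payload shape below is the kind accepted by the apache-airflow-providers-databricks package's DatabricksSubmitRunOperator. Cluster sizes and notebook paths are placeholder assumptions.

```python
# Sketch: build the runs-submit payload for one notebook task.
# Spark version, worker count and paths are illustrative placeholders.
def notebook_run_spec(notebook_path, params=None):
    """Return a Databricks runs-submit payload for a single notebook."""
    return {
        "new_cluster": {"spark_version": "7.3.x-scala2.12", "num_workers": 2},
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": params or {},
        },
    }

etl_spec = notebook_run_spec("/pipelines/etl_step", {"run_date": "2020-01-01"})
model_spec = notebook_run_spec("/pipelines/model_step")

# With Airflow installed, each spec becomes a task and ">>" sets dependencies,
# which is exactly the complex-pipeline capability notebook chaining lacks:
# from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
# etl = DatabricksSubmitRunOperator(task_id="etl", json=etl_spec)
# model = DatabricksSubmitRunOperator(task_id="model", json=model_spec)
# etl >> model
```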
16. Pipeline Design Considerations
A production pipeline should be easy to test, debug and monitor, with clear solutions for replaying, rerunning and interrupting tasks or dataflows.
Workflow management frameworks have helped us achieve most of the desired features for our data pipeline.
Several teams are involved in the product pipeline (e.g., security, development and support); however, there is a clear chain of responsibility and a protocol for when things go wrong.
Pipeline designs are reviewed against business/stakeholder use cases, and our pipelines are designed to be highly configurable and scalable.
The move to AWS unlocked our ability to scale
Moving toward options that decouple storage from compute in order to scale efficiently
We have made good progress on embracing configuration
Moving toward fully configurable pipelines
18. Appendix: Qualities of Ideal Data Pipelines
The desired qualities of a data pipeline include:
Idempotent with state handling
Scalable and resilient
Replaceable or programmable
Testable and traceable
Documented and automated
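The first quality above — idempotent with state handling — can be sketched in a few lines: a task declares its output, and the runner skips any task whose output already exists, so replaying a pipeline never redoes finished work. This mirrors how Luigi's complete() check behaves; the Task class and filesystem markers below are illustrative, not a real framework API.

```python
# Sketch of idempotency via state handling: completed tasks are skipped on replay.
import os
import tempfile

class Task:
    def __init__(self, output_path):
        self.output_path = output_path
        self.runs = 0  # counts actual executions, for demonstration

    def complete(self):
        # State handling: the output file on disk records that this task ran.
        return os.path.exists(self.output_path)

    def run(self):
        self.runs += 1
        with open(self.output_path, "w") as f:
            f.write("done")

def run_idempotently(task):
    if not task.complete():  # replaying the pipeline is safe:
        task.run()           # finished work is never redone

path = os.path.join(tempfile.mkdtemp(), "out.marker")
task = Task(path)
run_idempotently(task)
run_idempotently(task)  # second call is a no-op
# task.runs == 1
```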
Editor's notes
Partly referenced from SlideShare material: https://www.slideshare.net/InfoQ/effective-data-pipelines-data-mngmt-from-chaos