Analytic Processes at Precima
Overview
• Analytics pipeline design considerations
• The old way
• The current way
• Looking to the future
Pipeline Design Considerations
• The product pipeline is easy to test, debug, and monitor. There are clear solutions for replaying, rerunning, and interrupting tasks or dataflows in a production-ready pipeline.
• Several teams are involved in the product pipeline (e.g., security, development, and support); however, there is a clear chain of responsibility and a protocol for when things go wrong.
• The pipeline design is reviewed against business/stakeholder use cases, and our pipelines are designed to be highly configurable and scalable.
2 Years Ago – Legacy Stack in a Data Center
cron
SAS Enterprise Guide and Scripting
Shell Scripting and Crontab Scheduling
Current Analytics Stack in AWS
Amazon S3
Control-M
Luigi and Control-M
Control-M Scheduler
• Coordinate dependencies between disparate servers and platforms
• Central dashboard of execution status
• We have gone from a handful of servers to hundreds
• Understand what runs when and for how long
• Comparison of jobs to historical runtimes
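The two scheduler duties listed above, ordering jobs by their dependencies and comparing a run against its history, can be sketched in a few lines of plain Python. This is only an illustration, not how Control-M works internally: the job names, the `execution_order` and `is_slow` helpers, and the three-standard-deviation tolerance are all assumptions made for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from statistics import mean, stdev


def execution_order(dependencies):
    """Return job names in an order that respects upstream dependencies.

    `dependencies` maps each job to the set of jobs it must wait for.
    """
    return list(TopologicalSorter(dependencies).static_order())


def is_slow(history_minutes, runtime_minutes, tolerance=3.0):
    """Flag a run that exceeds the historical mean by `tolerance` standard deviations."""
    if len(history_minutes) < 2:
        return False  # not enough history to judge
    threshold = mean(history_minutes) + tolerance * stdev(history_minutes)
    return runtime_minutes > threshold


# Hypothetical jobs that could live on different servers or platforms.
jobs = {
    "load_sales": set(),
    "load_products": set(),
    "build_features": {"load_sales", "load_products"},
    "score_model": {"build_features"},
}
order = execution_order(jobs)
```

Any order returned places both load jobs before `build_features` and `build_features` before `score_model`; the relative order of the two independent load jobs is not significant.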
Spot Fleet
• Run hundreds of independent jobs concurrently
• Each job gets its own server
• Compute cost is about $0.10/hr
• Shared storage
• Servers automatically shut down when jobs complete
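As a rough worked example of the cost model above: when every job has its own server and the server stops as soon as the job finishes, cost follows the total job-hours while elapsed time follows the longest single job. The sketch assumes the $0.10/hr figure applies per server; the workload of 300 two-hour jobs and the `fleet_cost` helper are hypothetical.

```python
def fleet_cost(job_hours, rate_per_hour=0.10):
    """Total compute cost when each job runs on its own server and the
    server shuts down as soon as that job completes."""
    return sum(job_hours) * rate_per_hour


# Hypothetical workload: 300 independent jobs of 2 hours each.
job_hours = [2.0] * 300
cost = fleet_cost(job_hours)   # 600 job-hours at the assumed rate
elapsed = max(job_hours)       # jobs run concurrently, so the longest job sets the wall-clock time
```

Under these assumptions the whole batch costs roughly $60 and finishes in about two hours, instead of occupying a fixed set of servers for much longer.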
Redshift
Pros
• Flexibility over the data center
• Very quick to onboard new clients
• Can provide very fast query times over large datasets
Cons
• Concurrency issues – Leader node
• Inconsistent job runtimes based on overall workloads
• Need to scale for largest expected workload
• Storage coupled to compute
• Not quick to scale
• AWS only
Future Precima Analytics Stack
Amazon S3
Control-M
Databricks and Snowflake
• Databricks for data pipelines and data science
• Snowflake for high-performance data warehouse queries
Benefits
• Decouple compute from storage
• Jobs don’t interfere with each other
• Virtually unlimited compute scaling
• Virtually unlimited low-cost storage
• Spot pricing for nodes
• Time Travel features allow for repeatable, fast dry runs on live or nearly live data
• Notebook interface including Python, SQL, Scala, R, and Markdown for comments
• Multi-cloud support
Vision for the Future
• ETL with Databricks Spark jobs built using object-oriented Python
• Take advantage of inheritance and configuration
• Quickly map new data feeds to our standard data model for our Precima products
• Built-in validation and conversion for data fields
• DRY: don’t repeat yourself
• Data science pipeline using Databricks Notebook Workflows
• Notebook Workflows allow a user to include another notebook within a notebook. Users can chain notebooks that represent key ETL steps, Spark analysis steps, or ad-hoc exploration. However, Notebook Workflows lack the ability to build more complex data pipelines.
• Airflow integrates tightly with Databricks. Luigi also provides an interface to accommodate Apache Spark jobs.
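The idea of mapping new feeds through inheritance and configuration can be sketched as follows. This is plain Python rather than a Spark job, and it is not Precima's actual code: the `FeedMapper` base class, the `RetailerAFeed` subclass, the column names, and the `parse_date` helper are all hypothetical names chosen for illustration.

```python
from datetime import datetime


class FeedMapper:
    """Base class: maps one raw record onto a standard data model.

    Subclasses supply only configuration; the mapping, validation, and
    conversion logic is written once here (DRY).
    """

    # standard field -> (source column, converter)
    field_map = {}

    def map_record(self, raw):
        mapped = {}
        for target, (source, convert) in self.field_map.items():
            if source not in raw:
                raise ValueError(f"missing source column: {source}")
            # The converter both validates and converts; bad input raises.
            mapped[target] = convert(raw[source])
        return mapped


def parse_date(text):
    """Convert an ISO-style date string into a date object."""
    return datetime.strptime(text, "%Y-%m-%d").date()


class RetailerAFeed(FeedMapper):
    """Hypothetical client feed, defined purely by configuration."""

    field_map = {
        "store_id": ("STORE", int),
        "sale_date": ("DT", parse_date),
        "amount": ("AMT", float),
    }


record = RetailerAFeed().map_record(
    {"STORE": "42", "DT": "2019-01-15", "AMT": "19.99"}
)
```

Onboarding another feed in this sketch means writing a new subclass with a different `field_map`, while the shared validation and conversion path stays untouched.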
Pipeline Design Considerations
• The product pipeline is easy to test, debug, and monitor. There are clear solutions for replaying, rerunning, and interrupting tasks or dataflows in a production-ready pipeline.
• Workflow management frameworks helped us achieve most of the desired features for a data pipeline
• Several teams are involved in the product pipeline (e.g., security, development, and support); however, there is a clear chain of responsibility and a protocol for when things go wrong.
• The pipeline design is reviewed against business/stakeholder use cases, and our pipelines are designed to be highly configurable and scalable.
• The move to AWS unlocked our ability to scale
• Moving toward options that decouple storage from compute in order to scale efficiently
• Have made good progress on embracing configuration
• Moving toward fully configurable pipelines
Appendix: Qualities of Ideal Data Pipelines
The desired qualities of a data pipeline include:
• Idempotent with state handling
• Scalable and resilient
• Replaceable or programmable
• Testable and traceable
• Documented and automated
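To make "idempotent with state handling" concrete, here is a minimal sketch in which a task records a completion marker and is skipped on any rerun. It is loosely in the spirit of the output-target checks used by workflow frameworks such as Luigi, but the `run_once` helper, the `.done` marker convention, and the task name are assumptions for this example only.

```python
import tempfile
from pathlib import Path


def run_once(task_name, state_dir, action):
    """Run `action` only if `task_name` has no completion marker yet.

    Returns True if the action ran, False if it was skipped.
    """
    marker = Path(state_dir) / f"{task_name}.done"
    if marker.exists():
        return False
    action()
    marker.touch()  # record success only after the action has finished
    return True


calls = []
with tempfile.TemporaryDirectory() as state_dir:
    # Rerunning the same task for the same day does no extra work.
    first = run_once("load_sales_2019-01-15", state_dir, lambda: calls.append("ran"))
    second = run_once("load_sales_2019-01-15", state_dir, lambda: calls.append("ran"))
```

Because the marker is written only after the action returns, a task that fails midway leaves no marker and is simply attempted again on the next run.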

Tordatasci meetup-precima-retail-analytics-201901


Editor's notes

  1. Partly referenced from SlideShare material https://www.slideshare.net/InfoQ/effective-data-pipelines-data-mngmt-from-chaos
  2. Partly referenced from SlideShare material https://www.slideshare.net/InfoQ/effective-data-pipelines-data-mngmt-from-chaos
  3. Referenced from SlideShare material https://www.slideshare.net/InfoQ/effective-data-pipelines-data-mngmt-from-chaos