Anatomy of a Data-Driven Architecture - Tamir Dresher
1. Anatomy of a Data-Driven Architecture
@tamir_dresher
System Architect @ Payoneer
2. 2
Tamir Dresher
System Architect @ Payoneer, @tamir_dresher
Software Engineering Lecturer, Ruppin Academic Center
tamirdr@payoneer.com
https://www.israelclouds.com/iasaisrael
3. 3
The Need for Data
Data-driven decision making:
• What markets are leading and where can I expand?
• What’s slowing my process?
• Is there a correlation between the time invested in a sale and the income from the tenant?
Data-powered products:
• What products should I recommend to this user?
• Is this action fraudulent?
• Should I suggest a discount to this user to raise the chance for a purchase?
4. 4
ETL vs. ELT vs. Streaming
ETL: Extract → Transform → Load, from the transactional DB (OLTP) into the analytical DB (OLAP)
ELT: Extract → Load → Transform, from the transactional DB (OLTP) into the analytical DB (OLAP) / storage
Streaming: Events → Event Stream → Stream Processor → Real-time insights
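The difference between the three patterns is mainly where the transform step runs. A minimal sketch with in-memory "tables" (the rows and the usd_amount column are hypothetical examples):

```python
# Minimal sketch contrasting ETL and ELT with in-memory "tables".

def extract(source):
    """Pull raw rows out of the transactional (OLTP) store."""
    return list(source)

def transform(rows):
    """Normalize amounts to USD, before (ETL) or after (ELT) loading."""
    return [{**r, "usd_amount": r["amount"] * r["fx_rate"]} for r in rows]

def load(rows, store):
    """Append rows into the analytical (OLAP) store."""
    store.extend(rows)
    return store

oltp = [{"amount": 100, "fx_rate": 1.25}, {"amount": 50, "fx_rate": 0.5}]

# ETL: transform in flight, load only the cleaned result.
warehouse = load(transform(extract(oltp)), [])

# ELT: load the raw rows first, transform later inside the warehouse/lake.
lake = load(extract(oltp), [])
lake_view = transform(lake)
```

Both paths end with the same derived view; ELT just keeps the raw rows around as well, which is why it pairs naturally with cheap storage.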
12. 12
Data Sources → Ingestion & Transformation → Storage → Query & Processing → Consumption
Ingestion & Transformation
Data Integration Platforms
• Connectors to sources and destinations
• E.g. Fivetran, Stitch Data, Rivery
Customized batching/micro-batching
• Spark jobs, data libraries (pandas, boto3), Hive
• Workflows – Airflow, Dagster, Luigi
https://towardsdatascience.com/building-a-production-level-etl-pipeline-platform-using-apache-airflow-a4cf34203fbd
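The micro-batching idea above can be sketched in a few lines, with SQLite standing in for the analytical store (the "sales" schema and batch size are illustrative assumptions):

```python
import sqlite3

# Micro-batch load: rows are written and committed in small batches
# rather than one-by-one or in a single huge transaction.

def load_in_micro_batches(conn, rows, batch_size=2):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        conn.executemany("INSERT INTO sales VALUES (?, ?)", batch)
        conn.commit()  # one commit per micro-batch, not per row

conn = sqlite3.connect(":memory:")
load_in_micro_batches(conn, [("EU", 10.0), ("US", 20.0), ("EU", 5.0)])
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

A workflow engine such as Airflow or Dagster would schedule and retry jobs like this; the loading logic itself stays the same.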
13. 13
Data Sources → Ingestion & Transformation → Storage → Query & Processing → Consumption
Ingestion & Transformation (cont.)
Event streaming and processing
• Messaging – Kafka, Pulsar, Kinesis, Event Hubs
• Processing – Spark, Flink, Samza, Kafka Streams, Azure Stream Analytics
14. 14
Data Sources → Ingestion & Transformation → Storage → Query & Processing → Consumption
Ingestion & Transformation (cont.)

-- Continuously aggregating a stream into a table with a ksqlDB push query
CREATE STREAM locationUpdatesStream ...;
CREATE TABLE locationsPerUser AS
  SELECT username, COUNT(*)
  FROM locationUpdatesStream
  GROUP BY username
  EMIT CHANGES;

// Continuously aggregating a stream into a table with Kafka Streams
// (LocationUpdate is an illustrative value type with a username field)
KStream<String, LocationUpdate> locationUpdatesStream = ...;
KTable<String, Long> locationsPerUser =
    locationUpdatesStream
        .groupBy((k, v) -> v.username)
        .count();

https://www.confluent.io/blog/kafka-streams-tables-part-1-event-streaming/
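The same stream-to-table aggregation as the ksqlDB and Kafka Streams snippets above can be sketched as plain Python state (the event shape is a hypothetical example):

```python
from collections import Counter

# A running count per user: each incoming event updates the
# continuously aggregated "table", mirroring the push-query semantics.
locations_per_user = Counter()

def on_location_update(event):
    """Process one event and return the current state of the table."""
    locations_per_user[event["username"]] += 1
    return dict(locations_per_user)

events = [{"username": "alice"}, {"username": "bob"}, {"username": "alice"}]
for event in events:
    table = on_location_update(event)
```

The stream is the sequence of events; the table is the mutable state they fold into, which is exactly the stream/table duality the linked post describes.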
15. 15
Data Sources → Ingestion & Transformation → Storage → Query & Processing → Consumption
Storage
Data Warehouse
• Structured format
• Designed to quickly generate insights from SQL-like queries
• Modern cloud-based offerings – Redshift, BigQuery, Snowflake, Azure Synapse
Data Lake
• Structured and unstructured data – CSV, Parquet, images, audio
• Raw and historical data
• Designed to be used by data scientists to build models in various languages
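The warehouse/lake split can be illustrated by storing one record both ways, with SQLite standing in for the warehouse and a JSON blob for the lake (names and fields are illustrative):

```python
import json
import sqlite3

# One record stored two ways: structured columns in a warehouse-style
# table (typed, SQL-queryable) vs. the raw object in a lake-style blob
# (schema applied only on read).
record = {"user": "alice", "amount": 42.5, "raw_payload": {"clicks": [1, 2]}}

# Warehouse: keep only the modeled, structured columns.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE facts (user TEXT, amount REAL)")
warehouse.execute("INSERT INTO facts VALUES (?, ?)",
                  (record["user"], record["amount"]))
amount = warehouse.execute("SELECT amount FROM facts").fetchone()[0]

# Lake: keep the whole raw record, including fields the warehouse drops.
lake_object = json.dumps(record)
reread = json.loads(lake_object)
```

The warehouse answers the SQL question fast; the lake preserves the raw payload a data scientist may later need for a model.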
16. 16
Data Sources → Ingestion & Transformation → Storage → Query & Processing → Consumption
Query & Processing
Retrospective (Historical)
• Deriving intelligence based on statistics
• Built-in engine (Data Warehouse) OR query engines – Presto, Impala
Predictive
• Generate a model with data science and ML libraries – pandas, NumPy, R, scikit-learn, etc.
• The model is periodically refreshed
Real-Time Analytics
• Run analytical queries over big volumes of data with interactive latencies
• Apache Pinot, ClickHouse, Druid
Data Science Platforms
• Help manage workflows, productization, and operations – SageMaker, Iguazio, Databricks, etc.
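The "periodically refreshed" predictive path can be sketched as recomputing a simple statistic over the latest history; the 3x-mean fraud threshold here is a hypothetical stand-in for a real trained model:

```python
# A "model" that is just a threshold derived from historical data,
# rebuilt whenever a periodic refresh runs over the latest window.

def train(amounts):
    mean = sum(amounts) / len(amounts)
    return {"threshold": 3 * mean}  # flag anything 3x the historical mean

def predict(model, amount):
    return amount > model["threshold"]

history = [10.0, 12.0, 8.0]
model = train(history)            # initial model build
flagged = predict(model, 100.0)   # serving predictions to applications

history.append(100.0)             # new data keeps landing in storage
model = train(history)            # periodic refresh picks it up
```

The serving side only ever calls `predict`; the refresh job swaps in a new model underneath it, which is the pattern platforms like SageMaker operationalize.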
17. 17
Data Sources → Ingestion & Transformation → Storage → Query & Processing → Consumption
Consumption
Custom Apps
• Execute the model – how is the model reachable?
• Translate user/system actions into queries
• Visualize – custom (Plotly Dash, Streamlit, etc.) or embedded (Power BI, Looker, etc.)
External Apps
• Augmented analytics – external services that generate insights and explain them (e.g. anomaly detection with Anodot, CrunchMetrics, outlier.ai)
• Customizable reports and dashboards (e.g. Looker, Tableau, Power BI, Sisense)
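One common answer to "how is the model reachable" is to put it behind a small service endpoint that translates a user action into a model call. A minimal sketch (the route, payload shape, and discount rule are illustrative assumptions):

```python
import json

# An HTTP-style handler: the custom app receives a user action,
# calls the model, and returns the prediction as JSON.

def score_discount(user):
    """Hypothetical model: discount big carts to raise purchase chance."""
    return 0.1 if user.get("cart_value", 0) > 100 else 0.0

def handle_request(path, body):
    if path == "/recommend-discount":
        user = json.loads(body)
        return 200, json.dumps({"discount": score_discount(user)})
    return 404, json.dumps({"error": "unknown route"})

status, payload = handle_request("/recommend-discount", '{"cart_value": 150}')
```

In a real system the handler would sit behind a web framework and the model behind a model server, but the translation step, from user action to model input to response, is the same.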
18. 18
Summary
Data Sources → Ingestion & Transformation → Storage → Query & Processing → Consumption
• Data at rest → Workflow Engine → Analytical Storage/Lake → Query Engine → Model Engine → Model → Accessible API → Application
• Data in motion → Event/Data Stream → Stream Processor → Real-time insights
19. 19
Thank You!
@tamir_dresher
Editor's notes
Data comes from many sources and feeds our applications.
We need it for OLTP, obviously, but data being the new gold, we can also use the live feed and historical data to gain more insights:
BI, ML/AI, data-driven decision making (analytic systems), and data-powered products.
Traditionally, organizations moved data from the OLTP to the OLAP (DWH) with ETL.
Modern architectures rely on ELT for batched loads.
And for the best real-time response, streaming is the way to go.