Cloud Native Data
Pipelines
1
Sid Anand (@r39132)
DataEngConf SF 2017
About Me
2
Work [ed | s] @
Committer &
PPMC on
Father of 2
Co-Chair for
Apache Airflow
Agari
3
What We Do!
Agari : What We Do
4
Agari : What We Do
9
Enterprise
Customers
email
metadata
apply
trust
models
email md +
trust score
Agari’s Previous EP Version
Agari : What We Do
Batch
10
email
metadata
apply
trust
models
email md +
trust score
Agari’s Current EP Version
Enterprise
Customers
Agari : What We Do
Near-real
time
Quarantine,
Label,
PassThrough
Data Pipelines
BI vs Predictive
11
Data Pipelines (BI)
12
Web Servers -> OLTP DB (MySQL, Oracle, Cassandra) -> ETL (batch) -> Data Warehouse (Teradata, RedShift, BigQuery) -> Reporting Tools, Query Browsers
Data Source -> ETL (batch or streaming: Spark, Flink, Beam, Storm) -> OLTP DB or cache (MySQL, Oracle, Cassandra, Redis) -> Web Servers -> Data Products: Ranking (Search, News Feed), Recommender Products, Fraud Detection / Prevention
Data Pipelines (Predictive)
13
Data Products
14
BI Predictive
Common Focus of this talk
Data Pipelines
15
BI: Web Servers -> OLTP DB (MySQL, Oracle, Cassandra) -> ETL (batch) -> Data Warehouse (Teradata, RedShift, BigQuery) -> Reporting Tools, Query Browsers
Predictive: Data Source -> ETL (batch or streaming: Spark, Flink, Beam, Storm) -> OLTP DB or cache (MySQL, Oracle, Cassandra, Redis) -> Web Servers -> Ranking (Search, News Feed), Recommender Products, Fraud Detection / Prevention
Motivation
Cloud Native Data Pipelines
16
Cloud Native Data Pipelines
17
Big Data Companies like LinkedIn, Facebook, Twitter, & Google
have large teams to manage their data pipelines

Most start-ups run in the public cloud. Can they leverage
aspects of the public cloud to build comparable pipelines?
Cloud Native Data Pipelines
18
Cloud Native
Techniques

Open Source
Technologies
Data Pipelines seen
in Big Data companies

~
Design Goals
Desirable Qualities of a Resilient Data Pipeline
19
21
Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Minimize Operational Fatigue / Automate Everything
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go
Predictive Analytics @ Agari
Use Cases
22
Use Cases
24
Apply trust models
(message scoring)
batch + near real
time
Build trust models
batch
(Enterprise Protect)
Focus of this talk
Use-Case : Message
Scoring (batch)
Batch Pipeline Architecture
25
Use-Case : Message Scoring
26
enterprise A
enterprise B
enterprise C
S3
An Avro file is uploaded to S3
every 15 minutes
Use-Case : Message Scoring
27
enterprise A
enterprise B
enterprise C
S3
Airflow kicks off a Spark
message scoring job
every hour (EMR)
Use-Case : Message Scoring
28
enterprise A
enterprise B
enterprise C
S3
Spark job writes scored
messages and stats to
another S3 bucket
S3
Use-Case : Message Scoring
29
enterprise A
enterprise B
enterprise C
S3
This triggers SNS/SQS
message events
S3
SNS
SQS
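A minimal sketch of what an importer might do with one of these SQS messages, assuming the S3 event notification is fanned out through SNS to SQS without raw message delivery (so the S3 event sits as a JSON string under the envelope's "Message" field). The function name, bucket, and key are hypothetical; only the S3 event layout (Records, s3.bucket.name, s3.object.key) is the standard one.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an SQS message body carrying an S3 event."""
    payload = json.loads(body)
    # With SNS -> SQS fan-out (no raw delivery), the S3 event is a JSON
    # string stored under the SNS envelope's "Message" field.
    if "Message" in payload:
        payload = json.loads(payload["Message"])
    objects = []
    for record in payload.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Hypothetical bucket and key, for illustration only.
s3_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "scored-messages"},
                "object": {"key": "enterprise+A/part-0000.avro"},
            },
        }
    ]
}
sqs_body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})
print(s3_objects_from_sqs_body(sqs_body))
```

Each importer would then fetch the listed objects from S3 and load them into the DB.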
Use-Case : Message Scoring
30
enterprise A
enterprise B
enterprise C
S3
An Autoscale Group
(ASG) of Importers spins
up when it detects SQS
messages
S3
SNS
SQS
Importers
ASG
31
enterprise A
enterprise B
enterprise C
S3
The importers rapidly ingest scored
messages and aggregate statistics into
the DB
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
32
enterprise A
enterprise B
enterprise C
S3
Users receive alerts of
untrusted emails &
can review them in
the web app
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
33
enterprise A
enterprise B
enterprise C
S3 S3
SNS
SQS
Importers
ASG
DB
Airflow manages the entire process
Use-Case : Message Scoring
34
Architectural Components
Component | Role | Uses | Salient Features | Operability Model
S3 | Data Lake | All data stored in S3; all processing uses S3 | Scalable, Available, Performant | Serverless
SNS, SQS | Messaging | Reliable, Transactional, Pub/Sub | Scalable, Available, Performant | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Performant | Managed
Spark (EMR) | Data Science Processing | Aggregation; Model Building; Scoring | Nice programming model at the cost of debugging complexity | We Operate
Airflow | Workflow Engine | Coordinates all Spark Jobs & complex flows | Lightweight, DAGs as Code, Steep learning curve | We Operate
DB | Persistence for WebApp | Holds subset of data needed for Web App | Rails + Postgres, ‘nuff said | We Operate
Tackling Cost & Timeliness
Leveraging the AWS Cloud
35
Tackling Cost
36
Between Daily Runs During Daily Runs
When running daily, for 23 hours of a day, we didn’t
pay for instances in the ASG or EMR
Tackling Cost
37
Between Hourly Runs During Hourly Runs
When running daily, for 23 hours of a day, we didn’t pay for
instances in the ASG or EMR
This does not help when runs are hourly since AWS charges at
an hourly rate for EC2 instances!
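The arithmetic behind this slide can be sketched as follows, assuming each run's instance usage is rounded up to whole hours and using a hypothetical 45-minute run; neither the run length nor the function name comes from the talk.

```python
import math


def billed_instance_hours(run_minutes: float, runs_per_day: int) -> int:
    """Instance-hours billed per day when each run is rounded up to whole hours."""
    return runs_per_day * math.ceil(run_minutes / 60)


ALWAYS_ON = 24  # instance-hours per day for an instance that never stops

# Hypothetical 45-minute run, once a day versus once an hour.
daily = billed_instance_hours(run_minutes=45, runs_per_day=1)
hourly = billed_instance_hours(run_minutes=45, runs_per_day=24)
print(daily, hourly, ALWAYS_ON)
```

Under these assumptions, the hourly schedule is billed the same as leaving the instance running all day, which is the "no cost savings" case.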
Tackling Timeliness
Auto Scaling Group (ASG)
38
ASG - Overview
39
What is it?
A means to automatically scale out/in clusters to handle
variable load/traffic
A means to keep a cluster/service of a fixed size always up
ASG - Data Pipeline
40
importer
importer
importer
importer
Importer
ASG
scaleout/in
SQS
DB
41
Sent
CPU
ACKd/Recvd
CPU-based auto-scaling is
good at scaling in/out to
keep the average CPU
constant
ASG : CPU-based
ASG : CPU-based
42
Sent
CPU
Recv
Premature
Scale-in
Premature Scale-in:
• The CPU drops to noise-levels before all messages are
consumed
• This causes scale in to occur while the last few
messages are still being committed
43
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0)
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight
message is ACK’d)
This causes the
ASG to grow
This causes the
ASG to shrink
ASG : Queue-based
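The queue-based rule can be sketched as a small decision function. This is an illustration under stated assumptions, not the deployed policy: the names are hypothetical, and giving scale-out precedence over scale-in is an assumption. In SQS terms, "visible" and "invisible" correspond to the ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible queue metrics.

```python
from enum import Enum


class ScalingAction(Enum):
    SCALE_OUT = "scale_out"
    SCALE_IN = "scale_in"
    HOLD = "hold"


def decide(visible_messages: int, in_flight_messages: int) -> ScalingAction:
    """Apply the slide's rule: grow on queue depth, shrink once nothing is in flight."""
    if visible_messages > 0:
        # Work is waiting in the queue: grow the ASG.
        return ScalingAction.SCALE_OUT
    if in_flight_messages == 0:
        # Queue is empty and the last in-flight message has been ACK'd.
        return ScalingAction.SCALE_IN
    # Queue is empty but importers are still committing messages: do not shrink yet.
    return ScalingAction.HOLD


print(decide(5, 0), decide(0, 3), decide(0, 0))
```

The HOLD branch is what avoids the premature scale-in seen with CPU-based scaling: the group only shrinks after in-flight work has drained.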
Auto Scaling Groups
Build & Deploy
44
ASG - Build & Deploy
45
Component | Role | Details
Terraform | Spins up Cloud Resources | Spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them
Ansible | A better version of Chef & Puppet; sets up an EC2 instance | Agentless, idempotent, & declarative tool to set up EC2 instances, by installing & configuring packages, and more
Packer | Spins up an EC2 instance for the purposes of building an AMI! | Can be used with Ansible & Terraform to bake AMIs & launch Auto-Scaling Groups
ASG - Build & Deploy
50
Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
Step 3 : Packer snapshots the machine & registers the AMI.
Step 4 : Packer terminates the EC2 instance!
Step 5 : Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG)
51
Desirable Qualities of a Resilient
Data Pipeline
Operability Correctness
Timeliness Cost
• ASG
• EMR Spark
Daily
• ASG
• EMR Spark
Hourly ASG
• No Cost Savings
Tackling Operability &
Correctness
Leveraging Tooling
52
53
A simple way to author, configure, manage workflows
Provides visual insight into the state & performance of workflow
runs
Integrates with our alerting and monitoring tools
Tackling Operability : Requirements
Apache Airflow
Workflow Automation & Scheduling
54
55
Airflow: Author DAGs in Python! No need to bundle many config files!
Apache Airflow - Authoring DAGs
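To illustrate the "DAGs as code" idea without depending on Airflow itself, here is a sketch using only Python's standard library. It is not Airflow's API; the task names are hypothetical and loosely follow the batch scoring pipeline described earlier.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names).
batch_scoring_dag = {
    "wait_for_avro_uploads": set(),
    "spark_score_messages": {"wait_for_avro_uploads"},
    "import_scored_messages": {"spark_score_messages"},
    "send_alerts": {"import_scored_messages"},
}

# A scheduler must run tasks in an order that respects every dependency.
execution_order = list(TopologicalSorter(batch_scoring_dag).static_order())
print(execution_order)
```

Because the workflow is ordinary Python data and code, it can be versioned, reviewed, and generated programmatically, which is the point the slide makes about avoiding bundles of config files.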
56
Airflow: Visualizing a DAG
Apache Airflow - Authoring DAGs
57
Airflow: It’s easy to manage multiple DAGs
Apache Airflow - Managing DAGs
Apache Airflow - Perf. Insights
58
Airflow: Gantt chart view reveals the slowest tasks for a run!
59
Apache Airflow - Perf. Insights
Airflow: Task Duration chart view shows task completion time trends!
60
Airflow: …And easy to integrate with Ops tools!
Apache Airflow - Alerting
61
Apache Airflow - Correctness
62
Desirable Qualities of a Resilient
Data Pipeline
Operability Correctness
Timeliness Cost
Use-Case : Message
Scoring (near-real time)
NRT Pipeline Architecture
63
Use-Case : Message Scoring
64
enterprise A
enterprise B
enterprise C
Kinesis batch put every
second
K
Use-Case : Message Scoring
65
enterprise A
enterprise B
enterprise C
K
An ASG of scorers is
scaled up to one process
per core per kinesis shard
Scorers
ASG
Use-Case : Message Scoring
66
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Scorers apply the trust
model and send scored
messages downstream
Use-Case : Message Scoring
67
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
An ASG of importers is
scaled up to rapidly
import messages
DB
Use-Case : Message Scoring
69
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
Imported messages are
also consumed by the
alerter
DB
K
Alerters
ASG
Quarantine Email
70
Stream Processing Architecture
Component | Role | Details | Pros | Operability Model
S3 | Data Lake | All data stored in S3 via Kinesis Firehose | Scalable, Available, Performant, Serverless | Serverless
Kinesis | Messaging | Streaming transport modeled on Kafka | Scalable, Available, Serverless | Serverless
(logo only) | General Processing | ASG Replacement except for Rails Apps | Scalable, Available, Serverless | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Managed | Managed
Spark | Data Science Processing | Model Building | | We Operate
Airflow | Workflow Engine | Nightly model builds + some classic Ops cron workloads | Lightweight, DAGs as Code | We Operate
DB | Persistence for WebApp | Holds smaller subset of data needed for Web App | Rails + Postgres, ‘nuff said | We Operate
ES, Elasticache Redis | Persistence for WebApp | Aggregation + Search moved from DB to ES; Model Building queries moved to Elasticache Redis | Faster, more accurate for aggregates, frees up headroom for DB (polyglot persistence) | Managed
Innovations
NRT Pipeline Architecture
71
Apache Avro
What is Avro?
72
74
What is Avro?
Avro is a self-describing serialization format that supports
primitive data types : int, long, boolean, float, string, bytes, etc…
complex data types : records, arrays, unions, maps, enums, etc…
many language bindings : Java, Scala, Python, Ruby, etc…
The most common format for storing structured Big Data at rest in
HDFS, S3, Google Cloud Storage, etc…
Supports Schema Evolution!
Apache Avro
Why is it useful?
75
79
Why is Avro Useful?
enterprise A :
enterprise B :
enterprise C : Kinesis
v1
v2
v3
Agari SAAS
in AWS
v4
Agari is an IoT company!
Agari Sensors, deployed at customer sites, stream data to
Agari’s Cloud SAAS
Data is sent via Kinesis!
At any point in time, customers run different versions of the
Agari Sensor
These Sensors might send different format versions of the
data!
80
Why is Avro Useful?
enterprise A :
enterprise B :
enterprise C :
v1
v2
v3
Avro allows Agari to seamlessly handle different IoT data format
versions
Agari SAAS
in AWS
Kinesis v4
datum_reader = DatumReader( writers_schema = writers_schema,
readers_schema = readers_schema)
Requirements:
• Schemas are backward-compatible
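The DatumReader call above takes both the writer's and the reader's schema so that data written by an older Sensor can be read with a newer schema. The sketch below approximates the record-level part of that resolution in plain Python. It is not the avro library: it ignores type promotion and aliases, and the function name, the v2 schema, and the sample record are all hypothetical.

```python
def resolve_record(writer_record: dict, reader_schema: dict) -> dict:
    """Project a record decoded with the writer's schema onto the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_record:
            # Field known to both sides: keep the written value.
            resolved[name] = writer_record[name]
        elif "default" in field:
            # Field added on the reader side: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    # Fields present only in the writer's record are dropped.
    return resolved


# Hypothetical reader schema (v2) that adds a nullable field with a default.
reader_v2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["null", "string"], "default": None},
    ],
}

# Hypothetical record from an older writer, including a field v2 no longer reads.
v1_record = {"name": "Al", "favorite_number": 7, "legacy_flag": True}
print(resolve_record(v1_record, reader_v2))
```

The backward-compatibility requirement shows up in the ValueError branch: a field added without a default cannot be filled in when reading older data.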
81
Why is Avro Useful?
Agari SAAS in AWS
S1 S2 S3
s3 Spark
Avro Everywhere!
Avro is so useful, we don’t just use it to communicate between our
Sensors & our SAAS infrastructure
We also use it as the common data-interchange format between all
services (streaming & batch) within our AWS deployment
82
Why is Avro Useful?
Agari SAAS in AWS
S1 S2 S3
s3 Spark
Avro Everywhere!
Good Language Bindings :
Data Pipelines services are written in Java, Ruby, & Python
Apache Avro
By Example
83
87
{"namespace": "agari",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
complex type (record)
Schema name : User
3 fields in the record: 1 required, 2
optional
Avro Schema Example
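As a quick check of the "1 required, 2 optional" reading, the schema can be loaded as plain JSON and its fields classified. This sketch treats a field as optional when its type is a union that includes "null", which is how the slide uses the term; the helper name is hypothetical.

```python
import json

USER_SCHEMA = json.loads("""
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
""")


def is_nullable(field: dict) -> bool:
    """True when the field's type is a union that includes "null"."""
    field_type = field["type"]
    return isinstance(field_type, list) and "null" in field_type


required = [f["name"] for f in USER_SCHEMA["fields"] if not is_nullable(f)]
optional = [f["name"] for f in USER_SCHEMA["fields"] if is_nullable(f)]
print(required, optional)
```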
88
{"namespace": "agari",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
Avro Schema Data File Example
Schema : 0.0001 %
Data (x 1,000,000,000) : 99.999 %
90
{"namespace": "agari",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
Avro Schema Streaming Example
Binary Data block
Schema : 99 %
Data : 1 %
OVERHEAD!!
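To make the overhead concrete, the sketch below hand-encodes one small User record following Avro's binary encoding rules (zig-zag varints for int/long, length-prefixed UTF-8 strings, a branch index before each union value) and compares its size with the schema's JSON text. The record values are hypothetical, and the comparison ignores container-file framing; it only contrasts schema bytes with record bytes.

```python
import json

SCHEMA = {
    "namespace": "agari",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}


def zigzag_varint(n: int) -> bytes:
    """Encode an int/long as Avro does: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_user(name: str, favorite_number, favorite_color) -> bytes:
    """Binary-encode one User record; union branches follow the schema's order."""
    raw_name = name.encode("utf-8")
    out = zigzag_varint(len(raw_name)) + raw_name
    if favorite_number is None:
        out += zigzag_varint(1)  # branch 1 of ["int", "null"]; null has no body
    else:
        out += zigzag_varint(0) + zigzag_varint(favorite_number)
    if favorite_color is None:
        out += zigzag_varint(1)  # branch 1 of ["string", "null"]
    else:
        raw_color = favorite_color.encode("utf-8")
        out += zigzag_varint(0) + zigzag_varint(len(raw_color)) + raw_color
    return out


schema_bytes = len(json.dumps(SCHEMA, separators=(",", ":")).encode("utf-8"))
payload = encode_user("Al", 7, None)  # hypothetical record
overhead = schema_bytes / (schema_bytes + len(payload))
print(len(payload), schema_bytes, round(overhead, 3))
```

With one record per message, nearly all of the bytes on the wire are schema, which is the problem the schema registry in the next section addresses.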
Apache Avro
Schema Registry
91
92
Schema
Registry
(Lambda)
Avro Schema Registry
{"namespace": "agari",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
register_schema
Message
Producer (P)
93
Schema
Registry
(Lambda)
register_schema returns a UUID
Message
Producer (P)
Avro Schema Registry
94
Schema
Registry
(Lambda)
Message Producer sends UUID +
Message
Producer (P)
Data
Message
Consumer (C)
Avro Schema Registry
95
Schema
Registry
(Lambda)
Message
Producer (P)
Data
Message
Consumer (C)
getSchemaById (UUID)
Avro Schema Registry
96
Schema
Registry
(Lambda)
Message
Producer (P)
Data
Message
Consumer (C)
getSchemaById (UUID)
{"namespace": "agari",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
Avro Schema Registry
97
Schema
Registry
(Lambda)
Message
Producer (P)
Message
Consumer (C)
getSchemaById (UUID)
{"namespace": "agari",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
Message Consumers
• download & cache the schema
• then decode the data
Avro Schema Registry
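The register / send UUID / look up / cache flow can be sketched with an in-memory stand-in for the registry. The talk's registry runs on Lambda; everything here (class names, the idempotent registration, keying on sorted-key JSON) is an assumption made for illustration.

```python
import json
import uuid


class InMemorySchemaRegistry:
    """In-memory stand-in for the registry service described in the slides."""

    def __init__(self):
        self._schemas_by_id = {}
        self._ids_by_schema = {}

    def register_schema(self, schema: dict) -> str:
        # Assumption: registering the same schema twice returns the same UUID.
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids_by_schema:
            schema_id = str(uuid.uuid4())
            self._ids_by_schema[canonical] = schema_id
            self._schemas_by_id[schema_id] = schema
        return self._ids_by_schema[canonical]

    def get_schema_by_id(self, schema_id: str) -> dict:
        return self._schemas_by_id[schema_id]


class Consumer:
    """Downloads a schema once per UUID, then decodes from its local cache."""

    def __init__(self, registry: InMemorySchemaRegistry):
        self._registry = registry
        self._cache = {}
        self.registry_lookups = 0

    def schema_for(self, schema_id: str) -> dict:
        if schema_id not in self._cache:
            self.registry_lookups += 1
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
        return self._cache[schema_id]


registry = InMemorySchemaRegistry()
user_schema = {"type": "record", "name": "User",
               "fields": [{"name": "name", "type": "string"}]}

# Producer: register once, then send (UUID, data) instead of (schema, data).
schema_id = registry.register_schema(user_schema)
messages = [(schema_id, b"\x04Al"), (schema_id, b"\x06Bob")]

consumer = Consumer(registry)
for message_schema_id, _data in messages:
    consumer.schema_for(message_schema_id)
print(consumer.registry_lookups)
```

Each message now carries a fixed-size identifier instead of the full schema, and the consumer contacts the registry only the first time it sees a given UUID.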
98
enterprise A
enterprise B
enterprise C
K
Scorers
ASG
Kinesis
Importers
ASG
Imported messages are
also consumed by the
alerter
DB
K
Alerters
ASG
SR
SR
SR
Avro Schema Registry
Acknowledgments
100
None of this work would be possible without the
essential contributions of the team below
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Chris Buchanan
• Neil Chapin
• Wil Collins
• Don Spencer
• Scot Kennedy
• Natia Chachkhiani
• Patrick Cockwell
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle
• Gabriel Poon
• Spencer Sun
• Nathan Bryant
Questions? (@r39132)
101

Contenu connexe

Tendances

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
 
Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Sid Anand
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012Amazon Web Services
 
Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Eva Tse
 
Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Amazon Web Services
 
AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AW...
AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AW...AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AW...
AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AW...Amazon Web Services
 
Serverless Architectural Patterns & Best Practices
Serverless Architectural Patterns & Best PracticesServerless Architectural Patterns & Best Practices
Serverless Architectural Patterns & Best PracticesDaniel Zivkovic
 
SF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher BernerSF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher BernerChester Chen
 
New AWS Services for Bioinformatics
New AWS Services for BioinformaticsNew AWS Services for Bioinformatics
New AWS Services for BioinformaticsLynn Langit
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech TalksAmazon Web Services
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014Amazon Web Services
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScaleDataWorks Summit
 
AWS Kinesis - Streams, Firehose, Analytics
AWS Kinesis - Streams, Firehose, AnalyticsAWS Kinesis - Streams, Firehose, Analytics
AWS Kinesis - Streams, Firehose, AnalyticsSerhat Can
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...Amazon Web Services
 
Streaming Data Analytics with Amazon Redshift and Kinesis Firehose
Streaming Data Analytics with Amazon Redshift and Kinesis FirehoseStreaming Data Analytics with Amazon Redshift and Kinesis Firehose
Streaming Data Analytics with Amazon Redshift and Kinesis FirehoseAmazon Web Services
 
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)Amazon Web Services
 

Tendances (20)

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
 
Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014
 
Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Introduction to AWS Step Functions:
Introduction to AWS Step Functions:
 
AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AW...
AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AW...AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AW...
AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AW...
 
Serverless Architectural Patterns & Best Practices
Serverless Architectural Patterns & Best PracticesServerless Architectural Patterns & Best Practices
Serverless Architectural Patterns & Best Practices
 
SF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher BernerSF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher Berner
 
New AWS Services for Bioinformatics
New AWS Services for BioinformaticsNew AWS Services for Bioinformatics
New AWS Services for Bioinformatics
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
 
Serverless - State Of the Union
Serverless - State Of the UnionServerless - State Of the Union
Serverless - State Of the Union
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte Scale
 
AWS Kinesis - Streams, Firehose, Analytics
AWS Kinesis - Streams, Firehose, AnalyticsAWS Kinesis - Streams, Firehose, Analytics
AWS Kinesis - Streams, Firehose, Analytics
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Streaming Data Analytics with Amazon Redshift and Kinesis Firehose
Streaming Data Analytics with Amazon Redshift and Kinesis FirehoseStreaming Data Analytics with Amazon Redshift and Kinesis Firehose
Streaming Data Analytics with Amazon Redshift and Kinesis Firehose
 
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
 

Similaire à Cloud Native Data Pipelines (DataEngConf SF 2017)

Cloud Native Data Pipelines
Cloud Native Data PipelinesCloud Native Data Pipelines
Cloud Native Data PipelinesBill Liu
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Sid Anand
 
Airflow @ Agari
Airflow @ Agari Airflow @ Agari
Airflow @ Agari Sid Anand
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCMark Smith
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaHelen Rogers
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPDaniel Zivkovic
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Automate all your EMR related activities
Automate all your EMR related activitiesAutomate all your EMR related activities
Automate all your EMR related activitiesEitan Sela
 
Serverless Data Lake on AWS
Serverless Data Lake on AWSServerless Data Lake on AWS
Serverless Data Lake on AWSThanh Nguyen
 
AWS Startup Webinar | Developing on AWS
AWS Startup Webinar | Developing on AWSAWS Startup Webinar | Developing on AWS
AWS Startup Webinar | Developing on AWSAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Precisely
 
Data analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenueData analytics master class: predict hotel revenue
Data analytics master class: predict hotel revenueKris Peeters
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreDataStax Academy
 
Serverless in production, an experience report (FullStack 2018)
Serverless in production, an experience report (FullStack 2018)Serverless in production, an experience report (FullStack 2018)
Serverless in production, an experience report (FullStack 2018)Yan Cui
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftAmazon Web Services
 

Cloud Native Data Pipelines (DataEngConf SF 2017)

  • 1. Cloud Native Data Pipelines. Sid Anand (@r39132), DataEngConf SF 2017
  • 2. About Me: Work[ed|s] @ (logos); Committer & PPMC on Apache Airflow; Father of 2; Co-Chair for (logo)
  • 4. Agari: What We Do
  • 9. Agari: What We Do. Agari’s previous EP version (batch): Enterprise Customers → email metadata → apply trust models → email metadata + trust score
  • 10. Agari: What We Do. Agari’s current EP version (near-real time): Enterprise Customers → email metadata → apply trust models → email metadata + trust score → Quarantine, Label, PassThrough
  • 11. Data Pipelines: BI vs Predictive
  • 15. Data Pipelines: BI, Predictive, Common; Focus of this talk. BI pipeline: Web Servers; OLTP DB (MySQL, Oracle, Cassandra); ETL (batch); Data Warehouse (Teradata, Redshift, BigQuery); Reporting Tools; Query Browsers. Predictive pipeline: Data Source; ETL (batch or streaming) with Spark, Flink, Beam, Storm; OLTP DB or cache (MySQL, Oracle, Cassandra, Redis); Web Servers; Ranking (Search, News Feed), Recommender Products, Fraud Detection / Prevention
  • 17. Cloud Native Data Pipelines: Big Data companies like LinkedIn, Facebook, Twitter, & Google have large teams to manage their data pipelines. Most start-ups run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?
  • 18. Cloud Native Data Pipelines: Cloud Native Techniques and Open Source Technologies ~ Data Pipelines seen in Big Data companies
  • 19. Design Goals: Desirable Qualities of a Resilient Data Pipeline
  • 20. Desirable Qualities of a Resilient Data Pipeline: Correctness, Timeliness, Operability, Cost
  • 21. Desirable Qualities of a Resilient Data Pipeline. Correctness: data integrity (no loss, etc.); expected data distributions. Timeliness: all output within time-bound SLAs. Operability: minimize operational fatigue / automate everything; fine-grained monitoring & alerting of correctness & timeliness SLAs; quick recoverability. Cost: pay-as-you-go
  • 22. Predictive Analytics @ Agari: Use Cases
  • 23. Use Cases (Enterprise Protect): apply trust models (message scoring), batch + near real time; build trust models, batch
  • 24. Use Cases (Enterprise Protect): apply trust models (message scoring), batch + near real time; build trust models, batch. Focus of this talk: applying trust models (message scoring)
  • 25. Use-Case: Message Scoring (batch). Batch Pipeline Architecture
  • 26. Use-Case: Message Scoring. Each enterprise (A, B, C) uploads an Avro file to S3 every 15 minutes
  • 27. Use-Case: Message Scoring. Airflow kicks off a Spark message scoring job every hour (EMR)
  • 28. Use-Case: Message Scoring. The Spark job writes scored messages and stats to another S3 bucket
  • 29. Use-Case: Message Scoring. This triggers SNS/SQS message events
  • 30. Use-Case: Message Scoring. An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
  • 31. Use-Case: Message Scoring. The importers rapidly ingest scored messages and aggregate statistics into the DB
  • 32. Use-Case: Message Scoring. Users receive alerts of untrusted emails & can review them in the web app
  • 33. Use-Case: Message Scoring. Airflow manages the entire process
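The batch flow on slides 26 to 33 can be read as a small dependency graph. The sketch below is only a stand-in written with Python's standard-library `graphlib` (Python 3.9+); it is not Airflow's API, and the task names are hypothetical labels for the steps the slides describe.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the batch steps described in the slides.
# Each key maps to the set of tasks that must finish before it can start.
BATCH_PIPELINE = {
    "wait_for_avro_uploads": set(),
    "spark_score_messages": {"wait_for_avro_uploads"},
    "write_scored_output_to_s3": {"spark_score_messages"},
    "notify_via_sns_sqs": {"write_scored_output_to_s3"},
    "import_into_db": {"notify_via_sns_sqs"},
    "alert_users": {"import_into_db"},
}


def execution_order(dag):
    """Return task names in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(BATCH_PIPELINE), start=1):
        print(step, task)
```

Keeping the dependencies in one Python structure is the "DAGs as code" idea in miniature: the ordering is derived from the declared dependencies rather than from a hand-maintained schedule.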
  • 34. Architectural Components (Role (Component): Uses | Salient Features | Operability Model)
      • Data Lake (S3): all data stored in S3; all processing uses S3 | Scalable, Available, Performant | Serverless
      • Messaging (SNS, SQS): Reliable, Transactional, Pub/Sub | Scalable, Available, Performant | Serverless
      • General Processing (ASG): used for importing, data cleansing, business logic | Scalable, Available, Performant | Managed
      • Data Science Processing: Aggregation; Model Building; Scoring | Nice programming model at the cost of debugging complexity | We Operate
      • Workflow Engine: coordinates all Spark jobs & complex flows | Lightweight, DAGs as Code, Steep learning curve | We Operate
      • Persistence for WebApp (DB): holds subset of data needed for Web App | Rails + Postgres ‘nuff said | We Operate
  • 35. Tackling Cost & Timeliness: Leveraging the AWS Cloud
  • 36. Tackling Cost (between daily runs vs. during daily runs). When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
  • 37. Tackling Cost (between hourly runs vs. during hourly runs). When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR. This does not help when runs are hourly, since AWS charges at an hourly rate for EC2 instances!
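The cost argument on slides 36 and 37 is simple arithmetic, sketched below under the hourly billing the slide describes. The run durations are hypothetical, and the function assumes each run starts a fresh instance and finishes before the next run begins.

```python
import math

HOURS_PER_DAY = 24


def billed_instance_hours(run_minutes, runs_per_day):
    """Instance-hours billed per day for one instance under hourly billing.

    Every started hour is billed in full, so a run is rounded up to whole
    hours before multiplying by the number of runs.
    """
    hours_per_run = math.ceil(run_minutes / 60)
    return hours_per_run * runs_per_day


def hours_saved_vs_always_on(run_minutes, runs_per_day):
    """Instance-hours saved per day compared with leaving the instance up."""
    return HOURS_PER_DAY - billed_instance_hours(run_minutes, runs_per_day)


if __name__ == "__main__":
    # One 60-minute run per day: 1 billed hour, 23 hours saved.
    print(hours_saved_vs_always_on(run_minutes=60, runs_per_day=1))
    # One 20-minute run every hour: 24 billed hours, nothing saved.
    print(hours_saved_vs_always_on(run_minutes=20, runs_per_day=24))
```

With hourly rounding, even a short run that repeats every hour is billed like an always-on instance, which is the "no cost savings" outcome the slide points to.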
  • 39. ASG - Overview. What is it? A means to automatically scale out/in clusters to handle variable load/traffic. A means to keep a cluster/service of a fixed size always up
  • 40. ASG - Data Pipeline. SQS → Importer ASG (importers, scale out/in) → DB
  • 41. ASG: CPU-based. CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant (chart series: Sent, ACKd/Recvd, CPU)
  • 42. ASG: CPU-based. Premature scale-in: the CPU drops to noise levels before all messages are consumed. This causes scale-in to occur while the last few messages are still being committed (chart series: Sent, Recv, CPU)
  • 43. ASG: Queue-based. Scale-out: when Visible Messages > 0 (a.k.a. when queue depth > 0); this causes the ASG to grow. Scale-in: when Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d); this causes the ASG to shrink
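The queue-based rule on slide 43 can be written as a small decision function. This is a minimal sketch, not the talk's actual alarm configuration: the function and action names are hypothetical, and giving scale-out precedence when both conditions hold is an assumption. For SQS, the two counters would likely come from the CloudWatch metrics `ApproximateNumberOfMessagesVisible` and `ApproximateNumberOfMessagesNotVisible`.

```python
def scaling_decision(visible_messages, in_flight_messages):
    """Decide an ASG action from two queue counters.

    visible_messages: messages waiting in the queue (queue depth).
    in_flight_messages: messages received but not yet ACK'd
        (the "invisible" messages on the slide).
    """
    if visible_messages > 0:
        # Work is waiting: grow the group.
        return "scale_out"
    if in_flight_messages == 0:
        # Queue is empty and the last in-flight message was ACK'd: shrink.
        return "scale_in"
    # Queue is empty but some messages are still being processed.
    return "hold"


if __name__ == "__main__":
    print(scaling_decision(visible_messages=5, in_flight_messages=0))
    print(scaling_decision(visible_messages=0, in_flight_messages=3))
    print(scaling_decision(visible_messages=0, in_flight_messages=0))
```

The "hold" branch is what avoids the premature scale-in from slide 42: instances stay up while messages are still in flight, regardless of how low the CPU has dropped.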
  • 45. ASG - Build & Deploy (Component: Role | Details)
      • Terraform: spins up cloud resources | Spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them
      • Ansible: a better version of Chef & Puppet; sets up an EC2 instance | Agentless, idempotent, & declarative tool to set up EC2 instances by installing & configuring packages, and more
      • Packer: spins up an EC2 instance for the purposes of building an AMI | Can be used with Ansible & Terraform to bake AMIs & launch Auto Scaling Groups
  • 46. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas!
  • 47. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas! Step 2: Packer runs an Ansible role against the EC2 node to set it up.
  • 48. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas! Step 2: Packer runs an Ansible role against the EC2 node to set it up. Step 3: Snapshots the machine & registers the AMI.
  • 49. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas! Step 2: Packer runs an Ansible role against the EC2 node to set it up. Step 3: Snapshots the machine & registers the AMI. Step 4: Terminates the EC2 instance!
  • 50. ASG - Build & Deploy. Step 1: Packer spins up a temporary EC2 node - a blank canvas! Step 2: Packer runs an Ansible role against the EC2 node to set it up. Step 3: Snapshots the machine & registers the AMI. Step 4: Terminates the EC2 instance! Step 5: Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG)
  • 51. Desirable Qualities of a Resilient Data Pipeline: Correctness, Timeliness, Operability, Cost. • ASG • EMR Spark Daily • ASG • EMR Spark Hourly ASG • No Cost Savings
  • 53. Tackling Operability: Requirements. A simple way to author, configure, and manage workflows. Provides visual insight into the state & performance of workflow runs. Integrates with our alerting and monitoring tools
  • 55. Apache Airflow - Authoring DAGs. Airflow: Author DAGs in Python! No need to bundle many config files!
  • 56. Apache Airflow - Authoring DAGs. Airflow: Visualizing a DAG
  • 57. Apache Airflow - Managing DAGs. Airflow: It’s easy to manage multiple DAGs
  • 58. Apache Airflow - Perf. Insights. Airflow: Gantt chart view reveals the slowest tasks for a run!
  • 59. Apache Airflow - Perf. Insights. Airflow: Task Duration chart view shows task completion time trends!
  • 60. Apache Airflow - Alerting. Airflow: …And easy to integrate with Ops tools!
  • 61. Apache Airflow - Correctness
  • 62. Desirable Qualities of a Resilient Data Pipeline: Correctness, Timeliness, Operability, Cost
  • 63. Use-Case: Message Scoring (near-real time). NRT Pipeline Architecture
  • 64. Use-Case: Message Scoring. Enterprises A, B, and C batch-put to Kinesis every second
  • 65. Use-Case: Message Scoring. An ASG of scorers is scaled up to one process per core per Kinesis shard
  • 66. Use-Case: Message Scoring. Scorers apply the trust model and send scored messages downstream
  • 67. Use-Case: Message Scoring. An ASG of importers is scaled up to rapidly import messages into the DB
  • 68. Use-Case: Message Scoring. Imported messages are also consumed by the alerter (Alerters ASG)
  • 69. Use-Case: Message Scoring. Imported messages are also consumed by the alerter (Alerters ASG); Quarantine Email
  • 70. Stream Processing Architecture (Role (Component): Details | Pros | Operability Model)
      • Data Lake (S3): all data stored in S3 via Kinesis Firehose | Scalable, Available, Performant, Serverless | Serverless
      • Messaging (Kinesis): streaming transport modeled on Kafka | Scalable, Available, Serverless | Serverless
      • General Processing: ASG replacement except for Rails apps | Scalable, Available, Serverless | Serverless
      • General Processing (ASG): used for importing, data cleansing, business logic | Scalable, Available, Managed | Managed
      • Data Science Processing: Model Building | We Operate
      • Workflow Engine: nightly model builds + some classic Ops cron workloads | Lightweight, DAGs as Code | We Operate
      • Persistence for WebApp (DB): holds smaller subset of data needed for Web App | Rails + Postgres ‘nuff said | We Operate
      • Persistence for WebApp: Aggregation + Search moved from DB to ES; Model Building queries moved to ElastiCache Redis | Faster, more accurate for aggregates; frees up headroom for DB (polyglot persistence) | Managed
  • 73. 73 What is Avro? Avro is a self-describing serialization format that supports primitive data types : int, long, boolean, float, string, bytes, etc… complex data types : records, arrays, unions, maps, enums, etc… many language bindings : Java, Scala, Python, Ruby, etc…
  • 74. 74 What is Avro? Avro is a self-describing serialization format that supports primitive data types : int, long, boolean, float, string, bytes, etc… complex data types : records, arrays, unions, maps, enums, etc… many language bindings : Java, Scala, Python, Ruby, etc… The most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc… Supports Schema Evolution!
  • 75. Apache Avro: Why is it useful?
  • 79. Why is Avro Useful? Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SaaS. Data is sent via Kinesis! At any point in time, customers run different versions of the Agari Sensor. These Sensors might send different format versions of the data! Diagram labels: enterprise A : v1, enterprise B : v2, enterprise C : v3; Kinesis; Agari SaaS in AWS; v4.
  • 80. Why is Avro Useful? Avro allows Agari to seamlessly handle different IoT data format versions: datum_reader = DatumReader(writers_schema=writers_schema, readers_schema=readers_schema). Requirements: schemas are backward-compatible. Diagram labels: enterprise A : v1, enterprise B : v2, enterprise C : v3; Kinesis; v4; Agari SaaS in AWS.
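To make the reader/writer schema idea on this slide concrete, here is a minimal sketch of schema resolution for flat records, using only the Python standard library. It is a toy stand-in, not the avro library's DatumReader: resolve_record, the V1 and V2 schemas, and the sample messages are all hypothetical names invented for illustration. The sketch assumes the usual backward-compatibility rule that a field added in a newer schema carries a default.

```python
# Toy illustration of Avro-style schema resolution for flat records.
# This is NOT the avro library; resolve_record is a hypothetical helper.

def resolve_record(datum, writers_schema, readers_schema):
    """Project a record written with writers_schema onto readers_schema."""
    writer_fields = {f["name"] for f in writers_schema["fields"]}
    resolved = {}
    for field in readers_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field exists on both sides: take the written value.
            resolved[name] = datum[name]
        elif "default" in field:
            # Field is new to the reader: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    return resolved

# Hypothetical schema sent by an older sensor version.
V1 = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
]}

# Hypothetical newer schema: adds an optional field with a default.
V2 = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_color", "type": ["null", "string"], "default": None},
]}

old_message = {"name": "alice"}
new_message = {"name": "bob", "favorite_color": "blue"}

# The consumer always reads with V2, whatever version the sensor wrote with.
from_v1 = resolve_record(old_message, writers_schema=V1, readers_schema=V2)
from_v2 = resolve_record(new_message, writers_schema=V2, readers_schema=V2)
print(from_v1)
print(from_v2)
```

In this sketch the consumer only ever sees records shaped like V2, which is the property that lets one service handle sensors running several versions at once.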
  • 81. Why is Avro Useful? Avro Everywhere! Avro is so useful that we don’t just use it to communicate between our Sensors & our SaaS infrastructure. We also use it as the common data-interchange format between all services (streaming & batch) within our AWS deployment. Diagram labels: Agari SaaS in AWS; S1, S2, S3; s3; Spark.
  • 82. Why is Avro Useful? Avro Everywhere! Good Language Bindings: Data Pipeline services are written in Java, Ruby, & Python. Diagram labels: Agari SaaS in AWS; S1, S2, S3; s3; Spark.
  • 87. Avro Schema Example
      {"namespace": "agari",
       "type": "record",
       "name": "User",
       "fields": [
         {"name": "name", "type": "string"},
         {"name": "favorite_number", "type": ["int", "null"]},
         {"name": "favorite_color", "type": ["string", "null"]}
       ]
      }
      Annotations: complex type (record); schema name: User; 3 fields in the record: 1 required, 2 optional.
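The "1 required, 2 optional" annotation follows from the field types: in this schema, a field is optional when its type is a union that includes "null". A small sketch, using only the json module, that derives the same split from the slide's schema. The helper name is_optional is a hypothetical one chosen for illustration.

```python
import json

# The User schema from the slide, as JSON text.
SCHEMA_JSON = """
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
"""

schema = json.loads(SCHEMA_JSON)

def is_optional(field_type):
    """A field is treated as optional when its type is a union containing "null"."""
    return isinstance(field_type, list) and "null" in field_type

required = [f["name"] for f in schema["fields"] if not is_optional(f["type"])]
optional = [f["name"] for f in schema["fields"] if is_optional(f["type"])]

print(required)  # ['name']
print(optional)  # ['favorite_number', 'favorite_color']
```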
  • 88. Avro Schema Data File Example. A data file stores the schema (the User schema from slide 87) once, followed by the data records (Data x 1,000,000,000). Share of the file: Schema 0.0001 %, Data 99.9999 %.
  • 90. Avro Schema Streaming Example. When each streamed message carries the schema (the User schema from slide 87) alongside a single binary data block, the share of the message is: Schema 99 %, Data 1 %. OVERHEAD!!
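The contrast between the data-file and streaming cases is simple arithmetic: the schema's cost is fixed, so its share depends on how many records it is amortized over. A rough sketch follows. The 20-byte encoded record size is an assumption made only for illustration, and schema_share is a hypothetical helper, so the resulting percentages are indicative and will not exactly match the figures on the slides.

```python
import json

# The User schema from the slides, serialized compactly to measure its size.
schema = {
    "namespace": "agari",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}
schema_bytes = len(json.dumps(schema, separators=(",", ":")).encode("utf-8"))

# ASSUMPTION: an encoded record is about 20 bytes (illustrative only).
RECORD_BYTES = 20

def schema_share(schema_size, record_size, n_records):
    """Fraction of the total payload taken up by the schema."""
    return schema_size / (schema_size + record_size * n_records)

# Data file: one schema amortized over a billion records.
bulk = schema_share(schema_bytes, RECORD_BYTES, 1_000_000_000)

# Streaming: every message carries the schema plus a single record.
single = schema_share(schema_bytes, RECORD_BYTES, 1)

print(f"schema share in a 1e9-record file: {bulk:.10%}")
print(f"schema share in a 1-record message: {single:.1%}")
```

Under these assumptions the schema is negligible in a large file but dominates a single-record message, which is the overhead the slide calls out.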
  • 92. Avro Schema Registry. The Message Producer (P) calls register_schema on the Schema Registry (Lambda), passing its schema (the User schema from slide 87).
  • 93. Avro Schema Registry. register_schema returns a UUID to the Message Producer (P).
  • 94. Avro Schema Registry. The Message Producer (P) sends UUID + Data to the Message Consumer (C).
  • 96. Avro Schema Registry. The Message Consumer (C) calls getSchemaById(UUID) on the Schema Registry (Lambda), which returns the schema (the User schema from slide 87).
  • 97. Avro Schema Registry. Message Consumers download & cache the schema, then decode the data.
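The register / send UUID + data / look up / cache flow on the preceding slides can be sketched in a few lines. This is a minimal in-process illustration, not Agari's implementation: the registry here is a plain Python class rather than a Lambda, the payload is encoded as JSON rather than Avro binary, and the names InMemorySchemaRegistry, Producer, Consumer, and the lookups counter are all hypothetical. Deriving the UUID from the schema text (uuid5) is also an assumption; the slides only say that register_schema returns a UUID.

```python
import json
import uuid

class InMemorySchemaRegistry:
    """Stand-in for the Schema Registry (a Lambda in the slides)."""

    def __init__(self):
        self._schemas = {}
        self.lookups = 0  # counts get_schema_by_id calls, to show caching

    def register_schema(self, schema):
        # ASSUMPTION: derive the UUID from the schema text so that
        # registering the same schema twice yields the same id.
        canonical = json.dumps(schema, sort_keys=True)
        schema_id = uuid.uuid5(uuid.NAMESPACE_URL, canonical)
        self._schemas[schema_id] = canonical
        return schema_id

    def get_schema_by_id(self, schema_id):
        self.lookups += 1
        return json.loads(self._schemas[schema_id])

class Producer:
    def __init__(self, registry, schema):
        self._schema_id = registry.register_schema(schema)

    def encode(self, record):
        # Message layout: 16-byte UUID, then the payload.
        # JSON stands in for Avro binary encoding in this sketch.
        return self._schema_id.bytes + json.dumps(record).encode("utf-8")

class Consumer:
    def __init__(self, registry):
        self._registry = registry
        self._cache = {}

    def decode(self, message):
        schema_id = uuid.UUID(bytes=message[:16])
        if schema_id not in self._cache:
            # Download the schema once, then reuse the cached copy.
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
        schema = self._cache[schema_id]
        record = json.loads(message[16:].decode("utf-8"))
        return schema["name"], record

user_schema = {"namespace": "agari", "type": "record", "name": "User",
               "fields": [{"name": "name", "type": "string"}]}

registry = InMemorySchemaRegistry()
producer = Producer(registry, user_schema)
consumer = Consumer(registry)

first = consumer.decode(producer.encode({"name": "alice"}))
second = consumer.decode(producer.encode({"name": "bob"}))
print(first, second, registry.lookups)
```

Each message carries a fixed 16-byte identifier instead of the full schema, and the consumer contacts the registry only once per schema version.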
  • 99. Avro Schema Registry. Diagram labels: enterprise A, enterprise B, enterprise C; Kinesis; Importers ASG; K; Scorers ASG; K; Alerters ASG; DB; SR (three instances). Note: imported messages are also consumed by the alerter.
  • 100. Acknowledgments. None of this work would be possible without the essential contributions of the team below: Vidur Apparao, Stephen Cattaneo, Jon Chase, Andrew Flury, William Forrester, Chris Haag, Chris Buchanan, Neil Chapin, Wil Collins, Don Spencer, Scot Kennedy, Natia Chachkhiani, Patrick Cockwell, Kevin Mandich, Gabriel Ortiz, Jacob Rideout, Josh Yang, Julian Mehnle, Gabriel Poon, Spencer Sun, Nathan Bryant.