ADVENTURES IN SCALING
FROM ZERO TO 5 BILLION
DATA POINTS PER DAY
Dave Torok
Distinguished Architect
Comcast Corporation
2 April, 2019
Flink Forward – San Francisco 2019
2
COMCAST CUSTOMER RELATIONSHIPS
30.3 MILLION OVERALL CUSTOMER
RELATIONSHIPS AT 2018 YEAR END
25.1 MILLION RESIDENTIAL HIGH-SPEED
INTERNET CUSTOMERS AT 2018 YEAR
END
1.2 MILLION RESIDENTIAL HIGH-SPEED
INTERNET CUSTOMER NET ADDITIONS IN
2018
3
DELIVER THE ULTIMATE CUSTOMER EXPERIENCE
IS THE CUSTOMER HAVING A GOOD EXPERIENCE
FOR HIGH SPEED DATA (HSD) SERVICE?
IF THE CUSTOMER ENGAGES US DIGITALLY, CAN
WE OFFER A SELF-SERVICE SOLUTION?
IF THERE IS AN ISSUE CAN WE OFFER OUR AGENTS
AND TECHNICIANS A DIAGNOSIS TO HELP SOLVE
THE PROBLEM QUICKER?
REDUCE THE TIME TO RESOLVE ISSUES
REDUCE COST TO THE BUSINESS AND THE
CUSTOMER
4
HIGH LEVEL CONCEPT
Customer
Experience
Diagnosis
Platform
Service
Delivery
Systems
Predictions /
Diagnosis /
Treatments
CUSTOMER
ENGAGEMENT
PLATFORM
CUSTOMER
EXPERIENCE
INDICATORS
Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
5
CUSTOMER EXPERIENCE INDICATORS
6
NINE ADVENTURES IN SCALING
THE TRIGGER AND DIAGNOSIS PROBLEM
THE FEATURE STORE PROBLEM
THE CHECKPOINT PROBLEM
THE REST PROBLEM
THE VOLUME PROBLEM
THE TRIGGER AND DIAGNOSIS PROBLEM
THE INEFFICIENT OBJECT HANDLING PROBLEM
THE CUSTOMER STATE PROBLEM
THE REALLY HIGH VOLUME AND FEATURE STORE PROBLEM #2
7
ML FRAMEWORK ARCHITECTURE – 2018
“EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE” – SF FLINK FORWARD 2018
REQUESTING
APPLICATION
REQUESTING
APPLICATION
Feature Creation
Pipelines
ML FRAMEWORK
Online Feature
Store
Model/ Feature
Metadata
Feature Check
Async IO
Feature
Assembly
Model
Execution
Prediction /
Outcome Store
Prediction Store
Sink
Prediction
Store
REST Service
Feature Store
REST Service
Model
Execution
Request
REST
Service
Model
Prediction
REST Service
Model Execute
Async IO
FEATURE
SOURCES /
DATA AND
SERVICES
THE TRIGGER AND DIAGNOSIS PROBLEM
9
MAY 2018 – INITIAL VOLUMES
INDICATOR #1
9 MILLION
INDICATOR #2
166 MILLION
INDICATOR #3
1.2 MILLION
INDICATOR #4
2500 (SMALL TRIAL)
TOTAL
175
MILLION / DAY
1 0
175 MILLION PREDICTIONS?
175 MILLION
DATA
POINTS
??????
PREDICTION /
DIAGNOSIS
OUTPUT
STORE
1 1
INTRODUCE A ‘TRIGGER’ CONDITION
[Diagram: each data point stream arrives as a sequence of GREEN and RED states, for example GREEN, GREEN, GREEN, RED, GREEN, GREEN]
DETECT STATE CHANGE
SEND FOR DIAGNOSIS
1 5
State
Flow
And
Trigger
INTRODUCE FLINK LOCAL STATE
STATE MANAGER AND TRIGGER
175 MILLION
DATA
POINTS
Customer
Experience
Data Point
Producers
Customer
Experience
Diagnosis
Platform
“A FEW
MILLION”
TRIGGERS
FLINK
LOCAL
STATE
PREDICTION /
DIAGNOSIS
OUTPUT
STORE
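The state-change trigger described on the preceding slides can be sketched in plain Java. This is a minimal illustration, not the presenter's code: the class name `TriggerSketch`, the `Status` enum, and the rule "trigger only on a transition into RED" are assumptions made for the example. A `HashMap` stands in for the Flink keyed local state that the slide refers to.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in for the keyed "last known state" a Flink trigger flow would keep. */
final class TriggerSketch {
    enum Status { GREEN, RED }

    // Last status seen per (entity id, data point type); in Flink this would be keyed state.
    private final Map<String, Status> lastStatus = new HashMap<>();

    /**
     * Records the new status and reports whether it should be sent for diagnosis.
     * Assumption: only a transition into RED triggers (including a first-ever RED).
     */
    boolean shouldTrigger(String entityId, String dataPointType, Status current) {
        String key = entityId + "|" + dataPointType;
        Status previous = lastStatus.put(key, current);
        return current == Status.RED && previous != Status.RED;
    }
}
```

With this rule, a stream of GREEN, GREEN, RED, RED, GREEN, RED for one key produces two triggers (the first and the last RED), which is how 175 million data points can shrink to a much smaller number of diagnosis requests.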
THE REST PROBLEM
1 7
ML FRAMEWORK ARCHITECTURE – 2018
“EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE” – SF FLINK FORWARD 2018
REQUESTING
APPLICATION
REQUESTING
APPLICATION
Feature Creation
Pipelines
ML FRAMEWORK
Online Feature
Store
Model/ Feature
Metadata
Feature Check
Async IO
Feature
Assembly
Model
Execution
Prediction /
Outcome Store
Prediction Store
Sink
Prediction
Store
REST Service
Feature Store
REST Service
Model
Execution
Request
REST
Service
Model
Prediction
REST Service
Model Execute
Async IO
FEATURE
SOURCES /
DATA AND
SERVICES
1 8
ML FRAMEWORK
State Flow
And Model
Execution
Trigger
INITIAL PLATFORM ARCHITECTURE
Online Feature
Store
Model/ Feature
Metadata
Feature Getter
Async IO
Feature
Assembly
Feature
Sink
Model
Execution
Prediction /
Outcome Store
Prediction Store
Sink
Prediction
Store
REST Service
Feature Store
REST Service
Customer
Experience
Data Point
Producers
Model
Execution
Request
REST
Service
Model
Prediction
REST Service
FLINK
LOCAL
STATE
Model Execute
Async IO
REQUESTING
APPLICATION
USE CASE
1 9
ORIGINAL ML FRAMEWORK ARCHITECTURE
EVERY COMMUNICATION
WAS A REST SERVICE
ML FRAMEWORK
ISOLATED FROM DATA
POINT USE CASE
BY DESIGN
FLINK ASYNC + REST IS
DIFFICULT TO SCALE
ELASTICALLY
HTTP
REQUEST
RESPONSE
2 0
TUNING ASYNC I/O – LATENCY AND VOLUME
MAX THROUGHPUT PER SECOND = (SLOTS) * (THREADS) * (1000 / TASK DURATION MSEC)
(12 SLOTS) * (20 THREADS) * (1000 / 300) = 800 / SECOND
APACHE HTTP CLIENT / EXECUTOR THREADS
    setMaxTotal
    setDefaultMaxPerRoute (THREADS)
FLINK ASYNC MAX OUTSTANDING
    RichAsyncFunction “Capacity”
    AsyncDataStream.unorderedWait() with max capacity parameter
RATE THROTTLING
    // Split the job-wide request budget evenly across the parallel subtasks,
    // never letting a subtask's share drop below one request per period.
    if (numberOfRequestsPerPeriod > 0) {
        int numberOfProcesses = getRuntimeContext()
            .getNumberOfParallelSubtasks();
        numberOfRequestsPerPeriodPerProcess =
            Math.max(1, numberOfRequestsPerPeriod / numberOfProcesses);
    }
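The throughput formula and the per-subtask throttle on this slide can be checked with a few lines of plain Java. This is only a sketch of the arithmetic: the class and method names are made up for the example, and nothing here calls Flink or the Apache HTTP client.

```java
/** Arithmetic from the slide: async REST throughput ceiling and per-subtask throttle share. */
final class AsyncThroughput {
    /** slots * threads * (1000 / task duration in ms), computed so the division happens last. */
    static double maxPerSecond(int slots, int threadsPerSlot, double taskDurationMillis) {
        return slots * threadsPerSlot * 1000.0 / taskDurationMillis;
    }

    /** Same integer split as the slide's throttling snippet: at least one request per subtask. */
    static int perSubtaskQuota(int requestsPerPeriod, int parallelSubtasks) {
        return Math.max(1, requestsPerPeriod / parallelSubtasks);
    }
}
```

`maxPerSecond(12, 20, 300)` gives 800, matching the slide. The formula also shows why REST is hard to scale elastically: with a 300 ms call, the only levers are more slots or more threads.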
2 1
State Flow
And Model
Execution
Trigger
REPLACE REST CALLS WITH KAFKA (I)
Online Feature
Store
Model/ Feature
Metadata
Feature Getter
Async IO
Feature
Assembly
Prediction /
Outcome Store
Prediction
Store
REST Service
Feature Store
REST Service
Customer
Experience
Data Point
Producers
FLINK
LOCAL
STATE
REQUESTING
APPLICATION
Feature Sink
(JDBCAppendTableSink)
Model
Execution
Prediction Store Sink
(JDBCAppendTableSink)
2 2
REPLACE REST CALL WITH KAFKA (II)
Online Feature
Store
Model/ Feature
Metadata
Feature
Assembly
Feature Sink Flow
(JDBCAppendTableSink)
Model
Execution
Prediction /
Outcome Store
Prediction Store Sink
(JDBCAppendTableSink)
Customer
Experience
Data Point
Producers
FLINK
LOCAL
STATE
State Flow
And Trigger
FLINK
LOCAL
STATE
FLINK
LOCAL
STATE
Feature Store
Manager
FLINK
LOCAL
STATE
THE INEFFICIENT OBJECT HANDLING PROBLEM
2 7
HAS THIS EVER HAPPENED TO YOU?
• New “JOLT” JSON Transformer (but don’t cache it)
    • New ObjectMapper() (but don’t cache it)
    • Parse JSON string into Map
    • Load and parse “Transform” JSON from the resource path (but don’t cache it)
    • Serialize transformed Map back into string result
• Then let’s make a NEW JSONObject from that string
• Go deep into the path to get value
• Cast exception? STATUS_INVALID

    // Anti-pattern: a new transformer is built, and the spec reloaded, for every record.
    String transformedJson =
        new JoltJsonTransformer()
            .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH));
    // The transformed string is then parsed a second time into a JSONObject.
    JSONObject inputPayload = new JSONObject(transformedJson);
    try {
        aType = inputPayload.getJSONObject("info")
            .getJSONObject("aType")
            .getString("string");
    } catch (Exception e) {
        // ignore, invalid object
        aType = STATUS_INVALID;
    }
2 8
SLIGHTLY MORE EFFICIENT SOLUTION
Cache our new ObjectMapper()
Parse JSON string into Map ONCE
Lightweight Map Traversal utility
Java Stream syntax
Optional.ofNullable
Optional.orElse
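    // Create the ObjectMapper once and reuse it for every record.
    if (mapper == null) {
        mapper = new ObjectMapper();
    }
    // Parse the JSON string into a Map a single time.
    Map<?, ?> resultMap =
        JsonUtils.readValue(mapper, (String) result, Map.class);
    MapTreeWalker mapWalker = new MapTreeWalker(resultMap);
    // Walk the nested maps; a missing step yields STATUS_INVALID instead of an exception.
    final String aType =
        mapWalker.step("info")
            .step("aSchemaInfo")
            .step("aType")
            .step("string")
            .<String>get()
            .orElse(STATUS_INVALID);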
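The slide does not show the `MapTreeWalker` utility itself. The sketch below is one hypothetical way to build such a walker with `Optional`, named `MapTreeWalkerSketch` to make clear it is not the presenter's class. It also differs from the slide in one detail: `get` takes a `Class` argument so a leaf of the wrong type yields an empty `Optional` rather than a cast exception.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical null-safe walker over nested Maps, as produced by parsing JSON into a Map. */
final class MapTreeWalkerSketch {
    private final Object node; // current position in the tree; null once a step has missed

    MapTreeWalkerSketch(Object node) {
        this.node = node;
    }

    /** Descends one level; stepping into a missing key or a non-Map yields an empty walker. */
    MapTreeWalkerSketch step(String key) {
        Object child = (node instanceof Map) ? ((Map<?, ?>) node).get(key) : null;
        return new MapTreeWalkerSketch(child);
    }

    /** Returns the current node if it exists and has the requested type. */
    <T> Optional<T> get(Class<T> type) {
        return Optional.ofNullable(node).filter(type::isInstance).map(type::cast);
    }
}
```

Because every miss collapses to an empty walker, the caller needs no try/catch and no intermediate objects beyond one small walker per step.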
2 9
FLINK OBJECT EFFICIENCIES
REDUCE OBJECT
CREATION AND
GARBAGE
COLLECTION
BE CAREFUL OF
SIDE EFFECTS
WITHIN
OPERATORS
StreamExecutionEnvironment env;
env.getConfig().enableObjectReuse();
SEMANTIC ANNOTATIONS
@NonForwardedFields
@ForwardedFields
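The "side effects" warning can be illustrated without Flink. With object reuse enabled, the runtime may hand the same mutable instance to a function repeatedly, so code that keeps a reference to an input record can see it change later. The plain-Java sketch below simulates that situation; the `Reading` class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates a reused record instance to show why holding references is unsafe. */
final class ObjectReuseHazard {
    static final class Reading {
        int value;
        Reading(int value) { this.value = value; }
    }

    /** Buggy: buffers the reused instance itself, so every entry ends up with the last value. */
    static List<Reading> bufferReferences(int[] values) {
        Reading reused = new Reading(0);
        List<Reading> buffer = new ArrayList<>();
        for (int v : values) {
            reused.value = v;   // the "runtime" overwrites the same object
            buffer.add(reused); // the "operator" keeps a reference to it
        }
        return buffer;
    }

    /** Safe: copies the record before buffering it. */
    static List<Reading> bufferCopies(int[] values) {
        Reading reused = new Reading(0);
        List<Reading> buffer = new ArrayList<>();
        for (int v : values) {
            reused.value = v;
            buffer.add(new Reading(reused.value));
        }
        return buffer;
    }
}
```

For the input 1, 2, 3 the first method returns three entries that all read 3, while the second returns 1, 2, 3. The copy costs an allocation, which is the trade-off the slide points at: fewer objects and less garbage collection in exchange for more care inside operators.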
THE FEATURE STORE PROBLEM
3 1
WHAT IS A FEATURE STORE?
“FEATURE” IS A DATA POINT INPUT TO ML MODEL
WE NEED TO:
• STORE ALL THE INPUT DATA POINTS
• SNAPSHOT AT ANY MOMENT IN TIME
• ASSOCIATE WITH A DIAGNOSIS (MODEL OUTPUT)
• HAVE ACCESS TO THE DATA POINTS (API FOR OTHER APPS)
• STORE ALL DATA POINTS FOR ML MODEL TRAINING
Feature Creation
Pipeline
Historical
Feature Store
Online
Feature Store
Prediction
Phase
Model
Training
Phase
Append
Overwrite
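The difference between the two stores in this diagram can be sketched with in-memory collections. This assumes, as the diagram labels suggest, that the historical store appends every data point while the online store overwrites with the latest value. All names are illustrative, and a real deployment would use external storage rather than a `Map` and a `List`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** In-memory illustration of append (historical) versus overwrite (online) feature storage. */
final class FeatureStoreSketch {
    // Online store: one slot per (entity, feature); a new value replaces the old one.
    private final Map<String, String> online = new HashMap<>();
    // Historical store: every data point is kept, for snapshots and model training.
    private final List<String[]> historical = new ArrayList<>();

    void record(String entityId, String featureName, long timestampMillis, String value) {
        online.put(entityId + "|" + featureName, value);
        historical.add(new String[] {entityId, featureName, Long.toString(timestampMillis), value});
    }

    Optional<String> latest(String entityId, String featureName) {
        return Optional.ofNullable(online.get(entityId + "|" + featureName));
    }

    int historySize() {
        return historical.size();
    }
}
```

The online side stays bounded by the number of entity and feature pairs, while the historical side grows with every data point, which is why the two have such different scaling profiles.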
3 2
FIRST TRY: AWS AURORA RDBMS
“HEY IT SHOULD GET US UP TO 10,000 / SECOND”
…THE UPSERT PROBLEM
DIDN’T WORK MUCH PAST 3,000 / SECOND
…SOON OUR INPUT RATE WAS 20,000 / SECOND
3 3
MITIGATION: STORE ONLY AT DIAGNOSIS TIME
STOPPED STORING ALL RAW DATA POINTS
ONLY STORE DATA POINTS ALONGSIDE
DIAGNOSIS
CON: ONLY A SMALL % OF DATA POINTS
CON: DATA NOT CURRENT, ONLY AS OF
RECENT DIAGNOSIS
CON: LIMITED USEFULNESS
Feature
Sink
Model
Execution
Feature Store
Prediction /
Outcome Store
Prediction Store
Sink
3 4
OPTIMIZING THE WRITE PATH
WHICH FLINK FLOW WRITES TO FEATURE STORE?
FEATURE STORE NOT NEEDED FOR TRIGGER
FEATURE STORE NOT NEEDED FOR MODEL EXECUTION
SEPARATE ‘TRANSFORM AND SINK’ FLOW
Online Feature
Store
Feature
Manager
Feature Store
REST Service
Online Feature
Store
Feature Sink
Flow
Feature
Manager
FLINK
LOCAL
STATE
Upstream
Flow
Model
Execution
BEFORE
AFTER
3 5
ALTERNATIVE DATA STORES
CASSANDRA
COST AND COMPLEXITY?
TIME SERIES DB
DRUID
INFLUX
REDIS
STORE ONLY LATEST DATA POINTS?
FLINK QUERYABLE STATE
PRODUCTION HARDENING?
READ VOLUME?
3 6
REDIS
WE ONLY NEED < 7 DAYS OF DATA, ONLY LATEST
VALUE, SO USE REDIS AS THE KV STORE
ACCESSIBLE BY OTHER CONSUMERS OR VIA A
SERVICE
COULDN'T PUT MORE THAN 8,000 / SECOND FOR A
'REASONABLE' CLUSTER SIZE
FLINK SINK CONNECTOR IS NOT CLUSTER-ENABLED
ONE OBJECT PUT PER CALL
JEDIS VS LETTUCE FRAMEWORK AND PIPELINING
SOME OPTIMIZATION IF ABLE TO BATCH REDIS
SHARDS BY PRE-COMPUTING
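One possible reading of "pre-computing" shards is to work out each key's Redis Cluster hash slot on the client, so that writes bound for the same slot can be grouped and pipelined together. Redis Cluster assigns a key to slot CRC16(key) mod 16384, using the XMODEM variant of CRC16. The sketch below only does that grouping: the class name is invented, hash tags (`{...}`) are ignored, and it does not talk to Redis or use Jedis or Lettuce.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups keys by Redis Cluster hash slot so each group can be sent as one batch. */
final class RedisSlotSketch {
    static final int SLOT_COUNT = 16384;

    /** CRC16/XMODEM: polynomial 0x1021, initial value 0, no reflection. */
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** Hash slot for a key, ignoring hash tags (a simplification made for this sketch). */
    static int slotFor(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    /** Buckets keys by slot; mapping slots to cluster nodes is left to the Redis client. */
    static Map<Integer, List<String>> groupBySlot(List<String> keys) {
        Map<Integer, List<String>> bySlot = new TreeMap<>();
        for (String key : keys) {
            bySlot.computeIfAbsent(slotFor(key), s -> new ArrayList<>()).add(key);
        }
        return bySlot;
    }
}
```

Grouping by slot is finer-grained than grouping by node; a real sink would merge the slot buckets that live on the same node before pipelining.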
3 7
FLINK QUERYABLE STATE
REUSE OUR FEATURE REQUEST/RESPONSE
FRAMEWORK
SCALABLE, FAST
OUR READ LOAD IS LOW (10 / SECOND)
EXPENSIVE (RELATIVELY)
NOT A “PRODUCTION” CAPABILITY
NOT A “DATABASE” TECHNOLOGY
3 8
QUERYABLE STATE - ONLINE FEATURE STORE
Feature
Manager
FLINK
LOCAL
STATE
Upstream
Flow
Model
Execution
Online Feature
Store
FLINK
LOCAL
STATE
Feature
REST Service
(Queryable
State Client)
REQUESTING
APPLICATION
Upstream
Flow
Customer
Experience
Data Point
Producers
ONLINE FEATURE STORE PATH
MODEL EXECUTION PATH
THE VOLUME PROBLEM
4 0
RAMPED UP PRODUCTION – JULY 2018
4 1
ADDED NEW DATA POINT TYPES
600+
Million
4 4
INCREASED DATA POINT FREQUENCY
ONCE EVERY 24 HOURS
ONCE EVERY 4 HOURS
ONCE EVERY 5 MINUTES ???
4 5
FLINK CLUSTER COULDN’T KEEP UP WITH KAFKA VOLUME
SYMPTOM: KAFKA LATENCY INCREASING
[Chart: consumer lag over time]
4 6
KAFKA OPTIMIZATION
INCREASE PARTITIONS
1 KAFKA PARTITION
READ BY
1 FLINK TASK THREAD
MATCH PARALLELISM TO
PARTITIONS (OR EVEN
MULTIPLE)
100 KAFKA PARTITIONS
50 SLOT PARALLELISM
50 KAFKA PARTITIONS
50 SLOT PARALLELISM
ADJUST KAFKA CONFIGURATION
# KAFKA CONSUMER SETTINGS (OLD → NEW)
MAX.POLL.RECORDS = 500 → 10000
FETCH.MIN.BYTES = 1 → 102400
FETCH.MAX.WAIT.MS = 500 → 1000
# KAFKA PRODUCER SETTINGS (OLD → NEW)
BUFFER.MEMORY = 268435456
BATCH.SIZE = 131072 → 524288
LINGER.MS = 0 → 50
REQUEST.TIMEOUT.MS = 90000 → 600000
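The settings on this slide could be carried into a Flink job as plain `java.util.Properties`, roughly as sketched below. The sketch assumes that where the slide shows two numbers for a setting, the second is the tuned value. The property keys are the standard Kafka client names; the class and method names are invented, and the divisibility check is just one simple way to express "match parallelism to partitions (or even multiple)".

```java
import java.util.Properties;

/** Kafka client overrides from the slide, plus a partition/parallelism sanity check. */
final class KafkaTuningSketch {
    static Properties consumerOverrides() {
        Properties p = new Properties();
        p.setProperty("max.poll.records", "10000");  // fetch larger batches per poll
        p.setProperty("fetch.min.bytes", "102400");  // wait for ~100 KB before responding
        p.setProperty("fetch.max.wait.ms", "1000");  // ...but no longer than one second
        return p;
    }

    static Properties producerOverrides() {
        Properties p = new Properties();
        p.setProperty("buffer.memory", "268435456"); // 256 MiB send buffer
        p.setProperty("batch.size", "524288");       // 512 KiB batches
        p.setProperty("linger.ms", "50");            // allow batches time to fill
        p.setProperty("request.timeout.ms", "600000");
        return p;
    }

    /** True when every subtask would read the same number of partitions. */
    static boolean partitionsSpreadEvenly(int partitions, int parallelism) {
        return parallelism > 0 && partitions % parallelism == 0;
    }
}
```

Both pairings on the slide, 100 partitions with 50 slots and 50 partitions with 50 slots, pass the check; something like 60 partitions with 50 slots would leave some subtasks reading twice as much as others.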
4 7
MODEL EXECUTION ORCHESTRATION
MODEL EXECUTION BACKUP PROBLEM
GENERAL SLOWNESS BACKS UP MODEL EXECUTION PIPELINE
Model Execution
AsyncDataStream
60 second
timeout
Model
Execution
REST Service
Feature Assembly
5 Minute Global Window
Feature Getter
4 8
CLUSTER NETWORK I/O TRAFFIC
[Diagram: Kafka brokers hold partitions 1 to 6. Six FlinkKafkaConsumer subtasks spread across nodes 1 to 3 read them (NETWORK IN). A keyBy() shuffle then sends records to six KeyedStreamOperator subtasks across the same nodes (NETWORK OUT).]
4 9
GOOD READ: HTTPS://WWW.VERVERICA.COM/BLOG/HOW-TO-SIZE-YOUR-APACHE-FLINK-CLUSTER-GENERAL-GUIDELINES
CLUSTER NETWORK I/O VOLUME
(10,000 MESSAGES / SECOND) * (1KB / MESSAGE) =
10 MB / SECOND
100 MBITS/SEC NETWORK
SOLUTION – LARGER NODE TYPES WITH MORE
VCPU AND HIGHER PARALLELISM… FEWER
NODES IN CLUSTER
SWEET SPOT SEEMED TO BE ~20 NODES
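The sizing arithmetic on this slide is easy to reproduce. The sketch below uses decimal units (1 MB = 1000 KB) and invented names. It covers payload only, so it yields about 80 Mbit/s for the slide's example, in the same range as the 100 Mbit/s figure quoted there; protocol overhead and the keyBy() shuffle traffic between nodes come on top.

```java
/** Back-of-the-envelope network volume for a stream of fixed-size messages. */
final class NetworkSizing {
    /** Payload volume in MB/s, using decimal units (1 MB = 1000 KB). */
    static double megabytesPerSecond(long messagesPerSecond, double kilobytesPerMessage) {
        return messagesPerSecond * kilobytesPerMessage / 1000.0;
    }

    /** Converts MB/s to Mbit/s. */
    static double megabitsPerSecond(double megabytesPerSecond) {
        return megabytesPerSecond * 8.0;
    }
}
```

Packing more parallelism into fewer, larger nodes keeps more of the shuffle on-box, which is consistent with the slide's move to larger node types.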
THE CUSTOMER STATE PROBLEM
5 1
KEEPING TRACK OF DATA POINT STATE
25+ MILLION HIGH SPEED INTERNET CUSTOMERS
10+ DATA POINT TYPES
TWO FLINK STATES TO MANAGE:
RED / GREEN “TRIGGER”
• MAPSTATE() / KEYBY(ENTITY ID)
• [ENTITY ID] (DATA POINT TYPE, STATE)
• 25 MILLION CUSTOMERS
• 10+ DATA POINT TYPES
• AVG 33 BYTES PER ENTITY ID
• ~1GB TOTAL
QUERYABLE STATE FEATURE STORE
• UP TO 15KB (JSON) PER DATA POINT
• 10KB * 25 MILLION = 250 GB
• 1KB * 25 MILLION = 25 GB
• APPROXIMATE TOTAL = 1 TB
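The state-size estimates on this slide follow from multiplying entity count by bytes per entity. A small helper makes the figures easy to re-run with other assumptions; the names are illustrative and decimal units (1 GB = 10^9 bytes) are used.

```java
/** Rough Flink state sizing: number of entities times bytes held per entity. */
final class StateSizing {
    /** Total size in decimal gigabytes (1 GB = 1e9 bytes). */
    static double gigabytes(long entities, double bytesPerEntity) {
        return entities * bytesPerEntity / 1e9;
    }
}
```

For 25 million customers this gives 250 GB at 10 KB each, 25 GB at 1 KB each, and about 0.8 GB at 33 bytes each, in line with the slide's 250 GB, 25 GB, and ~1 GB figures. Summed over ten or more data point types of varying size, the feature store state lands in the region of the quoted 1 TB.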
THE CHECKPOINT PROBLEM
5 3
CHECKPOINT TUNING
FEATURE MANAGER – UP TO 1TB OF STATE
2.5 MINUTES WRITING A CHECKPOINT EVERY 10 MINUTES
TURNED OFF FOR A WHILE!
(RATIONALE: WOULD ONLY TAKE A FEW HOURS TO RECOVER ALL STATE)
INCREMENTAL CHECKPOINTING HELPED
LONG ALIGNMENT TIMES ON HIGH PARALLELISM
REDUCED THE SIZE OF STATE FOR 4 AND 24 HOUR WINDOWS USING FEATURE LOOKUP TABLE
THE HA PROBLEM
5 5
AMAZON EMR AND HIGH AVAILABILITY
EMR FLINK CLUSTER
SINGLE AVAILABILITY ZONE (AZ)
SINGLE MASTER NODE
NOT HA
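[Diagram: one AWS region with two availability zones, AZ1 and AZ2. EMR cluster 1 and EMR cluster 2 each have one master node (M) and four task nodes (T). Checkpoints are stored in S3.]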
5 6
AMAZON AWS EMR
SESSIONS KEPT DYING
TMP DIR IS PERIODICALLY CLEARED
BREAKS ‘BLOB STORE’ ON LONG RUNNING JOBS
SOLUTION - CONFIGURE TO NOT BE IN /TMP
jobmanager.web.tmpdir
blob.storage.directory
EMR FLINK VERSION LAGS
APACHE RELEASE BY 1-2 MONTHS
5 7
TECHNOLOGY OPTIONS
FLINK ON KUBERNETES
AMAZON EKS (ELASTIC KUBERNETES SERVICE)
SELF-MANAGED KUBERNETES ON EC2
DATA ARTISANS VERVERICA PLATFORM
Bonus:
Reduced Cost / Overhead by
having fewer Master Nodes
Less overhead for smaller (low
parallelism) flows
THE REALLY HIGH VOLUME AND FEATURE STORE PROBLEM #2
6 1
DATA POINT VOLUME MAY–NOVEMBER 2018
OCTOBER 2018 – 1 BILLION
NOVEMBER 2018 – 2 BILLION
6 2
LET’S ADD ANOTHER 3 BILLION!
15 MILLION DEVICES
3 BILLION DATA POINTS
6 3
SOLUTION APPROACH
1. ISOLATE TRIGGER FOR HIGH-VOLUME DATA POINT TYPES
2. ISOLATE ONLINE FEATURE STORE FOR HIGH-VOLUME DATA POINT TYPES
3. ISOLATE DEDICATED FEATURE STORAGE FLOW TO HIGH-SPEED FEATURE STORE
4. SEPARATE HISTORICAL FEATURE STORAGE (S3 WRITER) FROM REAL TIME MODEL EXECUTION FLOWS
6 4
1: SEPARATE HIGH VOLUME TRIGGER FLOWS
State Flow
And Trigger
FLINK
LOCAL
STATE
HIGH VOLUME
State / Trigger 2
HIGH
VOLUME
Data Point
Producer
2
FLINK
LOCAL
STATE
HIGH VOLUME
State / Trigger 1
HIGH
VOLUME
Data Point
Producer
1
FLINK
LOCAL
STATE
ONLINE FEATURE STORE
MODEL EXECUTION
3B
IN
50M
OUT
6 5
2: FEDERATED ONLINE FEATURE STORE
Feature
REST Service
(Queryable
State Client)
REQUESTING
APPLICATION
Online Feature
Store
FLINK
LOCAL
STATE
Online Feature
Store 2
HIGH
VOLUME
Data Point
Producer
2
FLINK
LOCAL
STATE
Online Feature
Store 1
HIGH
VOLUME
Data Point
Producer
1
FLINK
LOCAL
STATE
FEDERATED FEATURE STORE
Model/ Feature
Metadata
6 6
3+4: HISTORICAL FEATURE STORE
Feature
REST Service
(Queryable
State Client)
REQUESTING
APPLICATION
Online Feature
Store
HIGH
VOLUME
Data Point
Producer
1
FLINK
LOCAL
STATE
Historical
Feature Store
(S3)
HIGH VOLUME
State / Trigger
FLINK
LOCAL
STATE
Feature
Producer
Feature
Sink
MODEL EXECUTION
6 7
MODEL EXECUTION
FEDERATED
ONLINE
FEATURE
STORE
State Flow
And Trigger
CURRENT ARCHITECTURE
High Volume
Feature Producer
Feature Sink
State Trigger
Prediction /
Outcome Store
Prediction
Store
REST Service
Feature Store
REST Service
Customer
Experience
Data Point
Producers
FLINK
LOCAL
STATE
HIGH
VOLUME
Data Point
Producer
High Volume
Feature Producer
Feature Sink
State Trigger
HIGH
VOLUME
Data Point
Producer
Historical
Feature Store
Feature
Sink
FEATURE
MANAGER
Feature
Manager
Feature
Assembly
Model
Execution
Prediction
Sink
Feature
Manager
Feature
Assembly
6 8
WHERE ARE WE TODAY
INDICATORS: 725+ MILLION DATA POINTS PER DAY
HIGH-VOLUME INDICATORS: 6 BILLION PER DAY
TOTAL: ~7 BILLION DATA POINTS PER DAY
6 9
WHERE ARE WE TODAY – FLINK CLUSTERS
CLUSTERS
14
INSTANCES
150
VCPU
1100
RAM
5.8 TB
7 0
FUTURE WORK
EXPAND
CUSTOMER EXPERIENCE
INDICATORS TO OTHER
COMCAST PRODUCTS
IMPROVED ML MODELS
BASED ON RESULTS AND
CONTINUOUS
RETRAINING
MODULAR
MODEL EXECUTION
VIA KUBEFLOW
AND GRPC CALLS
7 1
WE’RE
HIRING!
PHILADELPHIA
WASHINGTON, D.C.
MOUNTAIN VIEW
DENVER
THANK YOU!
Flink Forward San Francisco 2019: Adventures in Scaling from Zero to 5 Billion Data Points per Day - Dave Torok

Contenu connexe

Tendances

Building a GraphQL API in PHP
Building a GraphQL API in PHPBuilding a GraphQL API in PHP
Building a GraphQL API in PHPAndrew Rota
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Stormboorad
 
Using Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systemsUsing Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systemsconfluent
 
Tutorial: Building a GraphQL API in PHP
Tutorial: Building a GraphQL API in PHPTutorial: Building a GraphQL API in PHP
Tutorial: Building a GraphQL API in PHPAndrew Rota
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at UberSudhir Tonse
 
Real-Time Analytics at Uber Scale
Real-Time Analytics at Uber ScaleReal-Time Analytics at Uber Scale
Real-Time Analytics at Uber ScaleSingleStore
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling WaterSri Ambati
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedDatabricks
 
Streaming Analytics in Uber
Streaming Analytics in Uber Streaming Analytics in Uber
Streaming Analytics in Uber Danny Yuan
 
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQLCrossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQLconfluent
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on AndroidTomáš Kypta
 
Rediscovering the Value of Apache Kafka® in Modern Data Architecture
Rediscovering the Value of Apache Kafka® in Modern Data ArchitectureRediscovering the Value of Apache Kafka® in Modern Data Architecture
Rediscovering the Value of Apache Kafka® in Modern Data Architectureconfluent
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentJulian Hyde
 
How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...Paul Brebner
 

Tendances (20)

Building a GraphQL API in PHP
Building a GraphQL API in PHPBuilding a GraphQL API in PHP
Building a GraphQL API in PHP
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Storm
 
Using Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systemsUsing Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systems
 
Tutorial: Building a GraphQL API in PHP
Tutorial: Building a GraphQL API in PHPTutorial: Building a GraphQL API in PHP
Tutorial: Building a GraphQL API in PHP
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Stream Computing & Analytics at Uber
Stream Computing & Analytics at UberStream Computing & Analytics at Uber
Stream Computing & Analytics at Uber
 
Real-Time Analytics at Uber Scale
Real-Time Analytics at Uber ScaleReal-Time Analytics at Uber Scale
Real-Time Analytics at Uber Scale
 
Interactive Session on Sparkling Water
Interactive Session on Sparkling WaterInteractive Session on Sparkling Water
Interactive Session on Sparkling Water
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
 
Streaming Analytics in Uber
Streaming Analytics in Uber Streaming Analytics in Uber
Streaming Analytics in Uber
 
Scalding
ScaldingScalding
Scalding
 
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQLCrossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on Android
 
Rediscovering the Value of Apache Kafka® in Modern Data Architecture
Rediscovering the Value of Apache Kafka® in Modern Data ArchitectureRediscovering the Value of Apache Kafka® in Modern Data Architecture
Rediscovering the Value of Apache Kafka® in Modern Data Architecture
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 

Similaire à Flink Forward San Francisco 2019: Adventures in Scaling from Zero to 5 Billion Data Points per Day - Dave Torok

LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateRajit Saha
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationVengata Guruswamy
 
Ft10 de smet
Ft10 de smetFt10 de smet
Ft10 de smetnkaluva
 
Going Native: Leveraging the New JSON Native Datatype in Oracle 21c
Going Native: Leveraging the New JSON Native Datatype in Oracle 21cGoing Native: Leveraging the New JSON Native Datatype in Oracle 21c
Going Native: Leveraging the New JSON Native Datatype in Oracle 21cJim Czuprynski
 
Evolution of a big data project
Evolution of a big data projectEvolution of a big data project
Evolution of a big data projectMichael Peacock
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...InfluxData
 
How to solve complex business requirements with Oracle Data Integrator?
How to solve complex business requirements with Oracle Data Integrator?How to solve complex business requirements with Oracle Data Integrator?
How to solve complex business requirements with Oracle Data Integrator?Gurcan Orhan
 
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthSRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthAmazon Web Services
 
Agile Database Development with JSON
Agile Database Development with JSONAgile Database Development with JSON
Agile Database Development with JSONChris Saxon
 
Working With XML in IDS Applications
Working With XML in IDS ApplicationsWorking With XML in IDS Applications
Working With XML in IDS ApplicationsKeshav Murthy
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTLaura Chiticariu
 
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu...Vinu Charanya
 
Using SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced AnalyticsUsing SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced AnalyticsTeradata Aster
 
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward
 
How lagom helps to build real world microservice systems
How lagom helps to build real world microservice systemsHow lagom helps to build real world microservice systems
How lagom helps to build real world microservice systemsMarkus Eisele
 
Microservices Manchester: How Lagom Helps to Build Real World Microservice Sy...
Microservices Manchester: How Lagom Helps to Build Real World Microservice Sy...Microservices Manchester: How Lagom Helps to Build Real World Microservice Sy...
Microservices Manchester: How Lagom Helps to Build Real World Microservice Sy...OpenCredo
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 


Flink Forward San Francisco 2019: Adventures in Scaling from Zero to 5 Billion Data Points per Day - Dave Torok

  • 1. ADVENTURES IN SCALING FROM ZERO TO 5 BILLION DATA POINTS PER DAY. Dave Torok, Distinguished Architect, Comcast Corporation. 2 April 2019, Flink Forward – San Francisco 2019
  • 2. COMCAST CUSTOMER RELATIONSHIPS: 30.3 MILLION OVERALL CUSTOMER RELATIONSHIPS AT 2018 YEAR END; 25.1 MILLION RESIDENTIAL HIGH-SPEED INTERNET CUSTOMERS AT 2018 YEAR END; 1.2 MILLION RESIDENTIAL HIGH-SPEED INTERNET CUSTOMER NET ADDITIONS IN 2018
  • 3. DELIVER THE ULTIMATE CUSTOMER EXPERIENCE. IS THE CUSTOMER HAVING A GOOD EXPERIENCE FOR HIGH SPEED DATA (HSD) SERVICE? IF THE CUSTOMER ENGAGES US DIGITALLY, CAN WE OFFER A SELF-SERVICE SOLUTION? IF THERE IS AN ISSUE, CAN WE OFFER OUR AGENTS AND TECHNICIANS A DIAGNOSIS TO HELP SOLVE THE PROBLEM MORE QUICKLY? REDUCE THE TIME TO RESOLVE ISSUES. REDUCE COST TO THE BUSINESS AND THE CUSTOMER.
  • 4. HIGH LEVEL CONCEPT. Diagram components: Customer Experience Diagnosis Platform; Service Delivery Systems; Predictions / Diagnosis / Treatments; CUSTOMER ENGAGEMENT PLATFORM; CUSTOMER EXPERIENCE INDICATORS. Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
  • 5. CUSTOMER EXPERIENCE INDICATORS
  • 6. NINE ADVENTURES IN SCALING: THE TRIGGER AND DIAGNOSIS PROBLEM; THE FEATURE STORE PROBLEM; THE CHECKPOINT PROBLEM; THE REST PROBLEM; THE VOLUME PROBLEM; THE TRIGGER AND DIAGNOSIS PROBLEM; THE INEFFICIENT OBJECT HANDLING PROBLEM; THE CUSTOMER STATE PROBLEM; THE REALLY HIGH VOLUME AND FEATURE STORE PROBLEM #2
  • 7. ML FRAMEWORK ARCHITECTURE – 2018 (”EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE” – SF FLINK FORWARD 2018). Diagram components: REQUESTING APPLICATION (×2); Feature Creation Pipelines; ML FRAMEWORK; Online Feature Store; Model / Feature Metadata; Feature Check Async IO; Feature Assembly; Model Execution; Prediction / Outcome Store; Prediction Store Sink; Prediction Store REST Service; Feature Store REST Service; Model Execution Request REST Service; Model Prediction REST Service; Model Execute Async IO; FEATURE SOURCES / DATA AND SERVICES
  • 9. MAY 2018 – INITIAL VOLUMES. INDICATOR #1: 9 MILLION; INDICATOR #2: 166 MILLION; INDICATOR #3: 1.2 MILLION; INDICATOR #4: 2,500 (SMALL TRIAL); TOTAL: 175 MILLION / DAY
  • 10. 175 MILLION PREDICTIONS? Diagram components: 175 MILLION DATA POINTS; ??????; PREDICTION / DIAGNOSIS OUTPUT STORE
  • 11. INTRODUCE A ‘TRIGGER’ CONDITION: GREEN GREEN GREEN RED GREEN GREEN
  • 12. INTRODUCE A ‘TRIGGER’ CONDITION: GREEN GREEN GREEN RED GREEN GREEN GREEN RED GREEN RED
  • 13. INTRODUCE A ‘TRIGGER’ CONDITION: GREEN GREEN GREEN RED GREEN GREEN GREEN RED GREEN RED. DETECT STATE CHANGE
  • 14. INTRODUCE A ‘TRIGGER’ CONDITION: GREEN GREEN GREEN RED GREEN GREEN GREEN GREEN GREEN RED GREEN RED GREEN RED GREEN RED. DETECT STATE CHANGE; SEND FOR DIAGNOSIS
  • 15. INTRODUCE FLINK LOCAL STATE: STATE MANAGER AND TRIGGER. Diagram components: Customer Experience Data Point Producers; 175 MILLION DATA POINTS; State Flow And Trigger; FLINK LOCAL STATE; “A FEW MILLION” TRIGGERS; Customer Experience Diagnosis Platform; PREDICTION / DIAGNOSIS OUTPUT STORE
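The trigger idea on slides 11 to 15 can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name `StateChangeTrigger`, the `Status` enum and the `HashMap` (standing in for Flink keyed local state) are hypothetical and do not come from the talk. It is also an assumption that any change of status fires a trigger and that a first observation fires only when it is RED.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: flag a data point only when a customer's status changes. */
final class StateChangeTrigger {
    enum Status { GREEN, RED }

    // Stand-in for Flink keyed local state: last status seen per customer key.
    private final Map<String, Status> lastStatus = new HashMap<>();

    /** Returns true when this data point should be sent for diagnosis. */
    boolean isStateChange(String customerKey, Status current) {
        Status previous = lastStatus.put(customerKey, current);
        if (previous == null) {
            // Assumption: with no history, only a RED reading is worth diagnosing.
            return current == Status.RED;
        }
        return previous != current;
    }

    public static void main(String[] args) {
        StateChangeTrigger trigger = new StateChangeTrigger();
        Status[] stream = {
            Status.GREEN, Status.GREEN, Status.GREEN,
            Status.RED, Status.GREEN, Status.GREEN
        };
        for (Status s : stream) {
            boolean fire = trigger.isStateChange("customer-1", s);
            System.out.println(s + " -> " + (fire ? "SEND FOR DIAGNOSIS" : "ignore"));
        }
    }
}
```

Because unchanged readings are dropped, the downstream diagnosis volume depends on how often states change rather than on the raw data point rate.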
  • 17. ML FRAMEWORK ARCHITECTURE – 2018 (”EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE” – SF FLINK FORWARD 2018). Diagram components: REQUESTING APPLICATION (×2); Feature Creation Pipelines; ML FRAMEWORK; Online Feature Store; Model / Feature Metadata; Feature Check Async IO; Feature Assembly; Model Execution; Prediction / Outcome Store; Prediction Store Sink; Prediction Store REST Service; Feature Store REST Service; Model Execution Request REST Service; Model Prediction REST Service; Model Execute Async IO; FEATURE SOURCES / DATA AND SERVICES
  • 18. INITIAL PLATFORM ARCHITECTURE. Diagram components: Customer Experience Data Point Producers; State Flow And Model Execution Trigger; FLINK LOCAL STATE; ML FRAMEWORK; Online Feature Store; Model / Feature Metadata; Feature Getter Async IO; Feature Assembly; Feature Sink; Model Execution; Model Execute Async IO; Prediction / Outcome Store; Prediction Store Sink; Prediction Store REST Service; Feature Store REST Service; Model Execution Request REST Service; Model Prediction REST Service; REQUESTING APPLICATION; USE CASE
  • 19. ORIGINAL ML FRAMEWORK ARCHITECTURE: EVERY COMMUNICATION WAS A REST SERVICE; ML FRAMEWORK ISOLATED FROM DATA POINT USE CASE BY DESIGN; FLINK ASYNC + REST IS DIFFICULT TO SCALE ELASTICALLY. Diagram labels: HTTP REQUEST; RESPONSE
  • 20. TUNING ASYNC I/O – LATENCY AND VOLUME. MAX THROUGHPUT PER SECOND = (SLOTS) * (THREADS) * (1000 / TASK DURATION MSEC); (12 SLOTS) * (20 THREADS) * (1000 / 300) = 800 / SECOND. APACHE HTTP CLIENT / EXECUTOR THREADS: setMaxTotal; setDefaultMaxPerRoute (THREADS). FLINK ASYNC MAX OUTSTANDING: RichAsyncFunction “Capacity”; AsyncDataStream.unorderedWait() with max capacity parameter. RATE THROTTLING:
      // Split the configured global rate across the parallel subtasks, at least 1 each.
      if (numberOfRequestsPerPeriod > 0) {
          int numberOfProcesses = getRuntimeContext().getNumberOfParallelSubtasks();
          numberOfRequestsPerPeriodPerProcess =
              Math.max(1, numberOfRequestsPerPeriod / numberOfProcesses);
      }
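The throughput bound and the per-subtask throttle on slide 20 can be checked with a small worked example. This is a sketch only: the class and method names are hypothetical, and the figures are the ones quoted on the slide (12 slots, 20 threads, 300 ms per call).

```java
/** Hypothetical helper reproducing the slide's arithmetic. */
final class AsyncThroughput {

    /** Upper bound on calls per second: slots * threads * (1000 / task duration in ms). */
    static long maxThroughputPerSecond(int slots, int threadsPerSlot, long taskDurationMs) {
        // Multiply before dividing so integer division does not truncate 1000 / 300 to 3.
        return (long) slots * threadsPerSlot * 1000L / taskDurationMs;
    }

    /** Per-subtask share of a global rate limit, never below 1, as in the slide's snippet. */
    static int perSubtaskLimit(int requestsPerPeriod, int parallelSubtasks) {
        return Math.max(1, requestsPerPeriod / parallelSubtasks);
    }

    public static void main(String[] args) {
        System.out.println(maxThroughputPerSecond(12, 20, 300)); // prints 800
        System.out.println(perSubtaskLimit(800, 12));            // prints 66
        System.out.println(perSubtaskLimit(5, 12));              // prints 1
    }
}
```

Note that the integer division in the throttle means the subtask limits may not sum to the configured global rate: 12 subtasks at 66 each give 792 rather than 800, and the floor of 1 can exceed a very small global limit.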
  • 21. REPLACE REST CALLS WITH KAFKA (I). Diagram components: Customer Experience Data Point Producers; State Flow And Model Execution Trigger; FLINK LOCAL STATE; Online Feature Store; Model / Feature Metadata; Feature Getter Async IO; Feature Assembly; Feature Sink (JDBCAppendTableSink); Model Execution; Prediction Store Sink (JDBCAppendTableSink); Prediction / Outcome Store; Prediction Store REST Service; Feature Store REST Service; REQUESTING APPLICATION
  • 22. REPLACE REST CALL WITH KAFKA (II). Diagram components: Customer Experience Data Point Producers; State Flow And Trigger; Feature Store Manager; Online Feature Store; Model / Feature Metadata; Feature Assembly; Feature Sink Flow (JDBCAppendTableSink); Model Execution; Prediction Store Sink (JDBCAppendTableSink); Prediction / Outcome Store; FLINK LOCAL STATE (×4)
  • 24. HAS THIS EVER HAPPENED TO YOU? New “JOLT” JSON Transformer (but don’t cache it): new ObjectMapper() (but don’t cache it); parse JSON string into Map; load and parse “Transform” JSON from the resource path (but don’t cache it); serialize transformed Map back into string result.
      // Anti-pattern: a new transformer (and its parsing work) for every record.
      String transformedJson = new JoltJsonTransformer()
          .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH));
      JSONObject inputPayload = new JSONObject(transformedJson);
      try {
          aType = inputPayload.getJSONObject("info")
              .getJSONObject("aType")
              .getString("string");
      } catch (Exception e) {
          // ignore, invalid object
          aType = STATUS_INVALID;
      }
  • 25. HAS THIS EVER HAPPENED TO YOU? New “JOLT” JSON Transformer (but don’t cache it): new ObjectMapper() (but don’t cache it); parse JSON string into Map; load and parse “Transform” JSON from the resource path (but don’t cache it); serialize transformed Map back into string result. Then let’s make a NEW JSONObject from that string.
      // Anti-pattern: a new transformer (and its parsing work) for every record.
      String transformedJson = new JoltJsonTransformer()
          .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH));
      JSONObject inputPayload = new JSONObject(transformedJson);
      try {
          aType = inputPayload.getJSONObject("info")
              .getJSONObject("aType")
              .getString("string");
      } catch (Exception e) {
          // ignore, invalid object
          aType = STATUS_INVALID;
      }
  • 26. HAS THIS EVER HAPPENED TO YOU? New “JOLT” JSON Transformer (but don’t cache it): new ObjectMapper() (but don’t cache it); parse JSON string into Map; load and parse “Transform” JSON from the resource path (but don’t cache it); serialize transformed Map back into string result. Then let’s make a NEW JSONObject from that string. Go deep into the path to get value.
      // Anti-pattern: a new transformer (and its parsing work) for every record.
      String transformedJson = new JoltJsonTransformer()
          .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH));
      JSONObject inputPayload = new JSONObject(transformedJson);
      try {
          aType = inputPayload.getJSONObject("info")
              .getJSONObject("aType")
              .getString("string");
      } catch (Exception e) {
          // ignore, invalid object
          aType = STATUS_INVALID;
      }
  • 27. HAS THIS EVER HAPPENED TO YOU? New “JOLT” JSON Transformer (but don’t cache it): new ObjectMapper() (but don’t cache it); parse JSON string into Map; load and parse “Transform” JSON from the resource path (but don’t cache it); serialize transformed Map back into string result. Then let’s make a NEW JSONObject from that string. Go deep into the path to get value. Cast exception? STATUS_INVALID.
      // Anti-pattern: a new transformer (and its parsing work) for every record.
      String transformedJson = new JoltJsonTransformer()
          .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH));
      JSONObject inputPayload = new JSONObject(transformedJson);
      try {
          aType = inputPayload.getJSONObject("info")
              .getJSONObject("aType")
              .getString("string");
      } catch (Exception e) {
          // ignore, invalid object
          aType = STATUS_INVALID;
      }
  • 28. SLIGHTLY MORE EFFICIENT SOLUTION: cache our new ObjectMapper(); parse JSON string into Map ONCE; lightweight Map traversal utility; Java Stream syntax; Optional.ofNullable; Optional.orElse.
      // Create the ObjectMapper once and reuse it for every record.
      if (mapper == null) mapper = new ObjectMapper();
      Map<?, ?> resultMap = JsonUtils.readValue(mapper, (String) result, Map.class);
      MapTreeWalker mapWalker = new MapTreeWalker(resultMap);
      // Walk the nested maps; fall back to STATUS_INVALID if any step is missing.
      final String aType = mapWalker.step("info")
          .step("aSchemaInfo")
          .step("aType")
          .step("string")
          .<String>get()
          .orElse(STATUS_INVALID);
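The slides name a `MapTreeWalker` utility but do not show its implementation. The following is a minimal sketch of how such a walker could be written with `Optional`, offered as an assumption rather than the author's code. In particular, `get(Class<T>)` is a hypothetical variation on the slide's `<String>get()` that avoids an unchecked cast by filtering on the expected type.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical null-safe walker over nested Maps, e.g. as produced by parsing JSON once. */
final class MapTreeWalker {
    private final Object current; // null once any step has failed

    MapTreeWalker(Object root) {
        this.current = root;
    }

    /** Descends one level; yields an empty walker if the current node is not a Map or lacks the key. */
    MapTreeWalker step(String key) {
        Object next = (current instanceof Map) ? ((Map<?, ?>) current).get(key) : null;
        return new MapTreeWalker(next);
    }

    /** Returns the current node if it is present and of the expected type. */
    <T> Optional<T> get(Class<T> type) {
        return Optional.ofNullable(current).filter(type::isInstance).map(type::cast);
    }

    public static void main(String[] args) {
        Map<String, Object> aType = new HashMap<>();
        aType.put("string", "HSD");
        Map<String, Object> info = new HashMap<>();
        info.put("aType", aType);
        Map<String, Object> root = new HashMap<>();
        root.put("info", info);

        MapTreeWalker walker = new MapTreeWalker(root);
        System.out.println(walker.step("info").step("aType").step("string")
            .get(String.class).orElse("INVALID")); // prints HSD
        System.out.println(walker.step("info").step("missing").step("string")
            .get(String.class).orElse("INVALID")); // prints INVALID
    }
}
```

A missing key or a value of the wrong type falls through to the default without throwing, so no exception is raised for the common "invalid object" case.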
  • 29. FLINK OBJECT EFFICIENCIES. REDUCE OBJECT CREATION AND GARBAGE COLLECTION. BE CAREFUL OF SIDE EFFECTS WITHIN OPERATORS. StreamExecutionEnvironment env; env.getConfig().enableObjectReuse(); SEMANTIC ANNOTATIONS: @NonForwardedFields, @ForwardedFields
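The "side effects" warning can be illustrated without Flink. The sketch below is plain Java, not Flink code. It assumes a runtime that hands the same mutable instance to an operator for every element, which is the kind of behaviour object reuse permits. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ObjectReuseHazard {
    /** Mutable record, standing in for an event that a runtime might reuse. */
    static final class Reading {
        int value;
    }

    /** Buggy: keeps references to an instance that is overwritten for each element. */
    static List<Reading> bufferByReference(int[] inputs) {
        Reading reused = new Reading();           // one instance for every element
        List<Reading> buffer = new ArrayList<>();
        for (int v : inputs) {
            reused.value = v;                     // the "runtime" overwrites in place
            buffer.add(reused);                   // the operator caches the reference
        }
        return buffer;                            // every entry now shows the last value
    }

    /** Safe: copies the fields it needs before the instance is overwritten. */
    static List<Reading> bufferByCopy(int[] inputs) {
        Reading reused = new Reading();
        List<Reading> buffer = new ArrayList<>();
        for (int v : inputs) {
            reused.value = v;
            Reading copy = new Reading();
            copy.value = reused.value;            // defensive copy
            buffer.add(copy);
        }
        return buffer;
    }
}
```

An operator that buffers, caches, or mutates its input is the case to audit before enabling object reuse.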
  • 31. WHAT IS A FEATURE STORE? "FEATURE" IS A DATA POINT INPUT TO ML MODEL. WE NEED TO: • STORE ALL THE INPUT DATA POINTS • SNAPSHOT AT ANY MOMENT IN TIME • ASSOCIATE WITH A DIAGNOSIS (MODEL OUTPUT) • HAVE ACCESS TO THE DATA POINTS (API FOR OTHER APPS) • STORE ALL DATA POINTS FOR ML MODEL TRAINING. [Diagram labels: Feature Creation Pipeline; Historical Feature Store; Online Feature Store; Prediction Phase; Model Training Phase; Append; Overwrite]
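The diagram labels "Append" and "Overwrite" suggest two write semantics. The sketch below assumes that the online store overwrites (latest value per entity and data point type) and the historical store appends (every value kept, which allows a snapshot as of any time). That mapping, the in-memory collections, and all names are assumptions for illustration, not the platform's design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FeatureStoresSketch {
    static final class DataPoint {
        final String entityId;
        final String type;
        final long timestamp;
        final String value;

        DataPoint(String entityId, String type, long timestamp, String value) {
            this.entityId = entityId;
            this.type = type;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    private final Map<String, DataPoint> online = new HashMap<>();  // overwrite: last write per key
    private final List<DataPoint> historical = new ArrayList<>();   // append: every point kept

    void write(DataPoint p) {
        online.put(p.entityId + "|" + p.type, p);
        historical.add(p);
    }

    /** Most recently written point for the entity and type, or null if none. */
    DataPoint latest(String entityId, String type) {
        return online.get(entityId + "|" + type);
    }

    /** Newest point per data point type for one entity, with timestamp at or before asOf. */
    Map<String, DataPoint> snapshotAsOf(String entityId, long asOf) {
        Map<String, DataPoint> snapshot = new HashMap<>();
        for (DataPoint p : historical) {
            if (!p.entityId.equals(entityId) || p.timestamp > asOf) {
                continue;
            }
            DataPoint seen = snapshot.get(p.type);
            if (seen == null || seen.timestamp <= p.timestamp) {
                snapshot.put(p.type, p);
            }
        }
        return snapshot;
    }
}
```

The overwrite path serves prediction-time lookups, while the append path is what makes point-in-time snapshots and training sets possible.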
  • 32. FIRST TRY: AWS AURORA RDBMS. "HEY IT SHOULD GET US UP TO 10,000 / SECOND" … THE UPSERT PROBLEM: DIDN'T WORK MUCH PAST 3,000 / SECOND … SOON OUR INPUT RATE WAS 20,000 / SECOND
  • 33. MITIGATION: STORE ONLY AT DIAGNOSIS TIME. STOPPED STORING ALL RAW DATA POINTS; ONLY STORE DATA POINTS ALONGSIDE DIAGNOSIS. CON: ONLY A SMALL % OF DATA POINTS. CON: DATA NOT CURRENT, ONLY AS OF RECENT DIAGNOSIS. CON: LIMITED USEFULNESS. [Diagram labels: Feature Sink; Model Execution; Feature Store; Prediction / Outcome Store; Prediction Store Sink]
  • 34. OPTIMIZING THE WRITE PATH. WHICH FLINK FLOW WRITES TO FEATURE STORE? FEATURE STORE NOT NEEDED FOR TRIGGER. FEATURE STORE NOT NEEDED FOR MODEL EXECUTION. SEPARATE 'TRANSFORM AND SINK' FLOW. [Diagram labels, BEFORE and AFTER panels: Online Feature Store; Feature Manager; Feature Store REST Service; Feature Sink Flow; FLINK LOCAL STATE; Upstream Flow; Model Execution]
  • 35. ALTERNATIVE DATA STORES. CASSANDRA: COST AND COMPLEXITY? TIME SERIES DB: DRUID, INFLUX. REDIS: STORE ONLY LATEST DATA POINTS? FLINK QUERYABLE STATE: PRODUCTION HARDENING? READ VOLUME?
  • 36. REDIS. WE ONLY NEED < 7 DAYS OF DATA, ONLY LATEST VALUE, SO USE REDIS AS THE KV STORE. ACCESSIBLE BY OTHER CONSUMERS OR VIA A SERVICE. COULDN'T PUT MORE THAN 8,000 / SECOND FOR A 'REASONABLE' CLUSTER SIZE. FLINK SINK CONNECTOR IS NOT CLUSTER-ENABLED; ONE OBJECT PUT PER CALL. JEDIS VS LETTUCE FRAMEWORK AND PIPELINING. SOME OPTIMIZATION IF ABLE TO BATCH REDIS SHARDS BY PRE-COMPUTING
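"Batch Redis shards by pre-computing" can be read as: compute each key's cluster hash slot on the client, then group writes so that each group can go out as one pipelined batch. The sketch below shows only that grouping step, using the CRC16 variant and the 16384-slot modulus from the Redis Cluster specification. Hash tags ("{...}") and the slot-to-node mapping are left out, and the class is a hypothetical illustration rather than the approach the team actually built.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RedisSlotBatcher {
    /** CRC16/XMODEM (polynomial 0x1021, initial value 0), as used for Redis Cluster key slots. */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** Hash slot of a key. Hash tags are deliberately not handled in this sketch. */
    static int slot(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    /** Groups keys by slot so that each group can be sent as one pipelined batch. */
    static Map<Integer, List<String>> groupBySlot(List<String> keys) {
        Map<Integer, List<String>> batches = new TreeMap<>();
        for (String key : keys) {
            batches.computeIfAbsent(slot(key), s -> new ArrayList<>()).add(key);
        }
        return batches;
    }
}
```

A real sink would go one step further and merge slots owned by the same node, since a node serves many slots and grouping by slot alone produces smaller batches than necessary.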
  • 37. FLINK QUERYABLE STATE. REUSE OUR FEATURE REQUEST/RESPONSE FRAMEWORK. SCALABLE, FAST. OUR READ LOAD IS LOW (10 / SECOND). EXPENSIVE (RELATIVELY). NOT A "PRODUCTION" CAPABILITY. NOT A "DATABASE" TECHNOLOGY
  • 38. QUERYABLE STATE - ONLINE FEATURE STORE. [Diagram labels. MODEL EXECUTION PATH: Customer Experience Data Point Producers; Upstream Flow; Feature Manager (FLINK LOCAL STATE); Model Execution. ONLINE FEATURE STORE PATH: Upstream Flow; Online Feature Store (FLINK LOCAL STATE); Feature REST Service (Queryable State Client); REQUESTING APPLICATION]
  • 40. RAMPED UP PRODUCTION – JULY 2018
  • 41. ADDED NEW DATA POINT TYPES: 600+ Million
  • 44. INCREASED DATA POINT FREQUENCY: ONCE EVERY 24 HOURS → ONCE EVERY 4 HOURS → ONCE EVERY 5 MINUTES ???
  • 45. FLINK CLUSTER COULDN'T KEEP UP WITH KAFKA VOLUME. SYMPTOM: KAFKA LATENCY INCREASING. [Chart: CONSUMER LAG over TIME]
  • 46. KAFKA OPTIMIZATION. INCREASE PARTITIONS: 1 KAFKA PARTITION READ BY 1 FLINK TASK THREAD. MATCH PARALLELISM TO PARTITIONS (OR EVEN MULTIPLE): 100 KAFKA PARTITIONS / 50 SLOT PARALLELISM; 50 KAFKA PARTITIONS / 50 SLOT PARALLELISM. ADJUST KAFKA CONFIGURATION (OLD → NEW). # KAFKA CONSUMER SETTINGS: MAX.POLL.RECORDS = 500 → 10000; FETCH.MIN.BYTES = 1 → 102400; FETCH.MAX.WAIT.MS = 500 → 1000. # KAFKA PRODUCER SETTINGS: BUFFER.MEMORY = 268435456; BATCH.SIZE = 131072 → 524288; LINGER.MS = 0 → 50; REQUEST.TIMEOUT.MS = 90000 → 600000
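The tuned values from the slide can be carried as plain java.util.Properties, which is the form Kafka clients accept. The sketch below uses the lowercase key names from the Kafka configuration reference and the slide's new values. The class and method names are hypothetical, and connection settings such as bootstrap servers and deserializers are omitted.

```java
import java.util.Properties;

final class KafkaTuningSketch {
    /** Consumer settings from the slide: larger polls and fetches, slightly longer fetch wait. */
    static Properties consumerProps() {
        Properties p = new Properties();
        p.setProperty("max.poll.records", "10000");   // was 500
        p.setProperty("fetch.min.bytes", "102400");   // was 1
        p.setProperty("fetch.max.wait.ms", "1000");   // was 500
        return p;
    }

    /** Producer settings from the slide: bigger batches, a short linger, longer request timeout. */
    static Properties producerProps() {
        Properties p = new Properties();
        p.setProperty("buffer.memory", "268435456");
        p.setProperty("batch.size", "524288");         // was 131072
        p.setProperty("linger.ms", "50");              // was 0
        p.setProperty("request.timeout.ms", "600000"); // was 90000
        return p;
    }
}
```

These settings trade a little latency for throughput: the consumer waits for fuller fetches and the producer lingers to fill larger batches.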
  • 47. MODEL EXECUTION ORCHESTRATION. MODEL EXECUTION BACKUP PROBLEM: GENERAL SLOWNESS BACKS UP MODEL EXECUTION PIPELINE. [Diagram labels: Model Execution (AsyncDataStream, 60 second timeout); Model Execution REST Service; Feature Assembly (5 Minute Global Window); Feature Getter]
  • 48. NETWORK IN CLUSTER: NETWORK I/O TRAFFIC. [Diagram labels: KAFKA BROKERS with PARTITION 1 to PARTITION 6; six FlinkKafkaConsumer instances across NODE 1, NODE 2, NODE 3; six KeyedStreamOperator instances; P1 P2 P3 P4; NETWORK OUT; keyBy() SHUFFLE]
  • 49. GOOD READ: HTTPS://WWW.VERVERICA.COM/BLOG/HOW-TO-SIZE-YOUR-APACHE-FLINK-CLUSTER-GENERAL-GUIDELINES. CLUSTER NETWORK I/O VOLUME: (10,000 MESSAGES / SECOND) * (1KB / MESSAGE) = 10 MB / SECOND, ~100 MBITS/SEC NETWORK. SOLUTION – LARGER NODE TYPES WITH MORE VCPU AND HIGHER PARALLELISM… FEWER NODES IN CLUSTER. SWEET SPOT SEEMED TO BE ~20 NODES
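The sizing arithmetic on this slide can be written out as a small helper. The sketch below assumes decimal units (1 KB = 1,000 bytes) and an even spread of keys across nodes. Under those assumptions the slide's example comes to 80 Mbit/s of payload, which the slide rounds to roughly 100 Mbit/s. The helper names are hypothetical.

```java
final class NetworkSizingSketch {
    /** Payload megabits per second for a message rate and an average message size in bytes. */
    static double ingestMbitPerSec(long messagesPerSec, long bytesPerMessage) {
        return messagesPerSec * bytesPerMessage * 8 / 1_000_000.0;
    }

    /**
     * Fraction of records that leave their node on a keyBy() shuffle,
     * assuming keys are spread evenly over the nodes.
     */
    static double shuffleFractionOverNetwork(int nodes) {
        return (nodes - 1) / (double) nodes;
    }
}
```

For example, `ingestMbitPerSec(10_000, 1_000)` gives 80.0, and with 20 nodes about 95% of keyed records cross the network. Fewer, larger nodes keep more of the shuffle local, which fits the slide's move to larger node types.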
  • 51. KEEPING TRACK OF DATA POINT STATE. 25+ MILLION HIGH SPEED INTERNET CUSTOMERS, 10+ DATA POINT TYPES. TWO FLINK STATES TO MANAGE: (1) QUERYABLE STATE FEATURE STORE: UP TO 15KB (JSON) PER DATA POINT; 10KB * 25 MILLION = 250 GB; 1KB * 25 MILLION = 25 GB; APPROXIMATE TOTAL = 1 TB. (2) RED / GREEN "TRIGGER": MAPSTATE() / KEYBY(ENTITY ID); [ENTITY ID] → (DATA POINT TYPE, STATE); 25 MILLION CUSTOMERS; 10+ DATA POINT TYPES; AVG 33 BYTES PER ENTITY ID; ~1GB TOTAL
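The red / green trigger state is small because it stores only a status per data point type for each entity: 25 million entities at about 33 bytes each is roughly 825 MB, in line with the slide's ~1 GB. The sketch below models that layout in plain Java. In Flink the outer map would correspond to the keyBy(entity ID) scope and the inner map to a MapState. The rule that a trigger fires on a change to RED is an assumption for illustration, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class TriggerStateSketch {
    enum Status { GREEN, RED }

    // entity ID -> (data point type -> last known status)
    private final Map<String, Map<String, Status>> byEntity = new HashMap<>();

    /**
     * Records the latest status for one entity and data point type.
     * Returns true only when the status has just changed to RED (assumed trigger condition).
     */
    boolean update(String entityId, String dataPointType, Status status) {
        Map<String, Status> perType = byEntity.computeIfAbsent(entityId, id -> new HashMap<>());
        Status previous = perType.put(dataPointType, status);
        return status == Status.RED && previous != Status.RED;
    }
}
```

Keeping only the status here, rather than the full JSON data point, is what separates this state from the much larger feature store state.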
  • 53. CHECKPOINT TUNING. FEATURE MANAGER – UP TO 1TB OF STATE. 2.5 MINUTES WRITING A CHECKPOINT EVERY 10 MINUTES. TURNED OFF FOR A WHILE! (RATIONALE: WOULD ONLY TAKE A FEW HOURS TO RECOVER ALL STATE). INCREMENTAL CHECKPOINTING HELPED. LONG ALIGNMENT TIMES ON HIGH PARALLELISM. REDUCED THE SIZE OF STATE FOR 4 AND 24 HOUR WINDOWS USING FEATURE LOOKUP TABLE
  • 55. AMAZON EMR AND HIGH AVAILABILITY. EMR FLINK CLUSTER: SINGLE AVAILABILITY ZONE (AZ); SINGLE MASTER NODE; NOT HA. [Diagram labels: AWS REGION with AZ1 and AZ2; EMR CLUSTER 1 (M T T T T); EMR CLUSTER 2 (M T T T T); S3 Checkpoints]
  • 56. AMAZON AWS EMR. SESSIONS KEPT DYING: TMP DIR IS PERIODICALLY CLEARED, WHICH BREAKS 'BLOB STORE' ON LONG RUNNING JOBS. SOLUTION - CONFIGURE TO NOT BE IN /TMP: jobmanager.web.tmpdir, blob.storage.directory. EMR FLINK VERSION LAGS APACHE RELEASE BY 1-2 MONTHS
  • 57. TECHNOLOGY OPTIONS: FLINK ON KUBERNETES; AMAZON EKS (ELASTIC KUBERNETES SERVICE); SELF-MANAGED KUBERNETES ON EC2; DATA ARTISANS VERVERICA PLATFORM. Bonus: Reduced Cost / Overhead by having fewer Master Nodes; Less overhead for smaller (low parallelism) flows
  • 61. DATA POINT VOLUME MAY–NOVEMBER 2018. OCTOBER 2018 – 1 BILLION. NOVEMBER 2018 – 2 BILLION
  • 62. LET'S ADD ANOTHER 3 BILLION! 15 MILLION DEVICES, 3 BILLION DATA POINTS
  • 63. SOLUTION APPROACH. 1: ISOLATE TRIGGER FOR HIGH-VOLUME DATA POINT TYPES. 2: ISOLATE ONLINE FEATURE STORE FOR HIGH-VOLUME DATA POINT TYPES. 3: ISOLATE DEDICATED FEATURE STORAGE FLOW TO HIGH-SPEED FEATURE STORE. 4: SEPARATE HISTORICAL FEATURE STORAGE (S3 WRITER) FROM REAL TIME MODEL EXECUTION FLOWS
  • 64. 1: SEPARATE HIGH VOLUME TRIGGER FLOWS. [Diagram labels: State Flow And Trigger (FLINK LOCAL STATE); HIGH VOLUME Data Point Producer 1 and HIGH VOLUME State / Trigger 1 (FLINK LOCAL STATE); HIGH VOLUME Data Point Producer 2 and HIGH VOLUME State / Trigger 2 (FLINK LOCAL STATE); ONLINE FEATURE STORE; MODEL EXECUTION; 3B IN, 50M OUT]
  • 65. 2: FEDERATED ONLINE FEATURE STORE. [Diagram labels: REQUESTING APPLICATION; Feature REST Service (Queryable State Client); Model / Feature Metadata; FEDERATED FEATURE STORE containing Online Feature Store (FLINK LOCAL STATE), Online Feature Store 1 with HIGH VOLUME Data Point Producer 1 (FLINK LOCAL STATE), and Online Feature Store 2 with HIGH VOLUME Data Point Producer 2 (FLINK LOCAL STATE)]
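One way to read the federated store is that the REST service picks the backing online store for each request from the data point type, with general-purpose types falling through to a default store. The sketch below shows that lookup only. The routing rule, the use of metadata to populate it, and every name are assumptions for illustration, since the slides do not spell out the mechanism.

```java
import java.util.HashMap;
import java.util.Map;

final class FederatedStoreRouter {
    // data point type -> name of the dedicated online store that holds it
    private final Map<String, String> storeByType = new HashMap<>();
    private final String defaultStore;

    FederatedStoreRouter(String defaultStore) {
        this.defaultStore = defaultStore;
    }

    /** Registers a dedicated store for one high-volume data point type. */
    void register(String dataPointType, String store) {
        storeByType.put(dataPointType, store);
    }

    /** Store to query for a data point type; unregistered types use the default store. */
    String storeFor(String dataPointType) {
        return storeByType.getOrDefault(dataPointType, defaultStore);
    }
}
```

Requesting applications keep a single endpoint, while each high-volume type can be scaled or moved by changing one registration.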
  • 66. 3+4: HISTORICAL FEATURE STORE. [Diagram labels: REQUESTING APPLICATION; Feature REST Service (Queryable State Client); Online Feature Store (FLINK LOCAL STATE); HIGH VOLUME Data Point Producer 1; HIGH VOLUME State / Trigger (FLINK LOCAL STATE); Feature Producer; Feature Sink; Historical Feature Store (S3); MODEL EXECUTION]
  • 67. CURRENT ARCHITECTURE. [Diagram labels: Customer Experience Data Point Producers; HIGH VOLUME Data Point Producer (two shown), each with High Volume Feature Producer, Feature Sink, State Trigger; State Flow And Trigger; FEDERATED ONLINE FEATURE STORE (FLINK LOCAL STATE); Feature Store REST Service; Historical Feature Store with Feature Sink; FEATURE MANAGER; MODEL EXECUTION with Feature Manager, Feature Assembly, Model Execution, Prediction Sink; Prediction / Outcome Store; Prediction Store REST Service]
  • 68. WHERE ARE WE TODAY. INDICATORS: 725+ MILLION DATA POINTS PER DAY. HIGH-VOLUME INDICATORS: 6 BILLION PER DAY. TOTAL: ~7 BILLION DATA POINTS PER DAY
  • 69. WHERE ARE WE TODAY – FLINK CLUSTERS. CLUSTERS: 14. INSTANCES: 150. VCPU: 1100. RAM: 5.8 TB
  • 70. FUTURE WORK. EXPAND CUSTOMER EXPERIENCE INDICATORS TO OTHER COMCAST PRODUCTS. IMPROVED ML MODELS BASED ON RESULTS AND CONTINUOUS RETRAINING. MODULAR MODEL EXECUTION VIA KUBEFLOW AND GRPC CALLS