Flink Forward San Francisco 2019: Adventures in Scaling from Zero to 5 Billion Data Points per Day - Dave Torok
Adventures in Scaling from Zero to 5 Billion Data Points per Day
At Flink Forward San Francisco 2018, our team at Comcast presented the operationalized streaming ML framework that had just gone into production. This year, in just a few short months, we scaled a Customer Experience use case from an initial trickle of volume to processing over 5 billion data points per day. The use case helps diagnose potential issues with High Speed Data service and provides recommendations for solving those issues as quickly and as cost-effectively as possible.

As with any solution that grows quickly, our platform faced challenges, bottlenecks, and technology limits, forcing us to adapt quickly and evolve our approach to handle 50,000+ data points per second.

We will introduce the problems, approaches, solutions, and lessons we learned along the way, including: The Trigger and Diagnosis Problem, The REST Problem, The “Feature Store” Problem, The “Customer State” Problem, The Checkpoint Problem, The HA Problem, The Volume Problem, and of course The Really High Volume Feature Store Problem #2.

  1. 1. ADVENTURES IN SCALING FROM ZERO TO 5 BILLION DATA POINTS PER DAY Dave Torok Distinguished Architect Comcast Corporation 2 April, 2019 Flink Forward – San Francisco 2019
  2. 2. 2 COMCAST CUSTOMER RELATIONSHIPS 30.3 MILLION OVERALL CUSTOMER RELATIONSHIPS AT 2018 YEAR END 25.1 MILLION RESIDENTIAL HIGH-SPEED INTERNET CUSTOMERS AT 2018 YEAR END 1.2 MILLION RESIDENTIAL HIGH-SPEED INTERNET CUSTOMER NET ADDITIONS IN 2018
  3. 3. 3 DELIVER THE ULTIMATE CUSTOMER EXPERIENCE IS THE CUSTOMER HAVING A GOOD EXPERIENCE FOR HIGH SPEED DATA (HSD) SERVICE? IF THE CUSTOMER ENGAGES US DIGITALLY, CAN WE OFFER A SELF-SERVICE SOLUTION? IF THERE IS AN ISSUE CAN WE OFFER OUR AGENTS AND TECHNICIANS A DIAGNOSIS TO HELP SOLVE THE PROBLEM QUICKER? REDUCE THE TIME TO RESOLVE ISSUES REDUCE COST TO THE BUSINESS AND THE CUSTOMER
  4. 4. 4 HIGH LEVEL CONCEPT Customer Experience Diagnosis Platform Service Delivery Systems Predictions / Diagnosis / Treatments CUSTOMER ENGAGEMENT PLATFORM CUSTOMER EXPERIENCE INDICATORS Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
  5. 5. 5 CUSTOMER EXPERIENCE INDICATORS Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
  6. 6. 6 NINE ADVENTURES IN SCALING THE TRIGGER AND DIAGNOSIS PROBLEM THE FEATURE STORE PROBLEM THE CHECKPOINT PROBLEM THE REST PROBLEM THE VOLUME PROBLEM THE HA PROBLEM THE INEFFICIENT OBJECT HANDLING PROBLEM THE CUSTOMER STATE PROBLEM THE REALLY HIGH VOLUME AND FEATURE STORE PROBLEM #2
  7. 7. 7 ML FRAMEWORK ARCHITECTURE – 2018 “EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE” – SF FLINK FORWARD 2018 REQUESTING APPLICATION REQUESTING APPLICATION Feature Creation Pipelines ML FRAMEWORK Online Feature Store Model/ Feature Metadata Feature Check Async IO Feature Assembly Model Execution Prediction / Outcome Store Prediction Store Sink Prediction Store REST Service Feature Store REST Service Model Execution Request REST Service Model Prediction REST Service Model Execute Async IO FEATURE SOURCES / DATA AND SERVICES
  8. 8. THE TRIGGER AND DIAGNOSIS PROBLEM
  9. 9. 9 MAY 2018 – INITIAL VOLUMES INDICATOR #1 9 MILLION INDICATOR #2 166 MILLION INDICATOR #3 1.2 MILLION INDICATOR #4 2500 (SMALL TRIAL) TOTAL 175 MILLION / DAY Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
  10. 10. 1 0 175 MILLION PREDICTIONS? 175 MILLION DATA POINTS ?????? PREDICTION / DIAGNOSIS OUTPUT STORE
  11. 11. 1 1 INTRODUCE A ‘TRIGGER’ CONDITION GREEN GREEN GREEN RED GREEN GREEN
  12. 12. 1 2 INTRODUCE A ‘TRIGGER’ CONDITION GREEN GREEN GREEN RED GREEN GREEN GREEN RED GREEN RED
  13. 13. 1 3 INTRODUCE A ‘TRIGGER’ CONDITION GREEN GREEN GREEN RED GREEN GREEN GREEN RED GREEN RED – DETECT STATE CHANGE
  14. 14. 1 4 INTRODUCE A ‘TRIGGER’ CONDITION GREEN GREEN GREEN RED GREEN GREEN GREEN GREEN GREEN RED GREEN RED GREEN RED GREEN RED – DETECT STATE CHANGE – SEND FOR DIAGNOSIS
  15. 15. 1 5 State Flow And Trigger INTRODUCE FLINK LOCAL STATE STATE MANAGER AND TRIGGER 175 MILLION DATA POINTS Customer Experience Data Point Producers Customer Experience Diagnosis Platform “A FEW MILLION” TRIGGERS FLINK LOCAL STATE PREDICTION / DIAGNOSIS OUTPUT STORE
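A minimal sketch of the state-and-trigger idea on this slide, assuming the stream is keyed by entity ID (DataPoint and Trigger are illustrative placeholder types, not the production classes):

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by entity ID: keep the latest RED/GREEN status per data point type in
// Flink local state and emit a trigger only when a data point changes that status.
public class StateChangeTrigger extends KeyedProcessFunction<String, DataPoint, Trigger> {

    private transient MapState<String, String> statusByType;  // data point type -> RED/GREEN

    @Override
    public void open(Configuration parameters) {
        statusByType = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("data-point-status", String.class, String.class));
    }

    @Override
    public void processElement(DataPoint dp, Context ctx, Collector<Trigger> out) throws Exception {
        String previous = statusByType.get(dp.type);
        if (previous == null || !previous.equals(dp.status)) {      // detect state change
            statusByType.put(dp.type, dp.status);
            if ("RED".equals(dp.status)) {                          // send for diagnosis
                out.collect(new Trigger(ctx.getCurrentKey(), dp.type, dp.status));
            }
        }
    }
}

Keeping the comparison in keyed local state is what reduces 175 million daily data points to “a few million” triggers.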
  16. 16. THE REST PROBLEM
  17. 17. 1 7 ML FRAMEWORK ARCHITECTURE – 2018 “EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE” – SF FLINK FORWARD 2018 REQUESTING APPLICATION REQUESTING APPLICATION Feature Creation Pipelines ML FRAMEWORK Online Feature Store Model/ Feature Metadata Feature Check Async IO Feature Assembly Model Execution Prediction / Outcome Store Prediction Store Sink Prediction Store REST Service Feature Store REST Service Model Execution Request REST Service Model Prediction REST Service Model Execute Async IO FEATURE SOURCES / DATA AND SERVICES
  18. 18. 1 8 ML FRAMEWORK State Flow And Model Execution Trigger INITIAL PLATFORM ARCHITECTURE Online Feature Store Model/ Feature Metadata Feature Getter Async IO Feature Assembly Feature Sink Model Execution Prediction / Outcome Store Prediction Store Sink Prediction Store REST Service Feature Store REST Service Customer Experience Data Point Producers Model Execution Request REST Service Model Prediction REST Service FLINK LOCAL STATE Model Execute Async IO REQUESTING APPLICATION USE CASE
  19. 19. 1 9 ORIGINAL ML FRAMEWORK ARCHITECTURE EVERY COMMUNICATION WAS A REST SERVICE ML FRAMEWORK ISOLATED FROM DATA POINT USE CASE BY DESIGN FLINK ASYNC + REST IS DIFFICULT TO SCALE ELASTICALLY HTTP REQUEST RESPONSE
  20. 20. 2 0 MAX THROUGHPUT PER SECOND = (SLOTS) * (THREADS) * (1000 / TASK DURATION MSEC) (12 SLOTS) * (20 THREADS) * (1000 / 300) = 800 / SECOND TUNING ASYNC I/O – LATENCY AND VOLUME APACHE HTTP CLIENT / EXECUTOR THREADS setMaxTotal setDefaultMaxPerRoute (THREADS) FLINK ASYNC MAX OUTSTANDING RichAsyncFunction “Capacity” AsyncDataStream.unorderedWait() with max capacity parameter RATE THROTTLING if (numberOfRequestsPerPeriod > 0) { int numberOfProcesses = getRuntimeContext() .getNumberOfParallelSubtasks(); numberOfRequestsPerPeriodPerProcess = Math.max(1, numberOfRequestsPerPeriod / numberOfProcesses); }
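A hedged fragment showing how the knobs named on this slide fit together (pool sizes, timeout, capacity, and the ModelExecuteAsyncFunction / ModelRequest / Prediction names are illustrative): the HTTP connection pool bounds the (THREADS) term, and the async capacity caps in-flight requests per subtask.

// HTTP client pool: setMaxTotal / setDefaultMaxPerRoute bound concurrent REST calls
PoolingHttpClientConnectionManager connections = new PoolingHttpClientConnectionManager();
connections.setMaxTotal(240);
connections.setDefaultMaxPerRoute(20);            // the (THREADS) term in the formula above
CloseableHttpClient httpClient = HttpClients.custom()
        .setConnectionManager(connections).build();

// Flink async operator: "capacity" caps outstanding async requests per subtask
DataStream<Prediction> predictions = AsyncDataStream.unorderedWait(
        modelRequests,                             // upstream DataStream<ModelRequest>
        new ModelExecuteAsyncFunction(httpClient), // a RichAsyncFunction issuing the REST call
        60, TimeUnit.SECONDS,                      // per-request timeout
        100);                                      // max in-flight async requests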
  21. 21. 2 1 State Flow And Model Execution Trigger REPLACE REST CALLS WITH KAFKA (I) Online Feature Store Model/ Feature Metadata Feature Getter Async IO Feature Assembly Prediction / Outcome Store Prediction Store REST Service Feature Store REST Service Customer Experience Data Point Producers FLINK LOCAL STATE REQUESTING APPLICATION Feature Sink (JDBCAppendTableSink) Model Execution Prediction Store Sink (JDBCAppendTableSink)
  22. 22. 2 2 REPLACE REST CALL WITH KAFKA (II) Online Feature Store Model/ Feature Metadata Feature Assembly Feature Sink Flow (JDBCAppendTableSink) Model Execution Prediction / Outcome Store Prediction Store Sink (JDBCAppendTableSink) Customer Experience Data Point Producers FLINK LOCAL STATE State Flow And Trigger FLINK LOCAL STATE FLINK LOCAL STATE Feature Store Manager FLINK LOCAL STATE
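Roughly, the change amounts to swapping the REST hop for a Kafka topic between the flows. A fragment sketching the idea (topic names, brokers, group ID, and the Trigger type are illustrative):

Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "broker1:9092");
kafkaProps.setProperty("group.id", "ml-framework");

// State/Trigger flow: publish model execution requests instead of calling REST
triggers.map(t -> t.toJson())
        .addSink(new FlinkKafkaProducer<>("model-execution-requests",
                new SimpleStringSchema(), kafkaProps));

// Feature Assembly / Model Execution flow: consume the same topic
DataStream<String> requests = env.addSource(
        new FlinkKafkaConsumer<>("model-execution-requests",
                new SimpleStringSchema(), kafkaProps));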
  23. 23. THE INEFFICIENT OBJECT HANDLING PROBLEM
  24. 24. 2 4 HAS THIS EVER HAPPENED TO YOU? New “JOLT” JSON Transformer (but don’t cache it) → New ObjectMapper() (but don’t cache it) → Parse JSON string into Map → Load and parse “Transform” JSON from the resource path (but don’t cache it) → Serialize transformed Map back into string result String transformedJson = new JoltJsonTransformer() .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH)); JSONObject inputPayload = new JSONObject(transformedJson); try { aType = inputPayload.getJSONObject("info") .getJSONObject("aType"). getString("string"); } catch (Exception e) { // ignore, invalid object aType = STATUS_INVALID; }
  25. 25. 2 5 HAS THIS EVER HAPPENED TO YOU? New “JOLT” JSON Transformer (but don’t cache it) → New ObjectMapper() (but don’t cache it) → Parse JSON string into Map → Load and parse “Transform” JSON from the resource path (but don’t cache it) → Serialize transformed Map back into string result Then let’s make a NEW JSONObject from that string String transformedJson = new JoltJsonTransformer() .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH)); JSONObject inputPayload = new JSONObject(transformedJson); try { aType = inputPayload.getJSONObject("info") .getJSONObject("aType"). getString("string"); } catch (Exception e) { // ignore, invalid object aType = STATUS_INVALID; }
  26. 26. 2 6 HAS THIS EVER HAPPENED TO YOU? New “JOLT” JSON Transformer (but don’t cache it) → New ObjectMapper() (but don’t cache it) → Parse JSON string into Map → Load and parse “Transform” JSON from the resource path (but don’t cache it) → Serialize transformed Map back into string result Then let’s make a NEW JSONObject from that string Go deep into the path to get value String transformedJson = new JoltJsonTransformer() .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH)); JSONObject inputPayload = new JSONObject(transformedJson); try { aType = inputPayload.getJSONObject("info") .getJSONObject("aType"). getString("string"); } catch (Exception e) { // ignore, invalid object aType = STATUS_INVALID; }
  27. 27. 2 7 HAS THIS EVER HAPPENED TO YOU? New “JOLT” JSON Transformer (but don’t cache it) → New ObjectMapper() (but don’t cache it) → Parse JSON string into Map → Load and parse “Transform” JSON from the resource path (but don’t cache it) → Serialize transformed Map back into string result Then let’s make a NEW JSONObject from that string Go deep into the path to get value Cast exception? STATUS_INVALID String transformedJson = new JoltJsonTransformer() .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH)); JSONObject inputPayload = new JSONObject(transformedJson); try { aType = inputPayload.getJSONObject("info") .getJSONObject("aType"). getString("string"); } catch (Exception e) { // ignore, invalid object aType = STATUS_INVALID; }
  28. 28. 2 8 SLIGHTLY MORE EFFICIENT SOLUTION Cache our new ObjectMapper() Parse JSON string into Map ONCE Lightweight Map Traversal utility Java Stream syntax Optional.ofNullable Optional.orElse if (mapper == null) mapper = new ObjectMapper(); Map<?, ?> resultMap = JsonUtils.readValue(mapper, (String) result, Map.class); MapTreeWalker mapWalker = new MapTreeWalker(resultMap); final String aType = mapWalker.step("info") .step("aSchemaInfo") .step("aType") .step("string") .<String>get(). orElse(STATUS_INVALID);
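MapTreeWalker appears to be an in-house helper; a minimal sketch of what such a null-safe traversal utility might look like (an assumption, not the actual implementation):

import java.util.Map;
import java.util.Optional;

// Each step() descends one level if the current node is a Map containing the key;
// get() wraps whatever is left (possibly null) in an Optional.
public final class MapTreeWalker {

    private final Object current;

    public MapTreeWalker(Object root) { this.current = root; }

    public MapTreeWalker step(String key) {
        Object next = (current instanceof Map) ? ((Map<?, ?>) current).get(key) : null;
        return new MapTreeWalker(next);
    }

    @SuppressWarnings("unchecked")
    public <T> Optional<T> get() {
        return Optional.ofNullable((T) current);
    }
}

The payoff is that a single parsed Map is traversed cheaply, instead of re-serializing and re-parsing JSON on every access.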
  29. 29. 2 9 FLINK OBJECT EFFICIENCIES REDUCE OBJECT CREATION AND GARBAGE COLLECTION BE CAREFUL OF SIDE EFFECTS WITHIN OPERATORS StreamExecutionEnvironment env; env.getConfig().enableObjectReuse(); SEMANTIC ANNOTATIONS @NonForwardedFields @ForwardedFields
  30. 30. THE FEATURE STORE PROBLEM
  31. 31. 3 1 WHAT IS A FEATURE STORE? “FEATURE” IS A DATA POINT INPUT TO ML MODEL WE NEED TO: • STORE ALL THE INPUT DATA POINTS • SNAPSHOT AT ANY MOMENT IN TIME • ASSOCIATE WITH A DIAGNOSIS (MODEL OUTPUT) • HAVE ACCESS TO THE DATA POINTS (API FOR OTHER APPS) • STORE ALL DATA POINTS FOR ML MODEL TRAINING Feature Creation Pipeline Historical Feature Store Online Feature Store Prediction Phase Model Training Phase Append / Overwrite
  32. 32. 3 2 FIRST TRY: AWS AURORA RDBMS “HEY IT SHOULD GET US UP TO 10,000 / SECOND” …THE UPSERT PROBLEM DIDN’T WORK MUCH PAST 3,000 / SECOND …SOON OUR INPUT RATE WAS 20,000 / SECOND
  33. 33. 3 3 MITIGATION: STORE ONLY AT DIAGNOSIS TIME STOPPED STORING ALL RAW DATA POINTS ONLY STORE DATA POINTS ALONGSIDE DIAGNOSIS CON: ONLY A SMALL % OF DATA POINTS CON: DATA NOT CURRENT, ONLY AS OF RECENT DIAGNOSIS CON: LIMITED USEFULNESS Feature Sink Model Execution Feature Store Prediction / Outcome Store Prediction Store Sink
  34. 34. 3 4 OPTIMIZING THE WRITE PATH WHICH FLINK FLOW WRITES TO FEATURE STORE? FEATURE STORE NOT NEEDED FOR TRIGGER FEATURE STORE NOT NEEDED FOR MODEL EXECUTION SEPARATE ‘TRANSFORM AND SINK’ FLOW Online Feature Store Feature Manager Feature Store REST Service Online Feature Store Feature Sink Flow Feature Manager FLINK LOCAL STATE Upstream Flow Model Execution BEFORE AFTER
  35. 35. 3 5 ALTERNATIVE DATA STORES CASSANDRA COST AND COMPLEXITY? TIME SERIES DB DRUID INFLUX REDIS STORE ONLY LATEST DATA POINTS? FLINK QUERYABLE STATE PRODUCTION HARDENING? READ VOLUME?
  36. 36. 3 6 REDIS WE ONLY NEED < 7 DAYS OF DATA, ONLY LATEST VALUE, SO USE REDIS AS THE KV STORE ACCESSIBLE BY OTHER CONSUMERS OR VIA A SERVICE COULDN'T PUT MORE THAN 8,000 / SECOND FOR A 'REASONABLE' CLUSTER SIZE FLINK SINK CONNECTOR IS NOT CLUSTER-ENABLED ONE OBJECT PUT PER CALL JEDIS VS LETTUCE FRAMEWORK AND PIPELINING SOME OPTIMIZATION IF ABLE TO BATCH REDIS SHARDS BY PRE-COMPUTING
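One way the batching/pipelining point could look in a Flink sink, sketched with Jedis against a single Redis node (the production store was sharded; the key scheme, TTL, batch size, and the Feature type are illustrative):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

// Buffer feature writes and flush them in pipelined batches instead of one
// round-trip per data point.
public class PipelinedRedisSink extends RichSinkFunction<Feature> {

    private transient Jedis jedis;
    private transient Pipeline pipeline;
    private transient int buffered;

    @Override
    public void open(Configuration parameters) {
        jedis = new Jedis("redis-host", 6379);
        pipeline = jedis.pipelined();
    }

    @Override
    public void invoke(Feature feature, Context context) {
        // latest value only, expiring after 7 days as described above
        pipeline.setex("feature:" + feature.entityId + ":" + feature.type,
                7 * 24 * 3600, feature.json);
        if (++buffered >= 500) {
            pipeline.sync();       // flush the batch
            buffered = 0;
        }
    }

    @Override
    public void close() {
        if (pipeline != null) pipeline.sync();
        if (jedis != null) jedis.close();
    }
}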
  37. 37. 3 7 FLINK QUERYABLE STATE REUSE OUR FEATURE REQUEST/RESPONSE FRAMEWORK SCALABLE, FAST OUR READ LOAD IS LOW (10 / SECOND) EXPENSIVE (RELATIVELY) NOT A “PRODUCTION” CAPABILITY NOT A “DATABASE” TECHNOLOGY
  38. 38. 3 8 QUERYABLE STATE - ONLINE FEATURE STORE Feature Manager FLINK LOCAL STATE Upstream Flow Model Execution Online Feature Store FLINK LOCAL STATE Feature REST Service (Queryable State Client) REQUESTING APPLICATION Upstream Flow Customer Experience Data Point Producers ONLINE FEATURE STORE PATH MODEL EXECUTION PATH
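A fragment sketching both sides of this picture with Flink's Queryable State API (the state name, host, proxy port, job ID placeholder, and the JSON-string value type are illustrative assumptions):

// Feature Manager flow: expose the keyed feature state for external queries
ValueStateDescriptor<String> featureDescriptor =
        new ValueStateDescriptor<>("online-feature-store", String.class);
featureDescriptor.setQueryable("online-feature-store");

// Feature REST Service (queryable state client): read a feature by entity ID
QueryableStateClient client = new QueryableStateClient("taskmanager-host", 9069);
CompletableFuture<ValueState<String>> future = client.getKvState(
        JobID.fromHexString("<job id of the feature manager flow>"),
        "online-feature-store",
        entityId,                              // key: entity ID (String)
        BasicTypeInfo.STRING_TYPE_INFO,
        featureDescriptor);
String featureJson = future.get().value();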
  39. 39. THE VOLUME PROBLEM
  40. 40. 4 0 RAMPED UP PRODUCTION – JULY 2018
  41. 41. 4 1 ADDED NEW DATA POINT TYPES 600+ Million
  42. 42. 4 2 INCREASED DATA POINT FREQUENCY ONCE EVERY 24 HOURS
  43. 43. 4 3 INCREASED DATA POINT FREQUENCY ONCE EVERY 24 HOURS ONCE EVERY 4 HOURS
  44. 44. 4 4 INCREASED DATA POINT FREQUENCY ONCE EVERY 24 HOURS ONCE EVERY 4 HOURS ONCE EVERY 5 MINUTES ???
  45. 45. 4 5 FLINK CLUSTER COULDN’T KEEP UP WITH KAFKA VOLUME SYMPTOM: KAFKA LATENCY INCREASING TIME CONSUMER LAG
  46. 46. 4 6 KAFKA OPTIMIZATION INCREASE PARTITIONS 1 KAFKA PARTITION READ BY 1 FLINK TASK THREAD MATCH PARALLELISM TO PARTITIONS (OR EVEN MULTIPLE) 100 KAFKA PARTITIONS 50 SLOT PARALLELISM 50 KAFKA PARTITIONS 50 SLOT PARALLELISM ADJUST KAFKA CONFIGURATION # KAFKA CONSUMER SETTINGS MAX.POLL.RECORDS = 500 → 10000 FETCH.MIN.BYTES = 1 → 102400 FETCH.MAX.WAIT.MS = 500 → 1000 # KAFKA PRODUCER SETTINGS BUFFER.MEMORY = 268435456 BATCH.SIZE = 131072 → 524288 LINGER.MS = 0 → 50 REQUEST.TIMEOUT.MS = 90000 → 600000
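As a fragment, the same tuning applied to a Flink Kafka source (topic, group, and broker names are illustrative); the key point is one Kafka partition per consuming task thread:

Properties consumerProps = new Properties();
consumerProps.setProperty("bootstrap.servers", "broker1:9092");
consumerProps.setProperty("group.id", "cx-diagnosis");
consumerProps.setProperty("max.poll.records", "10000");
consumerProps.setProperty("fetch.min.bytes", "102400");
consumerProps.setProperty("fetch.max.wait.ms", "1000");

DataStream<String> dataPoints = env
        .addSource(new FlinkKafkaConsumer<>("data-points", new SimpleStringSchema(), consumerProps))
        .setParallelism(50);   // 50 partitions read by 50 task threads (or an even multiple)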
  47. 47. 4 7 MODEL EXECUTION ORCHESTRATION MODEL EXECUTION BACKUP PROBLEM GENERAL SLOWNESS BACKS UP MODEL EXECUTION PIPELINE Model Execution AsyncDataStream 60 second timeout Model Execution REST Service Feature Assembly 5 Minute Global Window Feature Getter
  48. 48. 4 8 NETWORK IN CLUSTER NETWORK I/O TRAFFIC KAFKA BROKERS PARTITION 1 PARTITION 2 PARTITION 3 PARTITION 4 PARTITION 5 PARTITION 6 FlinkKafkaConsumer FlinkKafkaConsumer FlinkKafkaConsumer FlinkKafkaConsumer FlinkKafkaConsumer FlinkKafkaConsumer NODE 1 NODE 2 NODE 3 KeyedStreamOperator KeyedStreamOperator KeyedStreamOperator KeyedStreamOperator KeyedStreamOperator KeyedStreamOperator P1 P2 P3 P4 NETWORK OUT keyBy() SHUFFLE
  49. 49. 4 9 GOOD READ: https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines CLUSTER NETWORK I/O VOLUME (10,000 MESSAGES / SECOND) * (1KB / MESSAGE) = 10 MB / SECOND 100 MBITS/SEC NETWORK SOLUTION – LARGER NODE TYPES WITH MORE VCPU AND HIGHER PARALLELISM… FEWER NODES IN CLUSTER SWEET SPOT SEEMED TO BE ~20 NODES
  50. 50. THE CUSTOMER STATE PROBLEM
  51. 51. 5 1 KEEPING TRACK OF DATA POINT STATE QUERYABLE STATE FEATURE STORE UP TO 15KB (JSON) PER DATA POINT 10KB * 25 MILLION = 250 GB 1KB * 25 MILLION = 25 GB APPROXIMATE TOTAL = 1 TB RED / GREEN “TRIGGER” MAPSTATE<DATA POINT TYPE, STATE> / KEYBY (ENTITY ID): [ENTITY ID] → (DATA POINT TYPE, STATE) 25 MILLION CUSTOMERS 10+ DATA POINT TYPES AVG 33 BYTES PER ENTITY ID ~1GB TOTAL 25+ MILLION HIGH SPEED INTERNET CUSTOMERS 10+ DATA POINT TYPES TWO FLINK STATES TO MANAGE:
  52. 52. THE CHECKPOINT PROBLEM
  53. 53. 5 3 CHECKPOINT TUNING FEATURE MANAGER – UP TO 1TB OF STATE 2.5 MINUTES WRITING A CHECKPOINT EVERY 10 MINUTES TURNED OFF FOR AWHILE! (RATIONALE: WOULD ONLY TAKE A FEW HOURS TO RECOVER ALL STATE) INCREMENTAL CHECKPOINTING HELPED LONG ALIGNMENT TIMES ON HIGH PARALLELISM REDUCED THE SIZE OF STATE FOR 4 AND 24 HOUR WINDOWS USING FEATURE LOOKUP TABLE
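A fragment of the corresponding checkpoint configuration, assuming the RocksDB state backend (the path and intervals are illustrative, not the production values):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// second argument true = incremental checkpoints: only changed state files are uploaded
env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));

env.enableCheckpointing(10 * 60 * 1000);                          // every 10 minutes
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5 * 60 * 1000);
env.getCheckpointConfig().setCheckpointTimeout(15 * 60 * 1000);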
  54. 54. THE HA PROBLEM
  55. 55. 5 5 AMAZON EMR AND HIGH AVAILABILITY EMR FLINK CLUSTER SINGLE AVAILABILITY ZONE (AZ) SINGLE MASTER NODE NOT HA AWS REGION AZ1 AZ2 EMR CLUSTER 1 M T T T T EMR CLUSTER 2 M T T T T S3 Checkpoints
  56. 56. 5 6 AMAZON AWS EMR SESSIONS KEPT DYING TMP DIR IS PERIODICALLY CLEARED BREAKS ‘BLOB STORE’ ON LONG RUNNING JOBS SOLUTION - CONFIGURE TO NOT BE IN /TMP jobmanager.web.tmpdir blob.storage.directory EMR FLINK VERSION LAGS APACHE RELEASE BY 1-2 MONTHS
  57. 57. 5 7 TECHNOLOGY OPTIONS FLINK ON KUBERNETES AMAZON EKS (ELASTIC KUBERNETES SERVICE) SELF-MANAGED KUBERNETES ON EC2 DATA ARTISANS VERVERICA PLATFORM Bonus: Reduced Cost / Overhead by having fewer Master Nodes Less overhead for smaller (low parallelism) flows
  58. 58. THE REALLY HIGH VOLUME AND FEATURE STORE PROBLEM #2
  59. 59. 5 9 DATA POINT VOLUME MAY–NOVEMBER 2018
  60. 60. 6 0 DATA POINT VOLUME MAY–NOVEMBER 2018 OCTOBER 2018 – 1 BILLION
  61. 61. 6 1 DATA POINT VOLUME MAY–NOVEMBER 2018 OCTOBER 2018 – 1 BILLION NOVEMBER 2018 – 2 BILLION
  62. 62. 6 2 LET’S ADD ANOTHER 3 BILLION! 15 MILLION DEVICES 3 BILLION DATA POINTS
  63. 63. 6 3 SOLUTION APPROACH ISOLATE TRIGGER FOR HIGH- VOLUME DATA POINT TYPES ISOLATE ONLINE FEATURE STORE FOR HIGH- VOLUME DATA POINT TYPES ISOLATE DEDICATED FEATURE STORAGE FLOW TO HIGH-SPEED FEATURE STORE SEPARATE HISTORICAL FEATURE STORAGE (S3 WRITER) FROM REAL TIME MODEL EXECUTION FLOWS 2 3 4 1
  64. 64. 6 4 1: SEPARATE HIGH VOLUME TRIGGER FLOWS State Flow And Trigger FLINK LOCAL STATE HIGH VOLUME State / Trigger 2 HIGH VOLUME Data Point Producer 2 FLINK LOCAL STATE HIGH VOLUME State / Trigger 1 HIGH VOLUME Data Point Producer 1 FLINK LOCAL STATE ONLINE FEATURE STORE MODEL EXECUTION 3B IN 50M OUT
  65. 65. 6 5 2: FEDERATED ONLINE FEATURE STORE Feature REST Service (Queryable State Client) REQUESTING APPLICATION Online Feature Store FLINK LOCAL STATE Online Feature Store 2 HIGH VOLUME Data Point Producer 2 FLINK LOCAL STATE Online Feature Store 1 HIGH VOLUME Data Point Producer 1 FLINK LOCAL STATE FEDERATED FEATURE STORE Model/ Feature Metadata
  66. 66. 6 6 3+4: HISTORICAL FEATURE STORE Feature REST Service (Queryable State Client) REQUESTING APPLICATION Online Feature Store HIGH VOLUME Data Point Producer 1 FLINK LOCAL STATE Historical Feature Store (S3) HIGH VOLUME State / Trigger FLINK LOCAL STATE Feature Producer Feature Sink MODEL EXECUTION
  67. 67. 6 7 MODEL EXECUTION FEDERATED ONLINE FEATURE STORE State Flow And Trigger CURRENT ARCHITECTURE High Volume Feature Producer Feature Sink State Trigger Prediction / Outcome Store Prediction Store REST Service Feature Store REST Service Customer Experience Data Point Producers FLINK LOCAL STATE HIGH VOLUME Data Point Producer High Volume Feature Producer Feature Sink State Trigger HIGH VOLUME Data Point Producer Historical Feature Store Feature Sink FEATURE MANAGER Feature Manager Feature Assembly Model Execution Prediction Sink Feature Manager Feature Assembly
  68. 68. 6 8 WHERE ARE WE TODAY INDICATORS: 725+ MILLION DATA POINTS PER DAY HIGH-VOLUME INDICATORS: 6 BILLION PER DAY TOTAL: ~7 BILLION DATA POINTS PER DAY
  69. 69. 6 9 WHERE ARE WE TODAY – FLINK CLUSTERS CLUSTERS: 14 INSTANCES: 150 VCPU: 1100 RAM: 5.8 TB
  70. 70. 7 0 FUTURE WORK EXPAND CUSTOMER EXPERIENCE INDICATORS TO OTHER COMCAST PRODUCTS IMPROVED ML MODELS BASED ON RESULTS AND CONTINUOUS RETRAINING MODULAR MODEL EXECUTION VIA KUBEFLOW AND GRPC CALLS
  71. 71. 7 1 WE’RE HIRING! PHILADELPHIA WASHINGTON, D.C. MOUNTAIN VIEW DENVER
  72. 72. THANK YOU!
