Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix using Flink - Snehal Nagmote & Pallavi Phadnis

Over 137 million members worldwide enjoy TV series and feature films across a wide variety of genres and languages on Netflix, generating user behavior data at petabyte scale. At Netflix, our client logging platform collects and processes this data to power recommendations, personalization, and many other services that enhance the user experience. Built with Apache Flink, this platform processes hundreds of billions of events and a petabyte of data per day, at 2.5 million events/sec, with sub-millisecond latency. The processing involves a series of data transformations, such as decryption and enrichment with customer, geo, and device information using microservice-based lookups.

The transformed and enriched data is further used by multiple data consumers for a variety of applications, such as improving the user experience with A/B tests, tracking application performance metrics, and tuning algorithms. This causes redundant reads of the dataset by multiple batch jobs and incurs heavy processing costs. To avoid this, we have developed a config-driven, centralized, managed platform on top of Apache Flink that reads this data once and routes it to multiple streams based on dynamic configuration. This has resulted in improved computation efficiency, reduced costs, and reduced operational overhead.

Stream processing at scale, while ensuring that the production systems are scalable and cost-efficient, brings interesting challenges. In this talk, we will share how we leverage Apache Flink to achieve this, the challenges we faced, and our learnings from running one of the largest Flink applications at Netflix.

  1. Massive Scale Data Processing. Pallavi Phadnis, Snehal Nagmote. Flink Forward SF 2019
  2. Agenda: ● Consolidated Logging (CL) Overview ● High-Level Architecture of the CL Platform ● Log Processing at Scale ● Event Extractor Use Case ● Monitoring and Alerting ● Impact of the Flink-Based Platform
  3. Consolidated Logging (CL)
  4. Consolidated Logging: Build an integrated solution to provide insights into user behavior and application performance metrics through client-side logging.
  5. Use Cases Powered By CL: ● Personalization ● Recommendations ● A/B Experimentation ● Application Performance
  6. Consolidated Logging: 300+ event types (Log, ProfileIdentify, Presented, NavigationLevel, Focus, Play, ...) x 10+ device platforms / app versions (TVUI, Android, iOS, Web, ...) = 100s of billions of log events / day and 1+ petabyte of user behavior data per day
  7. CL Platform (architecture diagram): Netflix apps and other internal/external apps on iOS, Android, TVUI, and WWW send events (CL schema / app schema) to the Consolidated Logging platform, which performs data collection, data transformation, event extraction, and data enrichment (user info, geo/device info) and feeds platform features such as site usage, app navigations, shows watched, and user sessionization through real-time and batch data sinks.
  8. Legacy Pipeline vs. Flink-Based Platform (diagram). Flink-based platform: Landing Service, Kafka, CL App, Event Extractor, Kafka streams, Elasticsearch, Hive tables. Legacy pipeline: Landing Service, SQS, Log Processing Server, Kafka, CL Streaming App, CL ETL, CL DW (Hive), Kafka, CL Router App, 13 Keystone routes, S3.
  9. CL App
  10. CL App Features: ● Generic log processing application that supports different logging specifications ● Real-time processing ○ Data transformations ○ Data enrichment: membership information, geo, device type ■ Joins ● Single source of truth with unified output schema ● Supports different data sinks: Kafka/Hive ● SLA ○ RPS: 3.5 million events per sec at peak; latency: < 3 ms
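The transform-and-enrich shape described on this slide can be sketched with Flink's DataStream API. The snippet below is a minimal illustration, not Netflix's code: the event format, the in-memory device catalog, and the class names are hypothetical stand-ins for the real Kafka sources, micro-service lookups, and unified-schema sinks.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.HashMap;
import java.util.Map;

public class ClEnrichmentSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production this would be a Kafka consumer on the raw CL topic.
        DataStream<String> rawEvents = env.fromElements("deviceA:Play", "deviceB:Focus");

        DataStream<String> enriched = rawEvents.map(new RichMapFunction<String, String>() {
            private transient Map<String, String> deviceCatalog; // stand-in for a service lookup

            @Override
            public void open(Configuration parameters) {
                deviceCatalog = new HashMap<>();
                deviceCatalog.put("deviceA", "TVUI");
                deviceCatalog.put("deviceB", "Android");
            }

            @Override
            public String map(String event) {
                String deviceId = event.split(":")[0];
                // Enrichment: attach device information before writing to the sink.
                return event + ",deviceType=" + deviceCatalog.getOrDefault(deviceId, "unknown");
            }
        });

        // In production this would be a Kafka or Hive/Iceberg sink with the unified schema.
        enriched.print();
        env.execute("cl-enrichment-sketch");
    }
}
```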
  11. CL App Design: ● Stateless Flink application (Flink 1.4, Kafka 1.1) ○ At-least-once processing ● Isolation of concerns through separate Flink jobs for different use cases/sink types ● Different job DAGs with a common framework library: fan-in / fan-out
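For the at-least-once guarantee mentioned above, a Flink job enables checkpointing in AT_LEAST_ONCE mode. A minimal sketch follows; the 60-second interval is illustrative, not the production value.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // At-least-once avoids the barrier-alignment cost of exactly-once;
        // the stateless job tolerates occasional duplicate events downstream.
        env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);
    }
}
```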
  12. Common Log Processing Framework (diagram): components include a Log Consumer, Config Reader (FP), Spec Parser, Data Enrichment, Data Transformations, and Data Sinks. Raw events arrive from Kafka (CL schema / app schema, segregated into sources by request type & version); processed events go to multiple sinks (Kafka, Hive / Iceberg), and raw events are backed up to Hive with data partitioning.
  13. Learnings & Best Practices: ● Embarrassingly parallel job (parallelism over 2000) ○ Uniform CPU utilization with a high number of partitions on the source Kafka topic ● High memory pressure and GC pauses on the JobManager led to recovery failures and restart loops ○ Memory leak in archiving execution history (FLINK-10066) ○ Scaling bottleneck of the Kafka source's union state (FLINK-10122) ● Overwhelmed coordinator due to a thundering-herd problem at high parallelism (KIP-266)
  14. Data compression: a factor to consider
  15. Learnings & Best Practices: ● Data compression ratio was ~4x worse for Parquet and Kafka ○ Batching differences in the upstream Kafka producer increased data entropy ● A backlog in Kafka can lead to sudden load on external micro-services ● Kafka backpressure leads to task failures ○ Duplicate events ● Guice dependency injection conflicts with Flink ○ Resolved with classloader.resolve-order=parent-first
  16. Event Extractor
  17. Event Extractor Use Case (diagram): the transformed and enriched CL stream (user clicks, user searches, app perf metrics, impressions) is split into per-consumer streams (personalization, search, impressions, experimentation) feeding the Personalization, Search, Impressions, A/B Experimentation, and Consumer Insights pipelines, plus exploratory analysis and a customer service tool.
  18. Keystone Routes For CL
  19. Problems with the CL Legacy Pipeline: ● Growth/scale ○ 3.5 million events/sec ○ Reading the same data multiple times ■ Compute redundancy ■ Scaling Kafka infrastructure for outgoing bytes ■ Operational overhead ● High compute and operational cost
  20. Keystone Routes For CL Event Extractor
  21. What is Event Extractor? ● A single stateless Flink application ● Reads data once, applies processing, and routes it to multiple streams ● Configuration-driven processing, without code changes ● SQL support on the stream ● Filter, transformation, and projection support on the stream ● Out-of-the-box metrics for users
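One common way to express "read once, route to many" in Flink is with side outputs. The sketch below is only an illustration under that assumption: the route names, matching logic, and in-line test data are hypothetical, whereas the real Event Extractor derives its filters, transformations, and projections from the user YAML configs described on the next slide.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class EventExtractorSketch {
    // One tag per user-configured route (hypothetical route names).
    static final OutputTag<String> SEARCH = new OutputTag<String>("search-route") {};
    static final OutputTag<String> IMPRESSIONS = new OutputTag<String>("impressions-route") {};

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the enriched CL stream read once from Kafka.
        DataStream<String> clStream = env.fromElements("Search:q1", "Presented:token1");

        // Evaluate each route's filter on every event and emit matches to that
        // route's side output; the main output keeps the full stream.
        SingleOutputStreamOperator<String> routed =
            clStream.process(new ProcessFunction<String, String>() {
                @Override
                public void processElement(String event, Context ctx, Collector<String> out) {
                    if (event.startsWith("Search")) {
                        ctx.output(SEARCH, event);
                    }
                    if (event.startsWith("Presented")) {
                        ctx.output(IMPRESSIONS, event);
                    }
                    out.collect(event);
                }
            });

        // Each side output would feed its own Kafka, Hive, or Elasticsearch sink.
        routed.getSideOutput(SEARCH).print();
        routed.getSideOutput(IMPRESSIONS).print();
        env.execute("event-extractor-sketch");
    }
}
```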
  22. Event Extractor User Interface: ● User configuration in YAML ● Configs are managed in version control and updated in S3 ● Example config:
     filterExpression: field1 = 'Presented' and field2 like '%impressionToken%' and field3 not like '%storyArt%'
     projectionExpression: field_name1, field_name2, field_name3, field_name5
     transformations: {OutputFieldName: inner_field, fieldName: top_level_field, nestedFieldName: inner_field, type: type}
     sinkDetails: {sinkType: kafka, name: topic_name}
     ownerName: email-address
     routeName: unique_name
  23. Event Extractor Design (diagram): user configs arrive via S3 through a user config management pipeline; inside the Event Extractor, a Config Reader, Config Parser, and SQL Parser drive the Filter Function, Transformation, Projection, and Schema Builder over the CL enriched stream, writing to Elasticsearch, Hive, and multiple Kafka sinks.
  24. Challenges with Event Extractor: ● Scaling a single Flink application ● Lack of isolation ○ Mitigated by isolating per type of sink the application writes to ○ One deployment per sink type (Kafka, Hive, Elasticsearch) ● Backpressure is shared between multiple consumers ○ Consumer Kafka topics are created in the same cluster ○ Canaries and testing before onboarding a new config
  25. Learnings and Best Practices: ● Buildup of network pressure caused S3 checkpoint failures due to socket timeouts ○ The job goes into a restart loop due to the high frequency of checkpoint failures ○ Mitigated with better G1GC tuning and increased S3 timeouts ● Tuning parallelism to avoid unbalanced CPU utilization ○ Extensive CPU flame graphs and system metrics to identify bottlenecks ○ Setting parallelism in multiples of Kafka partitions and task slots to achieve better CPU utilization
  26. Learnings and Best Practices: ● The Flink Kafka consumer needs a continuous stream to progress the high watermark (FLINK-5479) ○ The StickyPartitioner producer skips producing data to out-of-sync partitions ○ Setting stickyPartitioner.minQualifiedIsrRatio=1.0 helps produce data to out-of-sync partitions ● Outlier container/broker (due to bad hardware) ○ The consumer sees a non-linear traffic pattern (stuck-consumer alert) ○ The producer throws a BatchExpiredTimeout exception and checkpoint failures increase
  27. Deployment: ● Keystone (self-serve UI) for deployment of streaming apps ○ Out-of-the-box ELK stack support for application logs ○ Automated alerts integration with Atlas ● Deployment strategy ○ Minimize duplicates; checkpoints are stored in S3 ● Restart strategy ○ Fine-grained recovery
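A hedged sketch of the checkpoint and restart settings above, expressed through Flink's job-level API: the S3 bucket and retry values are illustrative, and fine-grained (region) failover is a cluster-side setting rather than job code.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.concurrent.TimeUnit;

public class DeploymentConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoints kept in S3 so a redeploy can resume with minimal duplicates
        // (bucket/path is a placeholder).
        env.setStateBackend(new FsStateBackend("s3://example-bucket/cl-app/checkpoints"));
        // Bounded restart strategy instead of an endless restart loop (example values).
        env.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
    }
}
```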
  28. Monitoring and Alerting
  29. Monitoring and Alerting
  30. CL Platform Benefits: ● Improved data processing: can handle large payloads compared to the legacy pipeline; improved error handling ● Reduced data loss: reduced points of failure; ability to backfill or reprocess historic raw events ● Reduced cost & operational overhead: legacy tables decommissioned and reduced storage redundancy; read once and route to different sinks through the Event Extractor ● Single source of truth: single source of truth (SSOT) for CL data in the data warehouse; schema consistency across CL components and tools
  31. Thank you.
