Key Considerations in Productionizing Streaming Applications
Vikram Agrawal, Prateek Srivastava
Copyright 2018 © Qubole
Agenda
● Stream Processing Paradigm
● Deep-dive into Structured Streaming
● Productionizing Streaming Application
● Streaming Lens
Data Processing Architecture
○ Data is pushed into Flat files, HDFS or databases
○ ETL Batch jobs to process raw data for various end goals
Stream Processing
○ Message buses such as Kafka/Kinesis/RabbitMQ are part of the architecture
○ The business needs to process data in real time instead of in a nightly batch job
Stream Processing Use Cases
● Real-time transformations such as aggregation and deduplication
● Data enrichment using joins with another table or stream
● Ingest into a data lake (such as S3) for further processing or archival
● Ingest into a data warehouse (Redshift, ES) for ad hoc analysis
● Real-time dashboards/reporting (Druid, etc.)
● CEP rule processing or model scoring (fraud detection, etc.)
How to Choose the Streaming Engine
● SLAs and use cases
○ Latency
■ Ingestion/reporting use cases can tolerate a few seconds of latency
■ Model scoring has a tighter requirement (in ms)
○ Throughput
■ Current and future incoming data rate
○ Complexity of analytics
■ Real-time transformation requirements: joins and format conversion vs. filter and selection
● Community support
○ Technical skills required to adopt a new technology
● Production readiness
○ Time required to build a streaming application
○ Fault tolerance: exactly-once/at-least-once delivery guarantees
Why Spark Structured Streaming
● Latency
○ Micro-batch execution for sub-second to few-second latency is GA
○ Continuous execution for millisecond latency is in Beta
● Functionality
○ Built on top of the Spark DataFrame APIs and takes advantage of the SQL core engine's code and memory optimizations
○ Stream-stream joins, stream-batch joins, late data handling, sliding window aggregation, data format conversion, de-duplication, etc.
○ Connectors to and from various sources and sinks
○ Exactly-once/at-least-once semantics
● Throughput
○ Scalable and mature processing engine
○ Can easily handle tens of millions of records per second
● API abstractions
○ Developer friendly: interoperability between batch and streaming code
Structured Streaming
Spark’s Functionality
Structured Streaming - under the hood
Abstractions of Repeated Queries
• A data stream is treated as an unbounded table
• A streaming query is a batch-like operation on this table
• After each user-specified trigger interval, the query is repeated on the new records in the data stream
Micro Batch Model
1. The input data source provider (say, Kafka) determines the range of records for the batch.
2. Spark creates an optimized plan for the execution.
3. The plan is converted into tasks and executed by workers. The actual read from the source and the write to the final destination happen in the execution phase.
Stateless Streaming - Ingest in S3
Each micro-batch consists only of the new records in that batch.
Batch 1: records [1-4] → File 1
Batch 2: records [5-8] → File 2
Batch 3: records [9-10] → File 3
Stateful Streaming - Running Sum
Each micro-batch consists of the new input records plus the sum of the previous micro-batches, saved in a state store.
Batch 1: records [1-4] → State = 10
Batch 2: records [5-8] → State = 36
Batch 3: records [9-10] → State = 55
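The running-sum slide can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not Spark code: the function name `run_micro_batches` is hypothetical, and the "state store" is assumed to be a single in-memory integer rather than the fault-tolerant store Structured Streaming uses.

```python
from typing import Iterable, List, Tuple


def run_micro_batches(batches: Iterable[List[int]]) -> List[Tuple[int, int]]:
    """Process batches in order, carrying a running sum between them.

    Returns (batch_number, state_after_batch) pairs.
    """
    state = 0  # stand-in for the state store (assumption: in-memory only)
    history = []
    for batch_number, records in enumerate(batches, start=1):
        # Each micro-batch sees only its new records plus the saved state.
        state += sum(records)
        history.append((batch_number, state))
    return history


# Records 1-4, 5-8 and 9-10, as on the slide.
batches = [list(range(1, 5)), list(range(5, 9)), list(range(9, 11))]
history = run_micro_batches(batches)
for batch_number, state in history:
    print(f"Batch {batch_number}: State = {state}")
```

Dropping the `state` variable and emitting `sum(records)` per batch would give the stateless case, where each batch is independent of the ones before it.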
Productionizing Streaming Applications
● Ease of composition and experimentation
● Data Accuracy and Consistency
● Higher Performance
● Replay/Reprocess Data
● Lower TCO
● Optimized for faster downstream processing
● Portability
● Monitoring, Insights & Alerts
Problem Statement
● What is the right cluster configuration for my streaming job?
● The data ingestion rate is variable. How can I autoscale my cluster?
● How can I know if my streaming application is healthy?
● How should I partition my input data source?
● The time lag between the last processed event and the tip of the input stream is increasing. What can I do?
Spark Lens
● Performance tuning tool for Apache Spark
● Introduces the concept of the critical path of a Spark job to understand its scalability limit
● Open-sourced by Qubole
● https://github.com/qubole/sparklens
Spark Lens in Structured Streaming = Streaming Lens
● Batch Running Time: Actual time taken to process a micro-batch.
● Trigger Interval: Specified by the user when writing the streaming query. Can be treated as a proxy for the SLA.
● Critical Path Time: Time to complete the micro-batch if we had provisioned infinite executors.
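To make the difference between Batch Running Time and Critical Path Time concrete, here is a toy model. It rests on assumptions not stated in the slides: stages run strictly one after another, tasks within a stage are independent, each task goes to whichever core frees up first, and driver time is ignored. The function names are hypothetical, and the actual Sparklens computation is more detailed.

```python
import heapq
from typing import List

Stage = List[float]  # task durations (seconds) for one stage


def critical_path_time(stages: List[Stage]) -> float:
    """Time with unlimited cores: each stage lasts as long as its longest task."""
    return sum(max(tasks) for tasks in stages if tasks)


def estimated_running_time(stages: List[Stage], cores: int) -> float:
    """Time with a fixed number of cores, giving each task to the first free core."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    total = 0.0
    for tasks in stages:
        free_at = [0.0] * cores  # min-heap of times at which each core is free
        for duration in tasks:
            start = heapq.heappop(free_at)
            heapq.heappush(free_at, start + duration)
        total += max(free_at)  # the stage ends when its last task ends
    return total


# Hypothetical micro-batch: 16 tasks of 4 s, then 8 tasks of 2 s.
stages = [[4.0] * 16, [2.0] * 8]
print(critical_path_time(stages))          # 6.0
print(estimated_running_time(stages, 8))   # 10.0
print(estimated_running_time(stages, 16))  # 6.0
```

In this model, adding cores moves the running time toward the critical path time but never below it. Past that point only smaller tasks (more partitions) can help.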
Approach
● Sampling and analyzing some micro-batches at regular intervals can give a fair idea of the health of the streaming pipeline.
● Trigger Interval is a measure of the SLA that the pipeline is expected to meet. Batch Running Time should be safely lower than Trigger Interval.
● If Critical Path Time is safely lower than Trigger Interval, throwing more resources at the application can help meet the SLA specified by the Trigger Interval.
Trigger Interval vs batch processing time vs Critical Path
[Figure: batch processing time and critical path time compared against the trigger interval (SLA). Labeled zones: Under-utilized; Desired zone; Over-utilized, upscale to achieve SLA; Autoscale cannot help, repartition.]
StreamingLens Heuristic

| Condition I | Condition II | Pipeline State |
|---|---|---|
| Batch Running Time < 0.4 * Trigger Interval | - | OVERPROVISIONED or UNDER-UTILIZED |
| 0.4 * Trigger Interval < Batch Running Time < 0.8 * Trigger Interval | - | DESIRED |
| Batch Running Time > 0.8 * Trigger Interval | Critical Time < 0.7 * Trigger Interval | UNDER-PROVISIONED or OVER-UTILIZED |
| Batch Running Time > 0.8 * Trigger Interval | Critical Time >= 0.7 * Trigger Interval | UNHEALTHY |
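The heuristic can be written as a small decision function. This is a sketch, not the StreamingLens implementation: the function name and return strings are hypothetical, and because the slide leaves the exact boundary values (running time equal to 0.4 or 0.8 of the trigger interval) undefined, they are assumed here to count as DESIRED.

```python
def classify_pipeline(batch_running_time: float,
                      critical_path_time: float,
                      trigger_interval: float) -> str:
    """Map one sampled micro-batch to a pipeline state (all times in seconds)."""
    if trigger_interval <= 0:
        raise ValueError("trigger_interval must be positive")
    if batch_running_time < 0.4 * trigger_interval:
        return "OVERPROVISIONED"
    # Assumption: the boundaries 0.4x and 0.8x are treated as DESIRED.
    if batch_running_time <= 0.8 * trigger_interval:
        return "DESIRED"
    # The batch is too slow; check whether more executors could still help.
    if critical_path_time < 0.7 * trigger_interval:
        return "UNDER-PROVISIONED"
    return "UNHEALTHY"


# Hypothetical samples for a 60-second trigger interval.
print(classify_pipeline(10, 5, 60))   # OVERPROVISIONED
print(classify_pipeline(30, 10, 60))  # DESIRED
print(classify_pipeline(55, 20, 60))  # UNDER-PROVISIONED
print(classify_pipeline(55, 50, 60))  # UNHEALTHY
```

The critical path time is only consulted once the running time is above 0.8 of the trigger interval. That check separates the case where upscaling can work from the case where the job needs more parallelism.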
Pipeline State, Inference and Recommendations
● OVERPROVISIONED
○ Inference
■ Stream may be lagging due to inaccurately configured source properties or trigger interval.
■ Cluster may be over-provisioned.
○ Recommendations
■ If the stream is lagging, increase the load on the source by raising thresholds such as maxOffsetsPerTrigger (for Kafka) or maxFilesPerTrigger (for File Source).
■ Reduce the trigger interval if required.
■ If the stream is not lagging, downscale the cluster if required to reduce costs.
● DESIRED
○ Inference: -
○ Recommendations: -
● UNDER-PROVISIONED
○ Inference: Tasks are getting queued up. Increasing the number of tasks running in parallel can help meet the Trigger Interval.
○ Recommendations: Increase the number of executors.
● UNHEALTHY
○ Inference
■ Increasing executors won’t help.
■ Need to increase parallelism and create more tasks.
■ Possibility of skew.
○ Recommendations (depend on the source)
■ For a Kafka source, increase Kafka partitions.
■ For a Kinesis source, increase Kinesis shards.
■ If the query has aggregations, increasing shuffle partitions may help.
Experiments
Setup 1
● Query Operations: Aggregation based on timestamp
● Executors: Single 8-core executor
● Shuffle Partitions: 100
● Trigger Interval: 60 secs
● Rate: 5000 rows per second
Insight
Cluster is over-provisioned.
Recommendation
1. Downscale (if you can't reduce the number of executors, pick a lower-capacity machine) and/or
2. Reduce the trigger interval (get more real-time updates) and/or
3. Process more data (check your configs, increase the ingestion rate, etc.)
Next Step: Try increasing the input data rate
Setup 2
● Query Operations: Aggregation based on timestamp
● Executors: Single 8-core executor
● Shuffle Partitions: 100
● Trigger Interval: 60 secs
● Rate: 20000 rows per second
Insight
Cluster is under-provisioned, with a high risk of missing the SLA.
Recommendation
1. Upscale, or
2. Have smaller tasks (more partitions)
3. Process the same task in less time (pick a better machine)
Next Step: Increase the number of executors and the number of shuffle partitions
Setup 3
● Query Operations: Aggregation based on timestamp
● Executors: Three 8-core executors
● Shuffle Partitions: 200
● Trigger Interval: 60 secs
● Rate: 20000 rows per second
Next steps
● Open source StreamingLens
● Things to do
○ Incorporate “time lag” in our recommendation
○ Convert Recommendation → Action by implementing SLA-aware streaming autoscaling for better cost control
Contributions will be welcome
Other open source contributions
● Spark Lens - https://github.com/qubole/sparklens
● Kinesis Data Source - https://github.com/qubole/kinesis-sql
● S3-SQS Input Data Source for Better Performance - https://github.com/apache/bahir/pull/91
● RocksDB State Storage - https://github.com/itsvikramagr/rocksdb-state-storage
Thank you!
