"Stateful app as an efficient way to build dispatching for riders and drivers", Oleksandr Chumak

Fwdays
Uklon in numbers
● 130+ Engineers
● 12 Product Teams
● 16M Android/iOS downloads
● 1.5M+ Riders DAU
● 200k+ Drivers DAU
● 30+ microservices
● 3 Countries
● 30 Cities
Uklon
RiderApp DriverApp
What is the report about?
How to reduce CPU consumption by a factor of 10 through stateful processing while ensuring high reliability
What are the solutions employed by our competitors?
Agenda
● Workloads that make the stateless approach inefficient
● Basic concepts
● Scaling of stateful services
● Reliability of stateful services
Workloads that make the stateless approach inefficient
Several challenges within ride-hailing are:
1. Massive, frequent write operations are needed to track objects' current locations. Drivers can move as fast as 20 meters per second, so it is important to update drivers' locations roughly every second.
2. A K-nearest neighbour (kNN) query poses tremendous challenges, compared to a simple Get query, in a key-value data store such as Redis.
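To make the second challenge concrete: once driver positions live in the service's own memory, a kNN query is a local computation rather than a series of remote Get calls. The following is a minimal sketch, assuming a plain dict of positions and a brute-force scan. The names `haversine_m` and `nearest_drivers` and the sample coordinates are illustrative and do not come from the talk; a production system would likely use a spatial index instead of scanning.

```python
import heapq
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_drivers(locations, lat, lon, k):
    """Return the ids of the k drivers closest to (lat, lon), nearest first."""
    return heapq.nsmallest(
        k,
        locations,
        key=lambda driver_id: haversine_m(lat, lon, *locations[driver_id]),
    )


# Hypothetical in-memory state: driver id -> (latitude, longitude).
driver_locations = {
    101: (50.4501, 30.5234),
    102: (50.4547, 30.5238),
    103: (50.4016, 30.6200),
}

closest = nearest_drivers(driver_locations, 50.4500, 30.5230, k=2)
```

With a remote key-value store, the same query would need either a server-side geospatial command or one Get per candidate driver.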
Feature #1
Orders Dispatching
Find the best driver for the order
Feature #2
Orders Broadcasting
Streaming your order to many drivers
DriverApp
Feature #3
Batch dispatching
The Process of Order Dispatching with Batch Windows
● Greedy algorithm: 2 min + 9 min, total wait time = 11 min
● Batching algorithm: 4 min + 4 min, total wait time = 8 min
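The difference between the two totals can be reproduced with a tiny assignment problem. This is a sketch under assumptions: the ETA matrix below is hypothetical, chosen only so that the results match the 11 min and 8 min totals on the slide, and the brute-force search stands in for whatever optimization the real dispatcher uses.

```python
from itertools import permutations

# Hypothetical pickup ETAs in minutes: eta[rider][driver].
# Chosen so the totals match the slide (11 min greedy, 8 min batched).
eta = [
    [2, 4],  # rider 0: driver 0 is 2 min away, driver 1 is 4 min away
    [4, 9],  # rider 1: driver 0 is 4 min away, driver 1 is 9 min away
]


def greedy_assign(eta):
    """Serve riders in arrival order; each takes the nearest free driver."""
    free = set(range(len(eta[0])))
    assignment = []
    for rider_etas in eta:
        driver = min(free, key=lambda d: rider_etas[d])
        free.remove(driver)
        assignment.append(driver)
    return assignment


def batch_assign(eta):
    """Wait for the whole batch, then pick the assignment with minimal total ETA.

    Brute force over permutations: fine for a tiny batch, while a real
    dispatcher would need a proper assignment algorithm.
    """
    riders = range(len(eta))
    best = min(
        permutations(range(len(eta[0])), len(eta)),
        key=lambda perm: sum(eta[r][perm[r]] for r in riders),
    )
    return list(best)


def total_wait(eta, assignment):
    """Sum of pickup ETAs for an assignment given as driver index per rider."""
    return sum(eta[r][d] for r, d in enumerate(assignment))


greedy_total = total_wait(eta, greedy_assign(eta))  # 2 + 9
batch_total = total_wait(eta, batch_assign(eta))    # 4 + 4
```

The greedy pass gives the first rider the closest driver and leaves the second rider with a distant one; batching accepts a slightly longer wait for the first rider to lower the total.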
Feature #4
Driver ETA Tracker
Requirements:
1. Active Orders = tens of thousands
2. Drivers send their location every 2-5 seconds
Other Workloads
1. Order offers. Find the best driver near you.
2. Order broadcasts. Fan-out orders to multiple drivers.
3. Order chaining. Find the next order for the driver while they are completing the current one.
4. Order batching (optimization). Reduce the total waiting time for all passengers.
5. Sector queue (airports, train stations).
6. Driver ETA tracking for accepted orders.
7. Matching the driver's GPS location to a map graph node.
Simplified Overview of the Architecture
Stateful
Stateful architectures: Open Problems
● Load balancing algorithms
● Scalability
○ Partitioning
○ Replication
● Fault tolerance and Cold start
Key concept
1. Local state is stored in in-memory KV structures.
2. The local state is restored from the durable log. In some cases, local state changes may have been checkpointed to a remote KV store (or to a separate Kafka topic).
3. Local state updates occur within a single thread: no concurrency, Monotonic Writes.
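The three points above can be sketched in a few lines. This is an illustrative model, not the talk's implementation: `LogRecord` and `LocalState` are hypothetical names, the log is a plain list standing in for a Kafka partition, and the single-threaded rule is expressed simply by having one object apply records in order.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogRecord:
    offset: int   # position in the durable log (e.g. a Kafka partition offset)
    key: str      # entity id, e.g. a driver id
    value: dict   # new state for that key


@dataclass
class LocalState:
    """In-memory KV state owned by exactly one processing thread."""
    store: dict = field(default_factory=dict)
    last_offset: int = -1

    def apply(self, record: LogRecord) -> bool:
        # Monotonic writes: skip anything at or before the last applied offset,
        # so replays and duplicate deliveries cannot move the state backwards.
        if record.offset <= self.last_offset:
            return False
        self.store[record.key] = record.value
        self.last_offset = record.offset
        return True

    @classmethod
    def restore(cls, log):
        """Rebuild the state from scratch by replaying the log in order."""
        state = cls()
        for record in log:
            state.apply(record)
        return state


# Hypothetical log contents for illustration.
log = [
    LogRecord(0, "driver-1", {"status": "online"}),
    LogRecord(1, "driver-2", {"status": "online"}),
    LogRecord(2, "driver-1", {"status": "busy"}),
]
state = LocalState.restore(log)
```

Because the store is touched by one thread only, no locks are needed, and the offset check is enough to keep writes ordered.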
NFR (Kyiv only)
Writes
1.1) 5000-10000 rps
1.2) 100-500 rps
Reads
2.1) 500 rps (handle 100-500 drivers per request)
2.2) fetch 50000-200000 rows/sec (100-400 MB/sec)
Driver entity size: 2 KB (50th percentile) / 13 KB (99th percentile)
Total size for 100K entities = 200 MB
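The sizing figures above are consistent with each other, which a short back-of-the-envelope check shows. The only assumptions here are decimal units (1 KB = 1000 bytes) and the use of the median entity size throughout.

```python
KB = 1_000          # decimal units, matching the slide's round numbers
MB = 1_000 * KB

median_entity_bytes = 2 * KB     # driver entity, 50th percentile
entities = 100_000

# Memory needed to hold every driver entity at the median size.
total_state_mb = entities * median_entity_bytes / MB

# Read throughput implied by fetching 50,000-200,000 rows per second.
read_mb_per_sec = [rows * median_entity_bytes / MB for rows in (50_000, 200_000)]
```

A state of about 200 MB fits comfortably in the memory of a single process, which is what makes the in-memory approach practical at this scale.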
Key differences
Stateless (remote KV)
● Provides a GET/PUT/DELETE API
● High CPU cost due to marshalling and serialization
● Additional network latency
● Frequently necessitates additional local caching
Stateful (in-memory/local KV)
● Domain-specific API. Ex:
○ Find nearest drivers
○ Calculate ETA
● Data locality
● Shared-nothing
Access patterns for In-memory KV
1. Key lookup
2. Index seek (Offers, Broadcast)
3. All scans / Range scans
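The three access patterns can be illustrated with a toy store. This is a sketch under assumptions: `DriverStore`, the `city_id` secondary index, and the range scan over driver ids are all hypothetical choices made for the example, not structures described in the talk.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class DriverStore:
    """Toy in-memory KV store with one secondary index and a sorted key list."""

    def __init__(self):
        self.by_id = {}                    # primary: driver_id -> record
        self.by_city = defaultdict(set)    # secondary index: city_id -> driver ids
        self.sorted_ids = []               # kept sorted to support range scans

    def put(self, driver_id, record):
        old = self.by_id.get(driver_id)
        if old is None:
            self.sorted_ids.insert(bisect_left(self.sorted_ids, driver_id), driver_id)
        else:
            # Keep the secondary index in sync when a record changes city.
            self.by_city[old["city_id"]].discard(driver_id)
        self.by_id[driver_id] = record
        self.by_city[record["city_id"]].add(driver_id)

    def get(self, driver_id):
        """1. Key lookup."""
        return self.by_id.get(driver_id)

    def in_city(self, city_id):
        """2. Index seek: all drivers in one city, without scanning the rest."""
        return sorted(self.by_city.get(city_id, ()))

    def id_range(self, low, high):
        """3. Range scan over driver ids in [low, high]."""
        lo = bisect_left(self.sorted_ids, low)
        hi = bisect_right(self.sorted_ids, high)
        return self.sorted_ids[lo:hi]


store = DriverStore()
store.put(12, {"city_id": "kyiv"})
store.put(7, {"city_id": "lviv"})
store.put(30, {"city_id": "kyiv"})
```

All three operations work on native objects, so no serialization or network round trip is involved.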
Concept #1: Co-partitioning
Two topics are described as co-partitioned if:
1. Their keys have the same schema
2. They are materialized by topics with the same number of partitions
3. Their producers use the same partitioner
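A small simulation shows why all three conditions matter. This is a sketch, assuming a CRC32-based stand-in partitioner (Kafka's own default partitioner uses a different hash) and hypothetical topic contents; the point is only that the same key, the same function, and the same partition count always yield the same partition number.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic stand-in partitioner (CRC32 of the key, modulo partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 8  # both hypothetical topics are created with the same count

driver_ids = [f"driver-{n}" for n in range(1000)]

# Producer for topic A (e.g. driver locations) and producer for topic B
# (e.g. driver states) use the same key schema and the same function.
locations_partition = {d: partition_for(d, NUM_PARTITIONS) for d in driver_ids}
states_partition = {d: partition_for(d, NUM_PARTITIONS) for d in driver_ids}

co_partitioned = locations_partition == states_partition

# Break condition 2: a topic with a different partition count routes at least
# some of the same keys to different partition numbers.
mismatched = {d: partition_for(d, 12) for d in driver_ids}
keys_that_moved = sum(1 for d in driver_ids if mismatched[d] != locations_partition[d])
```

When the topics are co-partitioned, a consumer that owns partition N of both topics sees every event for its drivers locally and never needs a cross-partition lookup.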
Concept #2: Re-keying partitions
Before re-keying:
● Related events are not co-partitioned
● Partitions are well balanced
After re-keying:
● Partitions, and as a result consumers, can become unbalanced
● Data locality is achieved for the consumer
Concept #3: Filtering + Enriching
DriverLocation {
"driver": 12345,
"latitude": 50.30846,
"longitude": 30.53419
}
DriverETA {
"driver": 12345,
"latitude": 50.30846,
"longitude": 30.53419,
"order": 98765,
"eta": "2 min"
}
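The step from DriverLocation to DriverETA can be sketched as one function that both filters and enriches. Everything beyond the two message shapes is an assumption: `active_orders`, `estimate_eta`, and `to_driver_eta` are hypothetical names, and the ETA estimator is a placeholder that returns a fixed value.

```python
# Hypothetical lookup of accepted orders: driver id -> order id.
active_orders = {12345: 98765}


def estimate_eta(latitude: float, longitude: float, order_id: int) -> str:
    """Placeholder ETA estimator; a real one would use routing and traffic data."""
    return "2 min"


def to_driver_eta(location: dict):
    """Filter: drop drivers with no accepted order. Enrich: add order and ETA."""
    order_id = active_orders.get(location["driver"])
    if order_id is None:
        return None  # filtered out
    return {
        **location,
        "order": order_id,
        "eta": estimate_eta(location["latitude"], location["longitude"], order_id),
    }


stream = [
    {"driver": 12345, "latitude": 50.30846, "longitude": 30.53419},
    {"driver": 55555, "latitude": 50.45010, "longitude": 30.52340},  # no order
]
driver_etas = [eta for eta in map(to_driver_eta, stream) if eta is not None]
```

Filtering early keeps the output stream much smaller than the raw location stream, since only drivers with an accepted order produce a DriverETA message.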
How to scale?
[Diagram: four Driver Dispatching instances]
Scalability
Some sharding strategies
1. Geospatial indexing (geohash, S2, H3)
2. city_id (region)
Consider the following points when you design a data partitioning scheme:
1. Minimize cross-partition data access operations
2. Minimize cross-partition joins
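Both strategies come down to choosing a partition key. As a sketch, the block below implements standard geohash encoding for the first strategy and a trivial region key for the second. The function names and the precision of 4 characters are assumptions made for illustration; S2 and H3 would need their own libraries.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate longitude, latitude, longitude, ...
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid   # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def shard_key_by_cell(lat: float, lon: float) -> str:
    """Strategy 1: partition by geospatial cell (the precision is an assumption)."""
    return geohash_encode(lat, lon, precision=4)


def shard_key_by_city(city_id: str) -> str:
    """Strategy 2: partition by region."""
    return city_id
```

Cell-based keys spread a large city over many partitions but make nearest-driver queries cross cell borders; a city key keeps each query inside one partition at the cost of uneven load between cities.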
Partitioning by Region
Possible challenges:
● Downtime during rebalance: scale-out, rolling update
● Unbalanced load: the load from Kyiv is equivalent to the load from all other cities of Ukraine combined
Try to fix: Partitioning by Region + Replication
Replication:
● Standalone consumers
● No partition rebalancing
● No downtime
● Replication overhead is less than 0.1 CPU per pod
● Reduced requirements for cold recovery
Scalability
1. Scalability: adding Kafka partitions and deploying separate Shard-Instances for cities/countries
2. Elasticity: scale-out of consumers within a Shard
Reliability?
Replica synchronization
● State-based CRDT
● Last write wins (LWW)
● Optimistic replication (can become temporarily inconsistent)
● Strong Eventual Consistency (SEC)
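A last-write-wins map is one of the simplest state-based CRDTs and shows how replicas can converge without coordination. This is a generic sketch, not the talk's implementation: the `(timestamp, replica_id, value)` entry layout and the tie-break on replica id are assumptions.

```python
def lww_merge(a: dict, b: dict) -> dict:
    """Merge two replicas of an LWW map: key -> (timestamp, replica_id, value).

    For each key the entry with the larger (timestamp, replica_id) wins;
    replica_id only breaks ties so that every replica picks the same winner.
    """
    merged = dict(a)
    for key, entry in b.items():
        current = merged.get(key)
        if current is None or entry[:2] > current[:2]:
            merged[key] = entry
    return merged


# Two hypothetical replicas that saw different updates.
replica_a = {
    "driver-1": (10, "a", {"status": "online"}),
    "driver-2": (12, "a", {"status": "busy"}),
}
replica_b = {
    "driver-1": (11, "b", {"status": "busy"}),
    "driver-3": (9, "b", {"status": "online"}),
}

ab = lww_merge(replica_a, replica_b)
ba = lww_merge(replica_b, replica_a)
```

The merge is commutative and idempotent, so replicas that exchange state in any order, any number of times, end up identical. That property is what Strong Eventual Consistency refers to.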
Problems with Replication Lag?
Depends on your Domain
● Reading Your Own Writes
● Monotonic Reads
● Consistent Prefix Reads
Fault tolerance with local state
1. Single infrastructure dependency: Kafka (a battle-tested streaming platform with high throughput, fault tolerance, and scalability).
2. When a task instance restarts, its local state is repopulated by reading its own Kafka log.
3. Yes, reading and repopulating will take some time.
Controlling State Size. How long does it take to rebuild the state?
1. Key-Based Retention
a. Aggressive topic compaction
b. Tombstones
2. Time-Based Retention
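Key-based retention keeps the log, and therefore the rebuild time, proportional to the number of live keys rather than to the number of updates. The sketch below is a simplified model of compaction with tombstones, not Kafka's actual compaction process: the log is a list of pairs, and `None` stands for a tombstone.

```python
def compact(log):
    """Keep only the latest record per key; drop keys whose latest value is None.

    `log` is a list of (key, value) pairs in offset order. A value of None
    plays the role of a tombstone, marking the key as deleted.
    """
    latest = {}
    for key, value in log:
        latest.pop(key, None)   # re-insert so dict order follows the latest offset
        latest[key] = value
    return [(key, value) for key, value in latest.items() if value is not None]


# Hypothetical driver-state log.
log = [
    ("driver-1", "online"),
    ("driver-2", "online"),
    ("driver-1", "busy"),
    ("driver-2", None),      # tombstone: driver-2 went offline
    ("driver-3", "online"),
]
compacted = compact(log)
```

Five records shrink to two here; a restarting instance only has to replay the compacted form to reach the same state.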
How long does it take to rebuild the state?
1. Driver state retention: 1 hour
2. Repopulate local state:
a. Read driver-state from the beginning of the topic: 400k msg (8 partitions)
b. Read driver-locations starting from 'now - 5 sec'
3. You need to implement your own event for "live processing started"
"Live processing started "dispatching.driver-summary-events [0]" after 00:00:01.7875633 sec (50142 msgs)"
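One way to produce a "live processing started" signal is to record the partition's end offset at startup and report once the replay reaches it. This is a sketch under assumptions: the function name, the tuple layout, and the in-memory list standing in for a Kafka partition are all hypothetical, and the talk does not say how its own event is implemented.

```python
def replay_until_live(records, end_offset, apply):
    """Apply records in order and report when the backlog has been consumed.

    `end_offset` is the offset of the next record to be written, captured
    once at startup. Returns the number of records applied before the
    consumer caught up, or None if it has not caught up yet.
    """
    if end_offset == 0:
        return 0  # empty partition: nothing to replay, live immediately
    applied = 0
    for offset, key, value in records:
        apply(key, value)
        applied += 1
        if offset + 1 >= end_offset:
            # This is the point where "live processing started" would be
            # emitted; a real consumer would keep reading after it.
            return applied
    return None


# Hypothetical partition content: (offset, key, value).
partition = [(0, "driver-1", "online"), (1, "driver-2", "online"), (2, "driver-1", "busy")]
state = {}
caught_up_after = replay_until_live(partition, end_offset=3, apply=state.__setitem__)
```

Until that point is reached the instance's state is incomplete, so it should not serve reads; the signal marks the moment it can join live traffic.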
An SLA level of 99.998% uptime/availability results in the following periods of allowed downtime/unavailability:
■ Daily: 1.7s
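The daily figure follows directly from the availability percentage. A short check, where the helper name is an assumption:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_seconds: int) -> float:
    """Downtime budget for a given availability target over a period."""
    return period_seconds * (1 - availability_percent / 100)


daily_budget = allowed_downtime_seconds(99.998, SECONDS_PER_DAY)
```

The result is about 1.73 seconds per day, which the slide rounds to 1.7s.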
Traffic Jams requirements
1. Reduce the cost of the Google Maps API
2. High rate of writes (20k online drivers)
3. Update traffic information every 5 min
Stateful processing
● Grouping messages by partition key
● Aggregating messages in hopping window
● MapReduce
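The three bullets above can be combined into one small aggregation. This is a sketch under assumptions: the event shape (timestamp, road segment, speed), the 10-minute window advancing every 5 minutes, and the function names are illustrative, and the window size is assumed to be a multiple of the hop.

```python
from collections import defaultdict


def hopping_window_starts(ts: int, size: int, hop: int):
    """Start times of every window [start, start + size) that contains ts.

    Assumes size is a multiple of hop; windows before time 0 are not used.
    """
    last_start = ts - ts % hop                 # latest window starting at or before ts
    first_start = max(0, last_start - size + hop)
    return range(first_start, last_start + 1, hop)


def average_speed_per_window(events, size=600, hop=300):
    """Group by (partition key, window) and reduce to an average speed.

    `events` holds (timestamp_sec, road_segment, speed_kmh) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])       # (segment, window_start) -> [sum, count]
    for ts, segment, speed in events:          # "map": assign each event to its windows
        for start in hopping_window_starts(ts, size, hop):
            acc = sums[(segment, start)]
            acc[0] += speed
            acc[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}  # "reduce"


# Hypothetical driver speed reports on one road segment.
events = [(10, "seg-1", 40.0), (320, "seg-1", 20.0), (650, "seg-1", 10.0)]
averages = average_speed_per_window(events)
```

Because windows overlap, each report contributes to more than one window, which smooths the per-segment estimate between updates.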
Driver ETA Tracker
Similar workload using Redis
https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/?utm_source=pocket_saves
○ Client: c5.4xlarge (16 vCPUs, 32 GiB)
○ Redis: 3 nodes r6g.2xlarge (8 vCPUs, 64 GiB)
Resource Usage
Future work
Although the current design is simple, it allows flexibility to change key aspects:
○ Replication + Sharding
Lessons learned
1. Stateful is not always difficult
2. Simple and reliable solution
3. Easy to maintain
4. Much more efficient in terms of resources: 2 vCPUs for all dispatching, instead of a Redis cluster with 16-24 vCPUs
5. What about MS Orleans?
The Twelve-Factor App
Misleading
Space-based architecture?
https://www.amazon.com/_/dp/1492043451?smid=ATVPDKIKX0DER&_encoding=UTF8&tag=oreilly20-20
Contacts
Oleksandr Chumak
Solution Architect
https://www.linkedin.com/in/oleksandr-chumak-45967588/
facebook.com/achumak.dev