STREAM PROCESSING IN UBER MARKETPLACE
~ 68 countries / 350+ cities
Transportation as reliable as running
water, everywhere, for everyone
Agenda
What’s on the menu?
•Use Cases
•Problem Space
•Overall Architecture
•Choices & Tradeoffs
•Q & A
Use Case: Realtime OLAP
There is always a need for quick exploration
How many open cars are there in the world, NOW?
How many UberXs were driving clients in SF in the past 10
minutes by hexagons?
Driving time and other metrics over time by hexagonal area
Use Case: Complex Event Processing
There are patterns in event streams
How many drivers cancel requests more than 3 times in a row within a 10-minute window?
Report riders requesting pickups 100 miles apart within a half-hour window.
IF
This —>
Then that —>
● Sigma is similar - but for offline/batch applications
Complex Event Processing
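The first pattern query above (drivers who cancel more than 3 times in a row within a 10-minute window) can be sketched in a few lines. This is only an illustration, not the pipeline described in the talk: the event tuple layout, the `"cancel"` event name, and the function name are all assumptions made for the example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60   # the 10-minute window from the slide
MIN_STREAK = 4             # "more than 3 times in a row"

def drivers_with_cancel_streaks(events, window=WINDOW_SECONDS, min_streak=MIN_STREAK):
    """Return the drivers whose consecutive cancels reach min_streak within the window.

    events: iterable of (timestamp_seconds, driver_id, event_type) tuples
    (a hypothetical layout), ordered by time for each driver.
    """
    runs = defaultdict(deque)  # driver_id -> timestamps of the current run of cancels
    flagged = set()
    for ts, driver, kind in events:
        run = runs[driver]
        if kind != "cancel":
            run.clear()        # any other event breaks the "in a row" condition
            continue
        run.append(ts)
        # Drop cancels that have fallen out of the sliding window.
        while ts - run[0] > window:
            run.popleft()
        if len(run) >= min_streak:
            flagged.add(driver)
    return flagged
```

Keeping only the current run per driver keeps the state small, which matters when such a rule is evaluated over a high-volume stream.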
Use Case: Supply Positioning
Clusters Of Supply & Demand
Predicted Health Metrics
Actual Health Metrics
Monitor Marketplace Health
Challenges
OLAP of Geo-spatial Temporal Data
Reasonably Large Scale
Near Real Time
• Indexing, Lookup, Rendering
• Symmetric Neighbors
• Convex & Compact Regions
• Equal Areas
• Equal Shape
Hexagons
Scale
Geo Space × Vehicle Types × Time × Status
Granular Geo Areas
Over 10,000 hexagons in a city
Multiple Vehicle Types
7 vehicle types
Minute-level Time Buckets
1440 minutes in a day
Many Driver States
13 driver states
Many Cities
300 cities
Granular Data
1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion
possible combinations
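The figure above is simply the product of the dimension sizes listed on the preceding slides. A short check, with dictionary keys chosen here only for readability:

```python
from math import prod

# Dimension sizes as quoted on the slides; the key names are illustrative.
dimensions = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}

combinations = prod(dimensions.values())
print(f"{combinations:,}")  # 393,120,000,000, i.e. roughly 393 billion
```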
Unknown Query Patterns
Any combination of dimensions
Variety of Aggregations
- Heatmap
- Top N
- Histogram
- count(), avg(), sum(), percent(), geo
Large Data Volume
• Hundreds of thousands of events per second
• At least dozens of fields in each event
Multiple Topics
Rider States Driver States
Let’s build a stream processing pipeline
Pipeline Template
Event Collection
Multiple Event Types with Different Volume
Hundreds of Thousands of Events Per Second
Events Should Be Available Under a Second
Events Should Rarely Get Lost
Multiple Consumers
Natural Choice: Apache Kafka
- Low latency and high throughput
- Persistent events
- Distributes a topic by partitions
- Groups consumers by consumer groups
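The last two points can be pictured with a small simulation that does not use any Kafka client at all. It is a sketch under assumptions: `zlib.crc32` stands in for the hash a real producer would use, and the round-robin assignment is only one possible way a consumer group could divide partitions. Both function names are hypothetical.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash.

    crc32 is a stand-in chosen for this sketch; Kafka clients use their own
    hash function, so the actual partition numbers would differ.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin the partitions over the members of one consumer group.

    Each partition ends up with exactly one owner inside the group, which is
    what lets a group share the load of a topic.
    """
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment
```

For example, `assign_partitions(4, ["a", "b"])` gives consumer `a` partitions 0 and 2 and consumer `b` partitions 1 and 3. A second consumer group would get its own independent assignment over the same partitions.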
Event Processing
Transformation
Event Transformation Example
(Lat, Long) -> (zipcode, hexagon, S2)
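As an illustration of the hexagon part of this transformation, the sketch below maps a coordinate to a cell of a flat hexagonal grid. It is not the index used in the talk, which the slides do not describe: the equirectangular projection, the 500 m cell size, the reference latitude, and both function names are assumptions for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0

def planar_to_hex(x: float, y: float, size: float) -> tuple[int, int]:
    """Return axial (q, r) coordinates of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Round in cube coordinates, then repair the component with the largest
    # rounding error so that the three components still sum to zero.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz

def latlon_to_hex(lat: float, lon: float, size_m: float = 500.0,
                  ref_lat: float = 37.77) -> tuple[int, int]:
    """Project (lat, lon) to local metres and look up the hexagon cell.

    ref_lat is an assumed reference latitude for the projection (roughly
    San Francisco); the approximation is only reasonable within one city.
    """
    x = math.radians(lon) * EARTH_RADIUS_M * math.cos(math.radians(ref_lat))
    y = math.radians(lat) * EARTH_RADIUS_M
    return planar_to_hex(x, y, size_m)
```

Every event that falls into the same cell gets the same `(q, r)` key, which is what makes later grouping by hexagon possible.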
Pre-aggregation
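One way to picture pre-aggregation is to count events per combination of the dimensions listed earlier (hexagon, vehicle type, minute bucket, status) before they reach the query layer. The sketch below assumes events are dictionaries with hypothetical field names; it is not the job used in the talk.

```python
from collections import Counter

def minute_bucket(ts_seconds: float) -> int:
    """Truncate a Unix timestamp to the start of its minute."""
    return int(ts_seconds // 60) * 60

def pre_aggregate(events):
    """Count events per (hexagon, vehicle_type, minute, status).

    events: iterable of dicts with the hypothetical keys
    "hexagon", "vehicle_type", "ts" and "status".
    """
    counts = Counter()
    for event in events:
        key = (
            event["hexagon"],
            event["vehicle_type"],
            minute_bucket(event["ts"]),
            event["status"],
        )
        counts[key] += 1
    return counts
```

Downstream queries then read one row per key instead of every raw event.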
Joining Multiple Streams
Sessionization
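A common way to sessionize is to split one entity's events wherever the gap between consecutive events exceeds a threshold. The following minimal sketch assumes that approach; the 30-minute default and the function name are illustrative and do not come from the talk.

```python
def sessionize(timestamps, max_gap=30 * 60):
    """Group timestamps (in seconds) into sessions separated by gaps > max_gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)   # close enough: extend the current session
        else:
            sessions.append([ts])     # gap too large (or first event): new session
    return sessions
```

In a streaming job the same rule would be applied incrementally, with the last-seen timestamp per entity kept in local state.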
Multi-Staged Processing
Minimum Requirements
- State Management
- Checkpointing
- Automatic Resource Management
- Multi-staged processing
Apache Samza
Why Apache Samza?
- DAG on Kafka
- Excellent integration with Kafka
- Built-in checkpointing
- Built-in state management
- Excellent support from our data team
Samza Is Conceptually Simple
Complex Event Processing
Slightly Expanded Version
Applications
Dashboard of Realtime Business Metrics
Ad-Hoc Queries
Visualization with Streaming
Visualization with Streaming
LocationUpdate where city = X
LocationUpdate where city = Y and vehicle = ‘UberX’
100% 100% 100% 10% 5%
Visualization with Streaming
LocationUpdate where city = ‘SF’
LocationUpdate where city = ‘LA’ and vehicle
10% 5% 100% 100%
Ad-hoc Exploration
A Few Trade-Offs
Lambda vs Kappa
We Use Lambda
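- Spark + HDFS/S3 for batch processing
- Yes, it is painful, but
- We may need to go way back due to changes in business requirements
- Batch processes can run faster; they scale differently
- It was not easy to start a new stream processing instance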
Processing by Event Time Is Not Always Easy
Leverage The Storage Layer
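Dealing with Limitations of Samza
- No broadcasting. We have to override SystemStreamPartitionGrouper
- No dynamic topology. Can’t have an arbitrary number of nested CEP queries
- Tedious configuration and deployment of jobs. In-house code-gen and deployment solution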
Thank You
Stream Processing with Kafka in Uber, Danny Yuan
The session will discuss how Uber evolved its stream processing system to handle a number of use cases in Uber Marketplace, with a focus on how Apache Kafka and Apache Samza played an important role in building a robust and efficient data pipeline. The use cases include, but are not limited to, real-time aggregation of geospatial time series, computing key metrics, forecasting marketplace dynamics, and extracting patterns from various event streams. The session will present how Kafka and Samza are used to meet the requirements of these use cases, what additional tools are needed, and lessons learned from operating the pipeline.

Published in: Engineering
