The world of real-time data processing is constantly evolving, with new technologies and platforms emerging to meet the ever-increasing demands of modern data-driven businesses. Apache Flink and RisingWave are two powerful stream processing solutions that have gained significant traction in recent years. But which platform is right for your organization? Karin Wolok and Yingjun Wu go head-to-head to compare and contrast the strengths and limitations of Flink and RisingWave. They’ll also share real-world use cases, best practices for optimizing performance and efficiency, and key considerations for selecting the right solution for your specific business needs.
Battle of the Stream Processing Titans – Flink versus RisingWave
1. Battle of the Stream Processing Titans – Flink versus RisingWave
Karin Wolok, Project Elevate
Yingjun Wu, RisingWave Labs
2. About Karin
• Developer Relations Consultant (ProjectElevate.io)
• Ex-StarTree
• Ex-Neo4j
• Formerly ran campaigns for renowned individuals and organizations such as Eminem, Live Nation, ReMax, and Novartis
• Conference speaker who has presented at over 50 conferences globally
3. About Yingjun
• Founder and CEO of RisingWave Labs
• Ex-AWS Redshift
• Ex-IBM Almaden Research Center
• PhD, National University of Singapore
• Visiting PhD, Carnegie Mellon University
4. Background
• People need real-time insights
• Examples: stock market monitoring, inventory management, parcel tracking, web clickstream
5. Background
[Chart: business value versus freshness (sub-second, seconds, minutes, hours, days), showing stock market monitoring, inventory management, parcel tracking, and web clickstream]
6. Background
[Chart: business value versus freshness (sub-second, seconds, minutes, hours, days), with the label “Batch processing”]
7. Background
[Chart: business value versus freshness (sub-second, seconds, minutes, hours, days), with two labels reading “Batch processing”]
14. Flink vs. RisingWave
• Applications and use cases
• User interface
• Internal architecture
15. Applications and Use Cases
[Figure: use cases placed along a latency scale (1 microsecond, 1 millisecond, 1 second, 1 minute, 1 hour, 1 day): high-frequency trading, fraud detection, IoT computing, ads recommendation, stock dashboarding, delivery app, inventory tracking, ML training, data science, accounting, network monitoring, travel booking]
16. Applications and Use Cases
• Streaming ETL: continuously ingest data from upstream systems, perform transformations, and deliver results to downstream systems
• Streaming analytics: monitoring, alerting, automation, etc.
17. Applications and Use Cases
[Diagram labels: databases, messaging systems, file systems]
18. Applications and Use Cases
[Diagram labels: databases, messaging systems, file systems; serving systems; databases, messaging systems, file systems]
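The streaming ETL pattern from the slides above (ingest, transform, deliver) can be sketched in plain Python. This is a minimal illustration, not Flink or RisingWave code: the function names, the `parcel_id` and `status` fields, and the in-memory list used as a sink are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List


def ingest(source: Iterable[Dict]) -> Iterator[Dict]:
    """Yield events one at a time, standing in for an upstream system."""
    for event in source:
        yield event


def transform(events: Iterable[Dict]) -> Iterator[Dict]:
    """Drop malformed events and normalize the status field."""
    for event in events:
        if "parcel_id" not in event:
            continue  # skip events that lack the (hypothetical) key field
        yield {
            "parcel_id": event["parcel_id"],
            "status": event.get("status", "unknown").upper(),
        }


def deliver(events: Iterable[Dict], sink: List[Dict]) -> None:
    """Append each result to a list, standing in for a downstream system."""
    for event in events:
        sink.append(event)


def run_pipeline(source: Iterable[Dict], sink: List[Dict]) -> None:
    """Chain the three stages; each event flows through as it arrives."""
    deliver(transform(ingest(source)), sink)
```

Because each stage is a generator, an event passes through all three stages before the next one is read, which is the per-event behavior that distinguishes streaming from batch processing.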
20. User Interface
Flink
• MapReduce-style API, SQL/Python wrapper
• Flink job to represent a data processing pipeline
RisingWave
• PostgreSQL-compatible, Python UDF
• Materialized view to represent a data processing pipeline
21. User Interface
Flink
• Each Flink job is independent
RisingWave
• Materialized views can be dependent
[Diagram labels: Flink job1, Flink job2, Flink job3; MV1, MV2, MV3, MV4, MV5, MV6]
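To illustrate what "materialized views can be dependent" means, here is a toy Python sketch in which one view is maintained on top of another rather than on the raw stream. It is not RisingWave's implementation; `CountView`, `PopularAdsView`, and the threshold are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List


class CountView:
    """Toy stand-in for a materialized view: clicks per ad, updated per event."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = defaultdict(int)
        self.downstream: List["PopularAdsView"] = []

    def on_event(self, ad_id: str) -> None:
        self.rows[ad_id] += 1
        # Push the changed row to every view defined on top of this one.
        for view in self.downstream:
            view.on_change(ad_id, self.rows[ad_id])


class PopularAdsView:
    """Second view that depends on CountView instead of the raw stream."""

    def __init__(self, upstream: CountView, threshold: int) -> None:
        self.threshold = threshold
        self.rows: Dict[str, int] = {}
        upstream.downstream.append(self)

    def on_change(self, ad_id: str, count: int) -> None:
        # Keep only ads whose click count has reached the threshold.
        if count >= self.threshold:
            self.rows[ad_id] = count
```

The second view reuses the counts that the first view already maintains, so the counting work is done once. Two independent jobs computing the same results would each have to count the clicks themselves.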
23. Internal Architecture
• Consider joining two data streams
• Impression stream
• Click stream
[Diagram: inputs Impression (adId, impressionTime) and Click (adId, clickTime); output (adId, impressionTime, clickTime); state: hash table for impression stream, hash table for click stream]
How to manage internal states?
24. Internal Architecture
[Diagram: the impression and click stream join from slide 23, with the label “Burst!” added]
How to manage internal states?
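The join on these slides keeps one hash table per input stream. A minimal Python sketch of such a symmetric hash join follows, assuming an unbounded join with no time window; the class and method names are hypothetical, and only the tuple layouts come from the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Output = Tuple[str, int, int]  # (adId, impressionTime, clickTime)


class StreamJoin:
    """Join impressions and clicks on adId, keeping both sides as state."""

    def __init__(self) -> None:
        self.impressions: Dict[str, List[int]] = defaultdict(list)
        self.clicks: Dict[str, List[int]] = defaultdict(list)

    def on_impression(self, ad_id: str, impression_time: int) -> List[Output]:
        # Store the new row, then probe the other side's hash table.
        self.impressions[ad_id].append(impression_time)
        return [(ad_id, impression_time, c) for c in self.clicks.get(ad_id, [])]

    def on_click(self, ad_id: str, click_time: int) -> List[Output]:
        self.clicks[ad_id].append(click_time)
        return [(ad_id, i, click_time) for i in self.impressions.get(ad_id, [])]

    def state_size(self) -> int:
        """Total number of rows held in both hash tables."""
        return sum(len(v) for v in self.impressions.values()) + sum(
            len(v) for v in self.clicks.values()
        )
```

Every arriving row is stored, so `state_size()` only grows. A burst of traffic on either stream inflates the state directly, which is why the question of how to manage internal state matters.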
25. Internal Architecture
Flink: MapReduce style, compute-storage coupled
RisingWave: cloud-native style, compute-storage decoupled
[Diagram labels: State, Storage (S3), Compute (EC2)]
26. Internal Architecture
Flink (MapReduce style, compute-storage coupled): optimized for performance!
RisingWave (cloud-native style, compute-storage decoupled): optimized for cost-efficiency!
[Diagram labels: State, Storage (S3), Compute (EC2)]
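One way to picture the compute-storage decoupled design is a compute node that holds only a bounded cache while the full state lives in remote storage. The sketch below is a simplification under stated assumptions: `RemoteStore` is an in-memory stand-in for object storage such as S3, writes go through synchronously, and eviction is plain LRU. Neither system is claimed to work exactly this way.

```python
from collections import OrderedDict
from typing import Any, Dict, Optional


class RemoteStore:
    """In-memory stand-in for remote object storage (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.reads = 0  # counts how often the compute node had to go remote

    def get(self, key: str) -> Optional[Any]:
        self.reads += 1
        return self.data.get(key)

    def put(self, key: str, value: Any) -> None:
        self.data[key] = value


class CachedState:
    """State whose source of truth is remote; the node keeps an LRU cache."""

    def __init__(self, remote: RemoteStore, capacity: int) -> None:
        self.remote = remote
        self.capacity = capacity
        self.cache: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, key: str, value: Any) -> None:
        self.remote.put(key, value)  # write-through, a simplification
        self._remember(key, value)

    def get(self, key: str) -> Optional[Any]:
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        value = self.remote.get(key)  # cache miss: fetch from remote storage
        if value is not None:
            self._remember(key, value)
        return value

    def _remember(self, key: str, value: Any) -> None:
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
```

The trade-off is visible in `RemoteStore.reads`: a cache miss costs a remote round trip, which a coupled design avoids by keeping all state on the node. In exchange, the node's memory no longer has to grow with the state.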
27. Internal Architecture (Failure Recovery)
[Diagram: compute nodes and persistent storage; labels: State, States, Checkpoint, Cache, “state as checkpoint”]
28. Internal Architecture (Failure Recovery)
[Diagram: compute nodes and persistent storage; labels: State, States, Checkpoint, Cache, “state as checkpoint”, “Recover from checkpoint”, “Read from remote state”]
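The two recovery paths named on this slide, "recover from checkpoint" and "read from remote state", can be contrasted with a small Python sketch. Both classes are hypothetical toys that count events per key. The synchronous write to remote storage in `DecoupledOperator` is an assumption made to keep the example short.

```python
import copy
from typing import Dict, Iterable


class CoupledOperator:
    """State lives on the compute node; a checkpoint is a full copy of it."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def process(self, key: str) -> None:
        self.state[key] = self.state.get(key, 0) + 1

    def checkpoint(self) -> Dict[str, int]:
        return copy.deepcopy(self.state)

    @classmethod
    def recover(cls, snapshot: Dict[str, int], replay: Iterable[str]) -> "CoupledOperator":
        op = cls()
        op.state = copy.deepcopy(snapshot)  # load cost grows with state size
        for key in replay:  # re-apply events seen after the checkpoint
            op.process(key)
        return op


class DecoupledOperator:
    """State lives in shared storage; the node only caches what it touches."""

    def __init__(self, remote: Dict[str, int]) -> None:
        self.remote = remote
        self.cache: Dict[str, int] = {}

    def read(self, key: str) -> int:
        if key not in self.cache:
            self.cache[key] = self.remote.get(key, 0)  # lazy remote read
        return self.cache[key]

    def process(self, key: str) -> None:
        count = self.read(key) + 1
        self.cache[key] = count
        self.remote[key] = count  # persisted state doubles as the checkpoint

    @classmethod
    def recover(cls, remote: Dict[str, int]) -> "DecoupledOperator":
        return cls(remote)  # start with an empty cache, nothing to load
```

In this toy model the coupled operator must load the whole snapshot and replay before it can continue. The decoupled one resumes immediately and warms its cache on demand.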
29. Internal Architecture (Elastic Scaling)
[Diagram: compute nodes and persistent storage; labels: State, States, Checkpoint, Cache, “state as checkpoint”, “Scale out”]
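Scaling out changes which worker owns which keys. The sketch below assumes simple modulo hashing, with `owner` and `keys_that_move` as hypothetical helpers, and shows how to find the keys whose owner changes when workers are added. Production systems typically use more careful partitioning schemes.

```python
import zlib
from typing import Iterable, List


def owner(key: str, num_workers: int) -> int:
    """Map a key to a worker using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def keys_that_move(keys: Iterable[str], old_workers: int, new_workers: int) -> List[str]:
    """Return the keys whose owning worker changes after rescaling."""
    return [k for k in keys if owner(k, old_workers) != owner(k, new_workers)]
```

With state held on the compute nodes, the state for every key in that list has to be transferred to its new owner before processing resumes. With state in shared persistent storage, only the ownership mapping changes and the new owner reads what it needs.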
30. Summary
Applications and use cases
• Flink and RisingWave: streaming ETL and streaming analytics
User interface
• Flink: low-level abstractions (Java) and high-level wrappers (SQL and Python)
• RisingWave: PostgreSQL-style SQL with Python UDF support
• Flink: use Flink jobs to represent stream processing pipelines; Flink jobs are independent
• RisingWave: use materialized views to represent stream processing pipelines; materialized views can be dependent, with resource sharing enabled
Internal architecture
• Flink: optimized for performance; slow in failure recovery; slow in elastic scaling
• RisingWave: optimized for cost-efficiency; fast in failure recovery; fast in elastic scaling