This tutorial gives a brief introduction to modern stream computing technologies. Participants learn the essential concepts and methodologies for designing and building an advanced stream processing system. The tutorial explains the fundamentals behind the main design choices, and its last section offers a forecast of technology developments in this domain.
4. Murphy’s Law
• Everything that Can Go Wrong, Goes Wrong
– Unstable Servers
– Unstable Network
– Unstable Data Source
– Unstable Managers
– Unstable…
6. Agenda
• Section I: A Scratch
– Brief Intro to Streams
• Section II: Build From Ground Up
– Modern Stream Processing Architecture
• Section III: Could Be Much Sexier
– Stream Evolution in Progress
21. Common Partition Rules
• Feature Based
– Application Related
• Random
– Balancing Load
• Hash
– Aggregating Data by User Defined Key
• Replication
– For Improving Availability
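
The four rules above can be sketched as small routing functions. This is a minimal illustration, not taken from the slides: the function names, the use of CRC32 as the hash, and the "next N partitions" replica placement are all assumptions made for the example.

```python
import random
import zlib
from typing import Callable, List


def feature_partition(record: dict, rule: Callable[[dict], int]) -> int:
    """Feature-based rule: the application supplies its own routing function."""
    return rule(record)


def random_partition(num_partitions: int, rng: random.Random) -> int:
    """Random rule: balance load by ignoring the record content."""
    return rng.randrange(num_partitions)


def hash_partition(key: str, num_partitions: int) -> int:
    """Hash rule: records sharing a key always land on the same partition."""
    # CRC32 is stable across processes, unlike Python's built-in hash() on str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replicated_partitions(key: str, num_partitions: int, copies: int) -> List[int]:
    """Replication rule: send the record to several partitions for availability.

    Assumed placement: the hashed partition plus the next (copies - 1) ones.
    """
    primary = hash_partition(key, num_partitions)
    return [(primary + i) % num_partitions for i in range(copies)]
```

For instance, `hash_partition("user-42", 8)` returns the same partition on every call, which is what makes per-key aggregation possible, while `replicated_partitions("user-42", 8, 3)` returns three distinct partitions as long as `copies` does not exceed the partition count.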
41. While Reliability Hurts Performance
• All Reliability Solutions are Based on
– Indexing
– Snapshot
– Replay
• A Club Of Performance Penalties
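
How indexing, snapshot, and replay fit together can be shown with a toy operator. This is a hedged sketch under simple assumptions: the message log is an in-memory list indexed by offset, and `CountingOperator` is a hypothetical class, not part of any system named in this tutorial.

```python
import copy
from typing import Dict, List, Tuple

Snapshot = Tuple[Dict[str, int], int]  # (state copy, offset of next message)


class CountingOperator:
    """Counts words; its state can be snapshotted and rebuilt by replay."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}
        self.next_offset = 0  # index (into the log) of the next message

    def process(self, word: str) -> None:
        self.counts[word] = self.counts.get(word, 0) + 1
        self.next_offset += 1

    def snapshot(self) -> Snapshot:
        # Copy the state so later updates do not leak into the snapshot.
        return copy.deepcopy(self.counts), self.next_offset

    @classmethod
    def recover(cls, snap: Snapshot, log: List[str]) -> "CountingOperator":
        """Restore the snapshot, then replay every message logged after it."""
        op = cls()
        op.counts, op.next_offset = copy.deepcopy(snap[0]), snap[1]
        for word in log[op.next_offset:]:
            op.process(word)
        return op
```

Each of the three steps costs something on the normal path: the log must be written and indexed, the snapshot copies state, and recovery time grows with the number of messages replayed since the last snapshot.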
42. Tuning Strategies
• Stateful Operators vs. Stateless Operators
– Independent State Storage
– White Board Programming Model
– Lazy State Synchronization
• Micro Batching Snapshot
43. Micro Batching
Batch Size   Time Window   Throughput   Snapshot Cost   Restore Cost
1            1ms           1x           very high       very low
10           10ms          10x          high            low
100          100ms         100x         medium          low
1000         100ms         1000x        medium          low
10000        1s            5000x        low             high
Callouts: Most Systems Are Here; May Be Constrained By Network Configuration
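
The batch size and time window columns correspond to the two conditions that close a micro batch. The sketch below is an assumption-laden illustration: `MicroBatcher` and its parameters are hypothetical names, timestamps are passed in explicitly to keep the example deterministic, and the `on_batch` callback stands in for whatever per-batch work (such as taking one snapshot) a real system would do.

```python
from typing import Callable, List, Optional


class MicroBatcher:
    """Groups messages; a batch closes on size or on time-window expiry."""

    def __init__(self, max_size: int, window_s: float,
                 on_batch: Callable[[List[str]], None]) -> None:
        self.max_size = max_size
        self.window_s = window_s
        self.on_batch = on_batch  # called once per closed batch
        self.buffer: List[str] = []
        self.opened_at: Optional[float] = None

    def add(self, item: str, now: float) -> None:
        # Close a batch whose window already expired before accepting the item.
        if self.buffer and now - self.opened_at >= self.window_s:
            self.flush()
        if not self.buffer:
            self.opened_at = now  # the window starts with the first item
        self.buffer.append(item)
        if len(self.buffer) >= self.max_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.on_batch(self.buffer)
            self.buffer = []
            self.opened_at = None
```

With `max_size=3` and `window_s=0.010`, three messages arriving within 2 ms form one batch, and a message arriving 17 ms after the next batch opened forces that batch to close first. Larger batches amortize the per-batch cost over more messages, at the price of more work to redo after a failure.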
44. Pit Stop
• Reliability is based on
– Message Backup & Replay
– Status Snapshot
• When tuning, think of
– How to handle operator status
– Micro Batching
46. How Does Fluctuation Happen?
• Data Source Fluctuation
• Fault Tolerance Operations
47. Fluctuation Handling
• Technologies To Obtain
– High Performance RPC Framework
– Auto Partitioning
– Dynamic Resource Allocation
– Global Flow Control
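
The slides do not say how flow control is implemented. One common building block, shown here purely as an assumed illustration, is credit-based backpressure on a single link between two operators; a global scheme would coordinate such limits across the whole topology. `CreditedChannel` is a hypothetical name.

```python
from collections import deque
from typing import Deque


class CreditedChannel:
    """A link where the sender may only send while it holds credits."""

    def __init__(self, credits: int) -> None:
        self.credits = credits  # free slots on the receiver side
        self.queue: Deque[str] = deque()

    def try_send(self, msg: str) -> bool:
        if self.credits == 0:
            return False  # receiver is saturated; the sender must back off
        self.credits -= 1
        self.queue.append(msg)
        return True

    def receive(self) -> str:
        msg = self.queue.popleft()
        self.credits += 1  # consuming a message returns one credit
        return msg
```

When `try_send` returns `False`, the upstream operator can slow down, buffer, or shed load, so a burst at the data source does not overrun the downstream operator.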
48. High Performance RPC Framework
• Indication of High Performance
– Over 20k QPS, with 1-byte payload
– On a commodity server with two 6-core CPUs and Gigabit Ethernet
• See Also
– SOFA Framework from Baidu.com
– https://github.com/BaiduPS/sofa-pbrpc
74. See Also
• A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services
– 20x Performance
– ISCA 2014 by Microsoft Research
– http://research.microsoft.com/pubs/212001/Catapult_ISCA_2014.pdf
75. Pit Stop
• Scale-out is difficult; consider scale-up
• Reconfigurable CPUs have delivered significant performance improvements
76. Conclusion
• Stream Processing Systems Can Be Well Modeled by SDL
• Trade Off between Reliability & Performance
• High Level Programming & Scale-Up are Future Trends