Reflections on Almost Two Decades of Research into Stream Processing

This is the slide deck that I used during my tutorial presentation at the ACM DEBS Conference (http://www.debs2017.org/) that was held in Barcelona between June 19 and June 23, 2017.
The tutorial paper itself can be accessed here: http://dl.acm.org/citation.cfm?id=3095110

  1. DEBS’17 Tutorial 5: Reflections on Almost Two Decades of Research into Stream Processing. Kyumars Sheykh Esmaili, Real-Time Information Processing (RTIP) research team, Bell Labs, Nokia Inc. 20-06-2017
  2. A Short Bio (@kyumarss) • Streaming platform for IoT applications • Home networks monitoring • Hadoop/HDFS • Stream schema, stream provenance, continuous query modification
  3. Introduction: Streaming Has Gone Mainstream
  4. This Tutorial: Reflections on a Research History • Highlights - trends - best practices • Based on a select set of - major stream processing systems - landmark papers • Also lists a few directions for future research
  5. What this Tutorial Is NOT: A Survey of the Field • Cugola, Gianpaolo, et al. "Processing flows of information: From data stream to complex event processing." ACM CSUR, 2012. (based on a DEBS tutorial) • Heinze, Thomas, et al. "Tutorial: Cloud-based Data Stream Processing." ACM DEBS, 2014.
  6. Scope: Stream Processing vs Related Research Domains • Active Databases • Temporal Databases • Sequence Databases • CEP Systems • Stream Processing
  7. Main DBMS Principles • Set data model - Bounded - Unordered • Relational algebra/operators • Tuples updatable/replaceable - Random access • Passive • Query plan
  8. All Depart from the Established Principles of DBMSs • Domains: Active Databases, Temporal Databases, Sequence Databases, CEP Systems, Stream Processing • Main DBMS Principles - Set data model (bounded, unordered) - Relational algebra/operators - Tuples updatable/replaceable (random access) - Passive - Query plan • [Figure labels marking the principles each domain departs from; the per-domain mapping is not recoverable from the transcript: Unordered, Bounded, Unordered, Unordered, Unordered, Passive, Passive, Random Access, Random Access, Passive, Bounded, Query Plan, Relational operators]
  9. Outline • Introduction (~10’) • Part I: Notable Systems (~35’) • Break (10’) • Part II: Trends (~15’) • Part III: Best Practices (~10’) • Part IV: Future Research Directions (~10’)
  10. Part I: Notable Systems
  11. Stream Processing Timeline [timeline figure; axis ticks: 1998, 2001, 2004, 2007, 2010, 2013, 2016]
  12. Stream Processing Timeline [timeline figure, 1998 to 2016] • 1st Generation • 2nd Generation • 3rd Generation • 4th Generation
  13. Stream Processing Timeline: 1st Generation [timeline figure, 1998 to 2016] - Append-only model; fast sequential access (tape, live from network) - Impressive ideas: window, multiplex, demultiplex, flow language, sequential reads, minimal copying - Shared sub-queries - Upside-down tree! - Main requirements: performance and flexibility - Defines order attributes with ordering properties - GSQL (SQL + merge) - Dedicated operators - Punctuations/heartbeats to unlock operators - No explicit window - Edge processing (i.e. NIC) - University of Wisconsin-Madison - CQ subsystem of Niagara (“net” data management) - On XML datasets, using XML-QL - Key insight: large commonalities - Inter-query optimization (large scale + incremental) - It also splits queries
  14. Stream Processing Timeline: 1st Generation (cont.) [timeline figure, 1998 to 2016] - Brown Uni, Brandeis Uni, MIT - Aimed at monitoring streams - Lots of emphasis on QoS, approximate query answering - Arrows and Boxes (via GUI) - Notion of slack and bounded sort - UC Berkeley - Next step in the Telegraph project - Focused on adaptive query processing - Eddies - Flux - Fjords - Initially in Java - Re-implemented based on PostgreSQL - Stanford Uni - DSMS for processing continuous queries over streams and relations - An abstract semantics - CQL: a concrete declarative query language
  15. Stream Processing Timeline: 2nd Generation [timeline figure, 1998 to 2016] - Initially named Aurora* - Focused on distribution - Relies on Aurora for single-node stream processing and Medusa for the distribution - Revision processing - HA (high availability) - Connection points and time travel (replay mechanism) - One of the most mature systems out there - SPC & SPL - SPC: - Distributed, dynamic, and scalable - Beyond relational operators - Processing Elements (PEs) and PE Containers - Notions such as subscription & discovery - A very elaborate transport layer (Data Fabric) - SPL: - Custom language - Procedural - Code generation (C++) - Originally SPADE (mostly relational operators) - SPL focuses on UDFs - Operator spec includes selectivity, partitionability - Optional deployment and optimization hints
  16. Stream Processing Timeline: 3rd Generation [timeline figure, 1998 to 2016] - Partially fault-tolerant - No node addition/removal from the cluster - Design influenced by System S and MapReduce - One PE per key value - TTL-based removal - Abandoned in favor of Storm - First popular streaming platform - Simple abstractions: spouts and bolts - Allows building topologies - Platform takes care of shuffling and transport - At-least-once semantics - Enriched with Trident - Overhauled in Heron - UC Berkeley - Hadoop Online Prototype - Pipelines data between MapReduce operators - Co-scheduling - Pull-based Reduce => push-based Map - Retains the fault tolerance properties of Hadoop - Can run unmodified MapReduce programs
  17. Stream Processing Timeline: 4th Generation [timeline figure, 1998 to 2016] - UC Berkeley - Builds upon the Spark Core features - Micro-batching - A few new operators - State is also treated as RDD - Inherits fault tolerance capabilities of Spark - Offers exactly-once semantics - High-throughput, “high” latency - Has taken a back seat to Structured Streaming - Real use cases at Google - UDFs - Out-of-order processing (via watermarks) - Fault tolerance and exactly-once semantics - State management
  18. Stream Processing Timeline: 4th Generation (cont.) [timeline figure, 1998 to 2016] - Streaming as superset of batch - Session windows - Windowing, watermarks, triggers, refinement - FlumeJava + MillWheel - Built on top of Kafka - Heavily tied to it - Unix philosophy - At-least-once semantics - Relies on YARN for deployment - Alternative: Kafka Streams - TU Berlin - A collection of academic prototypes - Aiming at batch and iterative computations - Native support for streaming - UDFs as first-class citizens - Stateful is default - Stratosphere => Flink
  19. Part II: Trends
  20. Trends: Overview 1. From DSMSs to Big “Streaming” Data Frameworks 2. Domain-specific to General-purpose 3. Increased Importance of Exact Results 4. Richer Window Specifications 5. Unification of Batch and Streaming Models
  21. Trend 1: From DSMSs to Big “Streaming” Data Frameworks
  22. Primary Influencer: DBMS vs Big Data Frameworks [timeline figure, 1998 to 2016]
  23. Examples of DBMS Influence on Early Stream Processing Systems. Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM SIGMOD Record, 2003.
  24. Examples of DBMS Influence on Early Stream Processing Systems
  25. Trend 2: Domain-specific to General-purpose
  26. Initial Streaming Use Cases: Network Traffic + Sensor Networks [timeline figure, 1998 to 2016]
  27. Early Use Cases. Golab, Lukasz, and M. Tamer Özsu. "Issues in data stream management." ACM SIGMOD Record, 2003.
  28. Trend 3: Increased Importance of Exact Results
  29. Approximate Query Processing vs Exact Results [timeline figure, 1998 to 2016]
  30. Example: Approximate Query Processing in Aurora/Borealis
  31. Another Angle: One-pass Computation vs Replayability [timeline figure, 1998 to 2016]
  32. Going Beyond Guaranteed Delivery: Transactional Stream Processing • Meehan, John, et al. "S-Store: Streaming meets transaction processing." VLDB, 2015. • Affetti, Lorenzo, et al. "FlowDB: Integrating Stream Processing and Consistent State Management." ACM DEBS, 2017.
  33. Trend 4: Richer Window Specifications
  34. Window Types Supported by Almost All Systems. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
  35. Sessions: New Window Type
  36. Support for Session Windows [timeline figure, 1998 to 2016]
  37. Frames: Data-driven Windows [timeline figure, 1998 to 2016]
  38. Frames. Grossniklaus, Michael, et al. "Frames: data-driven windows." ACM DEBS, 2016.
  39. A Little More on “Semantic” Windows • Have always been supported by CEP systems • Main challenge: unpredictability • Artikis, Alexander, et al. "Complex Event Recognition Languages: Tutorial." ACM DEBS, 2017.
  40. Trend 5: Unification of Batch and Streaming Models
  41. First Attempt: Lambda Architecture
  42. The New Alternative: Unified Engines [timeline figure, 1998 to 2016]
  43. Examples of Unified Engines
  44. Part III: Best Practices
  45. Best Practices: Overview 1. Simplified Reasoning and Coordination via Punctuation 2. System-wide State Management
  46. Best Practice 1: Simplified Reasoning and Coordination via Punctuation
  47. Use of Punctuation in Stream Processing Platforms [timeline figure, 1998 to 2016]
  48. Use of Punctuation for Optimization • Tucker, Peter A., et al. "Exploiting punctuation semantics in continuous data streams." IEEE TKDE, 2003. • Li, Jin, et al. "Semantics and evaluation techniques for window aggregates in data streams." ACM SIGMOD, 2005.
  49. Use of Punctuation for Query Modification. Sheykh Esmaili, Kyumars, et al. "Changing flights in mid-air: a model for safely modifying continuous queries." ACM SIGMOD, 2011.
  50. Use of Punctuation for Snapshotting. Carbone, Paris, et al. "Lightweight asynchronous snapshots for distributed dataflows." arXiv preprint arXiv:1506.08603 (2015).
  51. Best Practice 2: System-wide State Management
  52. State Management: Different Aspects. To, Quoc-Cuong, et al. "A Survey of State Management in Big Data Processing Systems." arXiv preprint arXiv:1702.01596 (2017).
  53. State Management in Stream Processing: Main Cases
  54. Support for State Management in Stream Processing Systems [timeline figure, 1998 to 2016]
  55. State Management Examples: Load Balancing and Auto-Parallelization • Gedik, Buğra, et al. "Elastic scaling for data stream processing." IEEE Transactions on Parallel and Distributed Systems, 2014. • Shah, Mehul A., et al. "Flux: An adaptive partitioning operator for continuous query systems." ICDE, 2003.
  56. State Management Examples: Fault Tolerance
  57. Part IV: Future Research Directions
  58. IoT-induced Requirements for Stream Processing Platforms. Van Raemdonck, W., et al. "Building Connected Car Applications on Top of the World-Wide Streams Platform." ACM DEBS, 2017.
  59. Nokia Bell Labs’ World Wide Streams (WWS) Platform: Bird’s Eye View [architecture figure; labels: XStream Language & XStream Studio, Deployer (appears twice), Placement Algorithm, Site Monitor, Media Processor, Processing Sites, XStream Processor, Geo Processor, Media Server, Message Broker, StreamBridge, Dispatcher, Registry, Gateway, Compiler, Orchestration Layer, External Interfaces]
  60. Reference • Esmaili, Kyumars Sheykh. "Reflections on Almost Two Decades of Research into Stream Processing." ACM DEBS, 2017. • Van Raemdonck, W., et al. "Building Connected Car Applications on Top of the World-Wide Streams Platform." ACM DEBS, 2017.
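
Slides 34 and 35 contrast the window types found in almost all systems with the newer session windows, but the transcript does not preserve the figures. The Python below is a minimal sketch of that contrast, not code from the tutorial. It assumes fixed (tumbling) windows are among the widely supported types, and that a session is a maximal run of events whose consecutive timestamps are at most `gap` apart. The function names, the `size` and `gap` parameters, and the sample timestamps are all illustrative. It works on a finite, already collected list; a streaming engine would also have to handle out-of-order arrival, for example with the watermarks mentioned on slides 17 and 18.

```python
from typing import Dict, Iterable, List


def tumbling_windows(timestamps: Iterable[float], size: float) -> Dict[float, List[float]]:
    """Assign each timestamp to the fixed-size window that contains it.

    Window boundaries depend only on the clock: window k covers
    [k * size, (k + 1) * size).
    """
    windows: Dict[float, List[float]] = {}
    for ts in timestamps:
        start = int(ts // size) * size  # start of the window containing ts
        windows.setdefault(start, []).append(ts)
    return windows


def session_windows(timestamps: Iterable[float], gap: float) -> List[List[float]]:
    """Group timestamps into sessions separated by more than `gap` of inactivity.

    Window boundaries depend on the data: a session stays open as long as
    the next event arrives within `gap` of the previous one.
    """
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):  # sorting stands in for in-order arrival
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity gap exceeded: new session
    return sessions


if __name__ == "__main__":
    events = [8, 9, 11, 12]  # hypothetical event timestamps
    print(tumbling_windows(events, size=10))
    print(session_windows(events, gap=5))
```

With the sample timestamps, the tumbling windows split the burst of activity at the boundary 10, while the session window keeps all four events together. This illustrates why data-driven boundaries suit activity-based analyses.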
