This slide deck explores trends in stream processing, how streaming SQL has become a standard, the advantages of streaming SQL and more.
View video: https://wso2.com/library/conference/2018/07/wso2con-usa-2018-the-rise-of-streaming-sql/
2. What is Streaming Data?
A series of events/data having the same schema/format, appearing continuously
Coke 24 Fanta 14 Sprite 20 Coke 4
<coke>24</coke> <fanta>14</fanta> <sprite>20</sprite> <coke>4</coke>
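As a rough illustration of "events with the same schema appearing continuously", the sample above can be read lazily into typed events. This is only a sketch: the `SaleEvent` type and the `parse_events` helper are hypothetical names introduced here, not part of the deck.

```python
from typing import Iterable, Iterator, NamedTuple


class SaleEvent(NamedTuple):
    """One event in the stream; every event shares this schema (hypothetical type)."""
    product: str
    quantity: int


def parse_events(tokens: Iterable[str]) -> Iterator[SaleEvent]:
    """Lazily turn 'name quantity name quantity ...' tokens into events."""
    it = iter(tokens)
    # zip(it, it) pairs each product name with the quantity that follows it
    for name, quantity in zip(it, it):
        yield SaleEvent(product=name, quantity=int(quantity))


raw = "Coke 24 Fanta 14 Sprite 20 Coke 4"
events = list(parse_events(raw.split()))
```

Because `parse_events` is a generator, the same code would work on an unbounded token source; `list(...)` is used here only because the sample is finite.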
3. Almost All Data is Streaming!
All data is generated one event at a time,
hence even batch data was streaming at some point
● Logs
● Transaction data
● Sensor data
● Traffic data
Data is streaming at the source!
4. Streaming Data Processing
● Process data at the source, or process it before we store it
● Identify insights in real-time and act immediately
● Reduce unnecessary data storage and batch processing
[Diagram: Logs, Sensors, Devices, Apps, Services → Stream Processing → Alerts, Dashboards, Services, Databases]
5. Streaming Data Processing Operations
● Event driven architecture
● Streaming data integration
● Streaming data preprocessing
● Data store integration
● Service integration
● Streaming data summarization
● KPI analysis and alerts
● Event correlation
● Pattern matching
● Trend analysis
● Real-time prediction
● Streaming machine learning
● … more
6. Stream Processing Market
Positives
● Analytics and machine learning use cases shifting to stream processing
● Positive trends
○ Microservices and observability
○ Rise of IoT
○ Security analytics
○ ETL and messaging
Negatives
● Lack of proficient developers is slowing it down
● Success depends on the success of the analytics and integration market
● Market size
○ 300 ~ 500 million having 30%
7. Building Streaming Apps
1. Code it yourself
+ Customized for your requirement
− A lot of glue code needs to be written
2. Stream Processors
+ Code only actors and data handlers
+ Can scale and handle failure
− Hard to maintain and change
3. Graphical Tools
+ Good for novice users & can visualize the topology
− Inefficient for advanced users
4. Streaming SQL
+ Good for advanced users
+ Easier to understand and faster to implement
− Not easy to visualize the topology
8. History of Stream Processing
Databases: Users query when they need data
Active Databases: Users want to act when data meets a condition
TelegraphCQ (based on PostgreSQL):
Long-running continuous queries over data streams
Complex Event Processing:
Detects complex event patterns and correlations; runs on 1 or 2 nodes & is not scalable
E.g. SASE, Esper, Cayuga, Siddhi (powers WSO2 SP), Apama, IBM InfoSphere
Stream Processing:
Scalable processing of data using a graph of actors; runs on many nodes & scales
E.g. Aurora, PIPES, STREAM, Borealis (academic)
Niche Applications:
Stock markets, monitoring and alerts, & surveillance
Stream Processing Enters Big Data:
Yahoo S4 (2010) and Twitter Storm (2011), which was donated to Apache
Described as “like Hadoop, but in real-time”
Wide adoption and visibility:
Spark Streaming, Samza, Flink
Big Data Switched to SQL:
From coding based MapReduce
Stream Processing + CEP Merge:
Support SQL over many nodes in real-time
Streaming SQL:
Apache Storm, Apache Flink, WSO2 SP, Apache Kafka (KSQL), Apache Samza, and Calcite
19. SQL vs Streaming SQL
SQL
● Works on a finite data table
● Queries run over static data
● Synchronous response
Streaming SQL
● Works on an infinite data table (a data stream)
● Data runs over static queries
● Asynchronous response
[Diagram: in SQL, a query is run against stored data; in streaming SQL, data flows through a standing query]
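The contrast above can be sketched in a few lines. The following is an illustrative model only, not how any particular engine works: `run_query_once` and `standing_query` are hypothetical helpers, and the sample rows are made up for the example.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]


def run_query_once(table: List[Event], predicate: Callable[[Event], bool]) -> List[Event]:
    """SQL style: the query visits a finite table and returns its result synchronously."""
    return [row for row in table if predicate(row)]


def standing_query(stream: Iterable[Event],
                   predicate: Callable[[Event], bool],
                   on_match: Callable[[Event], None]) -> None:
    """Streaming SQL style: the query stays in place, events flow through it,
    and results are pushed to a callback as they occur."""
    for event in stream:
        if predicate(event):
            on_match(event)


rows = [{"amount": 24.0}, {"amount": 14.0}, {"amount": 20.0}, {"amount": 4.0}]


def is_large(event: Event) -> bool:
    return event["amount"] > 15


# Query runs over static data
batch_result = run_query_once(rows, is_large)

# Data runs over a static query; results arrive through the callback
pushed: List[Event] = []
standing_query(iter(rows), is_large, pushed.append)
```

Both calls select the same events here because the "stream" is a finite list; with a real, unbounded source the standing query would never return and would keep invoking the callback.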
20. Siddhi Streaming SQL Overview
@app:name('Sweet-Factory-Analytics')
-- Source/Sink & Streams
@source(type = 'mqtt', …, @map(type = 'json', …))
define stream SweetProductionStream(name string, amount double);
-- Queries
from SweetProductionStream[amount < 100 and name == 'candy']
select name, sum(amount) as cost
group by name
insert into LowCostCandyProductionStream;
-- Tables
@store(type = 'rdbms', … )
@primaryKey('id')
@Index('amount')
define table ProductionTable(name string, cost double);
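To make the query's behaviour concrete, here is a minimal Python sketch of the filter, group-by, and sum steps. It assumes that, because the query has no window, the sum accumulates over all matching events seen so far and one output is emitted per matching input; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def low_cost_candy(events: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Filter, then keep a running sum per name, emitting one output per match."""
    totals: Dict[str, float] = defaultdict(float)
    for name, amount in events:
        # Same condition as the stream filter: amount < 100 and name == 'candy'
        if amount < 100 and name == "candy":
            totals[name] += amount      # sum(amount) ... group by name
            yield name, totals[name]    # insert into the output stream


production = [("candy", 40.0), ("toffee", 10.0), ("candy", 250.0), ("candy", 30.0)]
output = list(low_cost_candy(production))
```

The toffee event fails the name check and the 250.0 candy event fails the amount check, so only two outputs are produced.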
22. Challenges
In streaming SQL
● Not easy to visualize the topology
In stream processing
● Inability to handle state
● Needs multiple nodes
● Does not support online machine learning
● Does not support long running aggregates in real-time
26. Challenge: Not Easy to Visualize Topology
● Graphical stream SQL query editor
● Drag & drop support
● Switch between source & design views
27. Challenge: Handle State & Need for Multi Nodes
• Minimum HA deployment of 2 nodes
– Processes up to 100k events/sec
– Most other stream processing systems need around 5+ nodes
• Scale more with Kafka
• Incremental state persistence and recovery
[Diagram: Event Sources feed two Stream Processor nodes running Siddhi Apps, backed by an Event Store; outputs go to Dashboard, Notification, Invocation, and Data Source]
28. Challenge: Lack of Knowledge About Future
Running PMML models for predictions
● Build PMML models via Apache Spark MLlib, H2O.ai, R or Python
● Load the built PMML model into Siddhi and predict in real-time
Supporting native prediction models:
● Spark MLlib models and Java-based TensorFlow models
Online learning and predictions
● Regression analytics
● Markov models
● Anomaly detection
● K-Means clustering
● … more
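As one illustration of what "online learning" means for a stream, the sketch below flags anomalies while updating its model one event at a time. It is a generic stand-in (a running mean and standard deviation via Welford's algorithm), not the implementation used by Siddhi or WSO2 SP; the class name, threshold, warm-up length, and readings are all assumptions made for the example.

```python
import math


class OnlineAnomalyDetector:
    """Scores each value against statistics learned so far, then learns from it."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold  # how many standard deviations count as anomalous
        self.warmup = warmup        # events to observe before scoring starts
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations from the mean

    def update(self, value: float) -> bool:
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            is_anomaly = std > 0 and abs(value - self.mean) > self.threshold * std
        # Welford's update: constant memory, no need to store past events
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly


detector = OnlineAnomalyDetector()
readings = [20.0, 21.0, 19.5, 20.5, 20.2, 19.8, 45.0, 20.1]
flags = [detector.update(r) for r in readings]
```

Only the 45.0 reading is flagged. Note that the outlier is still folded into the statistics afterwards, which widens the tolerated range for later events; a production model would likely treat flagged values differently.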
29. Challenge: Cannot Run Long Running Aggregates
● Incremental aggregation
○ Aggregation for every second, minute, hour, … , year
● Built on top of architecture
● No big data storage is necessary
● Current values in memory and others on disk
● Executed in a single query
[Diagram: second, minute, and hour buckets, with the current minute and current hour being updated]
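The idea of keeping aggregates at several time granularities can be sketched as follows. This is a simplified model under stated assumptions: it keeps (sum, count) partial aggregates per second, minute, and hour bucket and updates them in a single pass, but it does not model the memory/disk split or granularities beyond hours. `IncrementalAggregator` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Bucket sizes in seconds for each granularity kept by this sketch
GRANULARITIES = {"sec": 1, "min": 60, "hour": 3600}


class IncrementalAggregator:
    """Maintains per-granularity [sum, count] buckets keyed by bucket start time."""

    def __init__(self) -> None:
        self.buckets: Dict[str, Dict[int, List[float]]] = {
            name: defaultdict(lambda: [0.0, 0]) for name in GRANULARITIES
        }

    def add(self, timestamp: int, value: float) -> None:
        """Fold one event (epoch seconds, value) into every granularity."""
        for name, size in GRANULARITIES.items():
            start = timestamp - timestamp % size  # start of the enclosing bucket
            bucket = self.buckets[name][start]
            bucket[0] += value
            bucket[1] += 1

    def average(self, granularity: str, timestamp: int) -> Optional[float]:
        """Average for the bucket containing timestamp, or None if it is empty."""
        size = GRANULARITIES[granularity]
        start = timestamp - timestamp % size
        total, count = self.buckets[granularity].get(start, (0.0, 0))
        return total / count if count else None


agg = IncrementalAggregator()
for ts, value in [(0, 10.0), (0, 20.0), (59, 30.0), (60, 40.0), (3600, 100.0)]:
    agg.add(ts, value)

first_second = agg.average("sec", 0)
first_minute = agg.average("min", 30)
first_hour = agg.average("hour", 100)
```

Storing sums and counts rather than finished averages is what lets coarser buckets stay exact without revisiting the raw events.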
30. When to Use WSO2 Stream Processor
1. Start with 2 nodes and scale without changing queries
2. Detect complex event patterns over time
3. Run machine learning models to perform online learning
4. Fuse data in motion and data at rest
5. Perform aggregations from seconds to years
6. Let end users tweak queries
7. Achieve real-time ETL
8. Run rule-based decision making
9. … more