[WSO2Con USA 2018] The Rise of Streaming SQL

Sriskandarajah Suhothayan
Director, WSO2

What is Streaming Data?
A series of events/data having the same schema/format
appearing continuously
Coke 24 Fanta 14 Sprite 20 Coke 4
<coke>24</coke> <fanta>14</fanta> <sprite>20</sprite> <coke>4</coke>

Almost All Data is Streaming!
All data is generated one event at a time,
so even batch data was streaming at some point:
● Logs
● Transaction data
● Sensor data
● Traffic data
Data is streaming at the source!

Streaming Data Processing
● Process data at the source, or before storing it
● Identify insights in real time and act immediately
● Reduce unnecessary data storage and batch processing
Sources (logs, sensors, devices, apps, services) → Stream Processing → Outputs (alerts, dashboards, services, databases)

Streaming Data Processing Operations
● Event-driven architecture
● Streaming data integration
● Streaming data preprocessing
● Data store integration
● Service integration
● Streaming data summarization
● KPI analysis and alerts
● Event correlation
● Pattern matching
● Trend analysis
● Real-time prediction
● Streaming machine learning
● … more

Stream Processing Market
Positives
● Analytics and machine learning use cases are shifting to stream processing
● Positive trends
○ Microservices and observability
○ Rise of IoT
○ Security analytics
○ ETL and messaging
Negatives
● A lack of proficient developers is slowing it down
● Success depends on the success of the analytics and integration market
● Market size
○ 300 ~ 500 million having 30%

Building Streaming Apps
1. Code it yourself
+ Customized for your requirement
− A lot of glue code needs to be written
2. Stream Processors
+ Code only actors and data handlers
+ Can scale and handle failure
− Hard to maintain and change
3. Graphical Tools
+ Good for novice users & can visualize the topology
− Inefficient for advanced users
4. Streaming SQL
+ Good for advanced users
+ Easier to understand and faster to implement
− Not easy to visualize the topology

History of Stream Processing
Databases: Users query when they need data

Active Databases: Users want to act when data meets a condition

TelegraphCQ (based on PostgreSQL):
Long-running continuous queries over data streams

Complex Event Processing:
Detect complex event patterns and correlations;
runs on 1 or 2 nodes & not scalable
E.g. SASE, Esper, Cayuga, Siddhi (powers WSO2 SP), Apama, and IBM InfoSphere
Stream Processing:
Scalable processing of data using a graph of actors;
runs on many nodes & scales
E.g. Aurora, PIPES, STREAM, Borealis (academic)

Niche Applications:
Stock markets, monitoring and alerts, & surveillance

Stream Processing Enters Big Data:
Yahoo S4 (2010) and Twitter Storm (2011) were donated to Apache

Described as “like Hadoop, but in real-time”
Wide adoption and visibility:
Spark Streaming, Samza, Flink

Big Data Switched to SQL:
From code-based MapReduce

Stream Processing + CEP Merge:
Support SQL over many nodes in real-time

Streaming SQL:
Apache Storm, Apache Flink, WSO2 SP, Apache Kafka (KSQL), Apache Samza, and Calcite

Streaming SQL
Source: https://tdwi.org/articles/2017/08/07/data-all-enabling-real-time-enterprise-with-data-streaming.aspx

SQL vs Streaming SQL
SQL
● Works on a finite data table
● Queries run over static data
● Synchronous response
Streaming SQL
● Works on an infinite data table == data stream
● Data runs over static queries
● Asynchronous response
[Diagram: in SQL, a query is applied to stored data; in streaming SQL, data flows through a standing query]
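The contrast above can be sketched in a few lines of Python. This is only an illustration of the two execution models, not how any streaming SQL engine is implemented; `run_query_once`, `ContinuousQuery`, and the `min_amount` threshold are hypothetical names invented for this sketch, and the rows reuse the drink counts from the earlier example.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


def run_query_once(table: List[Event], min_amount: float) -> List[Event]:
    """SQL style: the query runs once over a finite table and returns synchronously."""
    return [row for row in table if row["amount"] >= min_amount]


class ContinuousQuery:
    """Streaming SQL style: the query is registered once and stays fixed;
    events flow through it and matches are pushed to a callback."""

    def __init__(self, min_amount: float, on_match: Callable[[Event], None]) -> None:
        self.min_amount = min_amount
        self.on_match = on_match

    def send(self, event: Event) -> None:
        # Each arriving event is evaluated against the standing query.
        if event["amount"] >= self.min_amount:
            self.on_match(event)


table = [
    {"name": "Coke", "amount": 24},
    {"name": "Fanta", "amount": 14},
    {"name": "Sprite", "amount": 20},
    {"name": "Coke", "amount": 4},
]

# Finite table: ask once, get the answer back immediately.
batch_result = run_query_once(table, min_amount=20)

# Stream: results arrive asynchronously through the callback as events flow in.
stream_result: List[Event] = []
query = ContinuousQuery(min_amount=20, on_match=stream_result.append)
for event in table:  # stand-in for an unbounded source
    query.send(event)
```

Both paths select the same rows here; the difference is that the continuous query never "finishes" and keeps emitting as long as events keep arriving.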

Siddhi Streaming SQL Overview
-- Source/Sink & Streams
@app:name('Sweet-Factory-Analytics')
@source(type = 'mqtt', …, @map(type = 'json', …))
define stream SweetProductionStream(name string, amount double);

-- Queries
from SweetProductionStream[amount < 100 and name == 'candy']
select name, sum(amount) as cost
group by name
insert into LowCostCandyProductionStream;

-- Tables
@store(type = 'rdbms', … )
@primaryKey('name')
@Index('cost')
define table ProductionTable(name string, cost double);
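To make the query's behaviour concrete, here is a rough Python equivalent of the filter, running sum, and group-by. It assumes that without a window the sum keeps accumulating per group and that one output is emitted per matching input event; the function name and sample events are hypothetical and exist only for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def low_cost_candy_totals(
    events: Iterable[Tuple[str, float]],
) -> Iterator[Tuple[str, float]]:
    """Mimic: from Stream[amount < 100 and name == 'candy']
    select name, sum(amount) as cost group by name."""
    totals: Dict[str, float] = defaultdict(float)
    for name, amount in events:
        # Filter: only low-amount candy events pass.
        if amount < 100 and name == "candy":
            # Running sum per group (assumption: no window, so it never resets).
            totals[name] += amount
            yield name, totals[name]


events = [("candy", 40.0), ("toffee", 10.0), ("candy", 250.0), ("candy", 30.0)]
outputs = list(low_cost_candy_totals(events))
```

The generator emits nothing for the toffee event or the 250-unit candy event, since both fail the filter.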

Challenges
Source: https://www.pardot.com/blog/3-pressing-b2b-marketing-challenges-solved-with-marketing-automation/

Challenges
In streaming SQL
● Not easy to visualize the topology
In stream processing
● Inability to handle state
● Needs multiple nodes
● Does not support online machine learning
● Does not support long running aggregates in real-time

How Does WSO2 Stream Processor Solve Them?

Challenge: Not Easy to Visualize Topology
● Graphical streaming SQL query editor
● Drag & drop support
● Switch between source & design views

Challenge: Handle State & Need for Multi Nodes
• 2-node minimum HA
– Processes up to 100k events/sec
– While most other stream processing systems need around 5+ nodes
• Scale more with Kafka
• Incremental state persistence and recovery
[Diagram: Event Sources → two Stream Processor nodes, each running several Siddhi Apps, with an Event Store → Dashboard, Notification, Invocation, Data Source]

Challenge: Lack of Knowledge About Future
Running PMML models for predictions
● Build PMML models via Apache Spark MLlib, H2O.ai, R, or Python
● Load the built PMML model into Siddhi and predict in real time
Supporting native prediction models:
● Spark MLlib models and Java-based TensorFlow models
Online learning and predictions
● Regression analytics
● Markov models
● Anomaly detection
● K-Means clustering
● … more
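As one illustration of online learning on a stream, the sketch below flags anomalies using a running mean and variance maintained with Welford's algorithm. It is a generic technique chosen for this example, not WSO2 Stream Processor's implementation; the class name, the 3-sigma threshold, and the sample readings are assumptions.

```python
import math


class OnlineAnomalyDetector:
    """Learns mean and variance one event at a time (Welford's algorithm)
    and flags values far from what has been seen so far."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold  # assumed cut-off, in standard deviations
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> bool:
        # Score the value against statistics of earlier events only.
        is_anomaly = False
        if self.count >= 2:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) > self.threshold * std:
                is_anomaly = True
        # Then learn from it; no history of raw events is stored.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly


detector = OnlineAnomalyDetector(threshold=3.0)
readings = [20.0, 21.0, 19.0, 20.5, 19.5, 80.0, 20.0]
flags = [detector.update(r) for r in readings]
```

In this simplified version the flagged value is still folded into the statistics, which widens the tolerance afterwards; a more careful detector might exclude or down-weight flagged values.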

Challenge: Cannot Run Long-Running Aggregates
● Incremental aggregation
○ Aggregation for every second, minute, hour, … , year
● Built on top of architecture
● No big data storage is necessary
● Current values in memory and others on disk
● Executed in a single query
[Diagram: Sec → Min → Hour roll-up, with Current Min and Current Hour buckets]
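The roll-up idea behind incremental aggregation can be sketched as follows. This is a minimal illustration under stated assumptions, not Siddhi's actual mechanism: events arrive in timestamp order, timestamps are whole seconds, only sum and count are tracked, and `IncrementalAggregator`, `Bucket`, and `GRANULARITIES` are hypothetical names. Completed buckets stand in for the values that would be written to disk, while the current buckets stay in memory.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# (name, bucket size in seconds); the slide's chain continues up to years.
GRANULARITIES = [("sec", 1), ("min", 60), ("hour", 3600)]


@dataclass
class Bucket:
    start: int
    total: float = 0.0
    count: int = 0


class IncrementalAggregator:
    def __init__(self) -> None:
        self.current: Dict[str, Optional[Bucket]] = {n: None for n, _ in GRANULARITIES}
        self.completed: Dict[str, List[Tuple[int, float, int]]] = {
            n: [] for n, _ in GRANULARITIES
        }

    def add(self, timestamp: int, amount: float) -> None:
        # Raw events only ever touch the finest (per-second) bucket.
        self._merge(0, timestamp, amount, 1)

    def _merge(self, level: int, timestamp: int, total: float, count: int) -> None:
        name, size = GRANULARITIES[level]
        start = timestamp - timestamp % size
        bucket = self.current[name]
        if bucket is not None and bucket.start != start:
            # The bucket has closed: archive it and fold it into the next level up.
            self.completed[name].append((bucket.start, bucket.total, bucket.count))
            if level + 1 < len(GRANULARITIES):
                self._merge(level + 1, bucket.start, bucket.total, bucket.count)
            bucket = None
        if bucket is None:
            bucket = Bucket(start)
            self.current[name] = bucket
        bucket.total += total
        bucket.count += count


agg = IncrementalAggregator()
for ts, amount in [(0, 5.0), (0, 3.0), (1, 2.0), (61, 4.0), (3600, 1.0)]:
    agg.add(ts, amount)
```

Each coarser level is built from the finished buckets of the level below it, so an hourly or yearly total never requires rescanning raw events. A query for the live value of a coarse bucket would also need to add the still-open finer buckets.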

When to Use WSO2 Stream Processor
1. Start with 2 nodes and scale without changing queries
2. Detect complex event patterns over time
3. Run machine learning models to perform online learning
4. Fuse data in motion and data at rest
5. Perform aggregations from seconds to years
6. Let end users tweak queries
7. Achieve real-time ETL
8. Run rule-based decision making
9. … more
