Unlocking the most value from event streams often means building applications that analyze Kafka data, so how can organizations build these real-time analytics applications? In this talk, we examine an indexing approach that enables fast SQL analytics on data from Kafka, without data flattening or denormalization. Rockset is a real-time indexing database that builds an inverted index, a columnar index and a row index on all fields of your Kafka messages, including nested fields and arrays. This Converged Index accelerates various types of analytic queries (search, aggregations and joins) without the need to denormalize or transform data for performance reasons. With indexing delivering significant gains in query performance, we also need to index new data in a timely manner. We discuss several strategies used for efficient ingestion and indexing from Kafka, including rollups, write optimizations on the underlying RocksDB storage engine, and the disaggregation of ingest and query compute.
3. Unlocking Value from Event Streams
Event Streams
● Online advertising
● Web clicks
● Online gaming interactions
● Online purchases and bookings
● Financial transactions
● IoT - sensor data
Applications
● Real-time customer 360
● Real-time personalization
● Logistics tracking
● Security analytics
● Operational analytics
4. The Need for Real-Time Analytics
Past: Event streams → Data lake → ETL → Data warehouse → Offline reporting
Present: the same path to offline reporting, plus Event streams → Real-time database → Real-time data applications
5. Apache Kafka and Real-Time Analytics
● Apache Kafka is a foundational platform for real-time analytics
○ Central location for collecting event data and making it available in real time
○ Low latency and high write throughput
○ Queue: first-in, first-out (ordering is per partition)
Source: https://kafka.apache.org/powered-by
6. Rockset and Real-Time Analytics
Real-time indexing database for modern data applications, at massive scale, without operational overhead
7. How Kafka and Rockset Work Together
[Diagram labels: Events from apps, devices, sensors; KSQL enrichment; OLTP database or data lake; SQL, REST; Real-time analytics applications]
9. Query Latency
● Ad-hoc queries and drilldowns in real-time
● Millisecond-latency queries to support live dashboards and data APIs
● How to achieve low-latency queries?
10. Optimize Query Latency by Indexing
Traditional approach: Parallelize and scan
event data → MapReduce → reports (Column store)
Real-time analytics: Parallelize and index
event data → Converged Indexing → ad hoc analytics (Column, Inverted and Row store)
11. Converged Index
● All fields are indexed in inverted, columnar and row indexes
● Accelerates search, aggregation and join queries
● No index definition required
<doc 0>
{
"name": "Igor"
}
<doc 1>
{
"name": "Dhruba"
}
Key               Value     Index
R.0.name          Igor      Row store
R.1.name          Dhruba    Row store
C.name.0          Igor      Column store
C.name.1          Dhruba    Column store
S.name.Dhruba.1             Search index
S.name.Igor.0               Search index
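The key layout on this slide can be sketched in a few lines of Python. This is only an illustration of the R/C/S key scheme shown in the example, assuming flat documents with top-level fields; the function name `converged_index_entries` is hypothetical and is not Rockset's implementation.

```python
def converged_index_entries(docs):
    """Map every top-level field of every document to row, column and search keys."""
    entries = {}
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            entries[f"R.{doc_id}.{field}"] = value          # row store: all fields of one doc are adjacent
            entries[f"C.{field}.{doc_id}"] = value          # column store: all values of one field are adjacent
            entries[f"S.{field}.{value}.{doc_id}"] = None   # search index: the key itself carries the answer
    return entries


docs = [{"name": "Igor"}, {"name": "Dhruba"}]
entries = converged_index_entries(docs)

# A search for name = 'Igor' becomes a prefix lookup over the sorted keys.
igor_keys = [key for key in sorted(entries) if key.startswith("S.name.Igor.")]
print(igor_keys)
```

Because each index is just a different key prefix over the same key-value store, one write per document can populate all three without a separate index definition.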
12. Query Optimizer
● Low latency for both highly selective queries and large scans
● Optimizer picks between
○ inverted index (Index Filter operator)
○ columnar format (Column Scan operator)
○ inverted index (Index Scan operator)
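As a rough illustration of the choice between an Index Filter and a Column Scan, the sketch below picks an operator from the estimated fraction of documents a filter matches. The decision rule, the function name `choose_access_path` and the 10% threshold are all assumptions made for this example; a real optimizer would weigh more than a single ratio.

```python
def choose_access_path(matching_docs, total_docs, selectivity_threshold=0.1):
    """Pick an operator name from the estimated fraction of matching documents.

    The threshold is a hypothetical tuning knob, not a documented setting.
    """
    if total_docs == 0:
        return "Column Scan"
    selectivity = matching_docs / total_docs
    # Few matches: jump straight to them through the inverted index.
    # Many matches: reading the column sequentially is cheaper.
    if selectivity <= selectivity_threshold:
        return "Index Filter"
    return "Column Scan"


print(choose_access_path(10, 1_000_000))
print(choose_access_path(900_000, 1_000_000))
```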
14. Complex Queries
● Support for expressive query language
● Ability to perform joins, aggregations, sorting, filtering, etc.
15. Read-Time JOINs
● Streams are most useful when joined with other data
[Diagram labels: Streaming event data; Other Data Sources (e.g. Amazon S3); Analytics backend; Query]
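A read-time join can be illustrated with a small, self-contained example. Here Python's built-in `sqlite3` module stands in for the analytics backend, and the table names `click_events` and `users` are hypothetical; the point is only that the event data is stored as it arrived and joined with the other source when the query runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE click_events (user_id INTEGER, page TEXT);  -- streaming event data
    CREATE TABLE users (user_id INTEGER, country TEXT);      -- other data source
    INSERT INTO click_events VALUES (1, '/home'), (2, '/cart'), (1, '/checkout');
    INSERT INTO users VALUES (1, 'US'), (2, 'DE');
""")

# The join happens at query time; nothing was denormalized at write time.
rows = conn.execute("""
    SELECT u.country, COUNT(*) AS clicks
    FROM click_events AS e
    JOIN users AS u ON u.user_id = e.user_id
    GROUP BY u.country
    ORDER BY u.country
""").fetchall()
print(rows)
```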
16. Flexibility with Data and Schema
● Allow values of different types in the same column
● Ability to ingest new data without needing data cleaning at write time
○ Avoid flattening or denormalization for performance reasons
● Type binding not done at write time (but done later at query time)
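The idea of mixed types in one column, with type binding deferred to query time, can be shown with SQLite, which also stores values dynamically typed. This is a stand-in for illustration only, not Rockset's type system; the table `events` and its `value` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (value)")  # no declared type: values are stored as given
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [(42,), ("forty-two",), (3.5,), (None,)],
)

# The type of each value is inspected at query time, not fixed at write time.
types = [row[0] for row in conn.execute("SELECT typeof(value) FROM events ORDER BY rowid")]

# The query decides which types it can work with.
total = conn.execute(
    "SELECT SUM(value) FROM events WHERE typeof(value) IN ('integer', 'real')"
).fetchone()[0]
print(types, total)
```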
20. Continuous Ingestion and Indexing from Kafka
● Fast ingestion
○ New data is visible in query results in seconds
○ Complex ETL processes can add minutes to hours before the data is available to query
● Live sync
○ Continuous sync of new data from Kafka
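Continuous sync can be pictured as a loop that remembers how far it has read and indexes only what is new. The sketch below simulates this with an in-memory list in place of a Kafka partition; the class `LiveSync` and its fields are hypothetical, and no Kafka client library is used.

```python
class LiveSync:
    """Tracks the next unread offset and indexes only messages past it."""

    def __init__(self):
        self.next_offset = 0  # position of the next unread message
        self.indexed = []     # stand-in for the queryable index

    def poll(self, partition_log):
        new_messages = partition_log[self.next_offset:]
        self.indexed.extend(new_messages)  # new data becomes queryable here
        self.next_offset += len(new_messages)
        return len(new_messages)


log = ["click:1", "click:2"]  # simulated partition contents
sync = LiveSync()
first = sync.poll(log)        # picks up the two existing messages
log.append("click:3")         # a new event arrives
second = sync.poll(log)       # picks up only the new message
print(first, second, sync.indexed)
```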
[Diagram label: live sync, within seconds]
21. Ingest Rollups
● SQL rollups and transformations
○ Pre-aggregate data at ingest time to increase performance and reduce size
○ Familiar SQL syntax
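A rollup of this kind can be sketched as a SQL aggregation applied to each incoming batch before it is stored. The example uses Python's built-in `sqlite3` as a stand-in; the tables `raw_batch` and `rollup`, their columns, and the one-minute bucket are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_batch (ts INTEGER, campaign TEXT, cost REAL);
    CREATE TABLE rollup (minute INTEGER, campaign TEXT, events INTEGER, total_cost REAL);
""")

# Four raw events, with timestamps in seconds.
raw = [(0, "a", 1.0), (30, "a", 2.0), (45, "b", 0.5), (70, "a", 4.0)]
conn.executemany("INSERT INTO raw_batch VALUES (?, ?, ?)", raw)

# Pre-aggregate per minute and campaign at ingest time; only the rollup is kept.
conn.execute("""
    INSERT INTO rollup
    SELECT (ts / 60) * 60 AS minute, campaign, COUNT(*), SUM(cost)
    FROM raw_batch
    GROUP BY minute, campaign
""")
rollup_rows = conn.execute("SELECT * FROM rollup ORDER BY minute, campaign").fetchall()
print(rollup_rows)
```

Four raw events collapse into three stored rows here; with real event volumes the reduction in size, and in the work each query has to do, is much larger.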