Sub-Second SQL Search, Aggregations and Joins with Kafka and Rockset | Dhruba Borthakur, Rockset


We often need to build applications that analyze Kafka data to unlock the most value from event streams, so how can organizations build these real-time analytics applications? In this talk, we examine an indexing approach that enables fast SQL analytics on data from Kafka, without data flattening or denormalization. Rockset is the real-time indexing database that builds an inverted index, a columnar index and a row index on all fields of your Kafka messages, including nested fields and arrays. This Converged Index accelerates various types of analytic queries–search, aggregations and joins–without the need to denormalize or transform data for performance reasons. With indexing delivering significant gains in query performance, we also need to index new data in a timely manner. We discuss several strategies used for efficient ingestion and indexing from Kafka, including rollups, write optimizations on the underlying RocksDB storage engine, and the disaggregation of ingest and query compute.


Sub-Second SQL Search, Aggregations and Joins with Kafka and Rockset
Dhruba Borthakur / CTO, Rockset

Presenter
Dhruba Borthakur
Co-Founder & CTO

Unlocking Value from Event Streams
Event Streams
● Online advertising
● Web clicks
● Online gaming interactions
● Online purchases and bookings
● Financial transactions
● IoT - sensor data
Applications
● Real-time customer 360
● Real-time personalization
● Logistics tracking
● Security analytics
● Operational analytics

The Need for Real-Time Analytics
Past: Event streams, Data lake, ETL, Data warehouse, Offline reporting
Present: Event streams, Data lake, ETL, Data warehouse, Offline reporting, plus Real-time database and Real-time data applications

Apache Kafka and Real-Time Analytics
● Apache Kafka is a foundational platform for real-time analytics
○ Central location for collecting event data and making it available in real time
○ Low latency and high write throughput
○ Queue: First-in, first-out
Source: https://kafka.apache.org/powered-by

Rockset and Real-Time Analytics
Real-time indexing database for modern data applications, at massive scale, without operational overhead

How Kafka and Rockset Work Together
Diagram labels: Events from apps, devices, sensors; KSQL; Enrichment; OLTP database or data lake; SQL, REST; Real-time analytics applications

An indexing database to serve queries from Kafka data

Query Latency
● Ad-hoc queries and drilldowns in real-time
● Millisecond-latency queries to support live dashboards and data APIs
● How to achieve low-latency queries?

Optimize Query Latency by Indexing
Traditional approach: Parallelize and scan
● event data → MapReduce → reports
● Column store
Real-time analytics: Parallelize and index
● event data → Converged Indexing → ad hoc analytics
● Column, Inverted and Row store

Converged Index
● All fields are indexed in inverted, columnar and row indexes
● Accelerates search, aggregation and join queries
● No index definition required

<doc 0>
{
"name": "Igor"
}
<doc 1>
{
"name": "Dhruba"
}

Key               Value     Index
R.0.name          Igor      Row store
R.1.name          Dhruba    Row store
C.name.0          Igor      Column store
C.name.1          Dhruba    Column store
S.name.Dhruba.1             Search index
S.name.Igor.0               Search index
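The key layout on this slide can be sketched in a few lines of Python. This is only an illustration of the R./C./S. key scheme shown above, assuming flat documents whose field names and values contain no dots; the function names `converged_index_entries` and `search` are hypothetical, and a sorted list stands in for the underlying key-value store.

```python
def converged_index_entries(docs):
    """Build (key, value) pairs for the row, column and search indexes."""
    entries = []
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            entries.append((f"R.{doc_id}.{field}", value))          # row store
            entries.append((f"C.{field}.{doc_id}", value))          # column store
            entries.append((f"S.{field}.{value}.{doc_id}", None))   # search index
    # Keep entries sorted by key, as an ordered key-value store would.
    return sorted(entries, key=lambda entry: entry[0])


def search(entries, field, value):
    """Find matching document ids by scanning keys with the S. prefix."""
    prefix = f"S.{field}.{value}."
    return [int(key[len(prefix):]) for key, _ in entries if key.startswith(prefix)]


docs = [{"name": "Igor"}, {"name": "Dhruba"}]
entries = converged_index_entries(docs)
for key, value in entries:
    print(key, "" if value is None else value)
print(search(entries, "name", "Igor"))
```

Because every field is written three ways, a point lookup can use the S. keys, a scan over one field can use the C. keys, and fetching a whole document can use the R. keys.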

Query Optimizer
● Low latency for both highly selective queries and large scans
● Optimizer picks between
○ inverted index (Index Filter operator)
○ columnar format (Column Scan operator)
○ inverted index (Index Scan operator)
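One way to picture the choice between an index lookup and a scan is a selectivity rule. The sketch below is an assumption made for illustration, not the optimizer described in the talk: the 10% threshold, the row-count estimates and the function name `choose_access_path` are all hypothetical, and the Index Scan operator is not modeled.

```python
def choose_access_path(matching_rows, total_rows, threshold=0.1):
    """Pick an access path from an estimated selectivity (hypothetical rule)."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = matching_rows / total_rows
    # Few matching rows: look them up through the inverted index.
    # Many matching rows: scanning the columnar format is cheaper.
    return "index_filter" if selectivity <= threshold else "column_scan"


print(choose_access_path(50, 1_000_000))       # highly selective query
print(choose_access_path(800_000, 1_000_000))  # large scan
```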

Complex Queries
● Support for expressive query
language
● Ability to perform joins,
aggregations, sorting, filtering, etc.
14

Read-Time JOINs
● Streams are most useful when joined with other data
Diagram labels: Streaming event data; Other Data Sources (e.g. Amazon S3); Analytics backend; Query
Flexibility with Data and Schema
● Allow values of different types in the same column
● Ability to ingest new data without needing data cleaning at write time
○ Avoid flattening or denormalization for performance reasons
● Type binding not done at write time (but done later at query time)
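The idea of mixed types in one column, with type binding deferred to query time, can be illustrated as follows. This is a minimal sketch under assumptions: the `zip` field, the type names and the helper functions are hypothetical and do not reflect the product's actual type system.

```python
from collections import Counter


def type_name(value):
    """Name the runtime type of a stored value (hypothetical type names)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def count_by_type(docs, field):
    """Report which types occur in one field across all documents."""
    return Counter(type_name(doc.get(field)) for doc in docs)


def select_where_int_equals(docs, field, target):
    """Typed comparison at query time: only int values can match."""
    return [
        doc for doc in docs
        if type_name(doc.get(field)) == "int" and doc[field] == target
    ]


# Ingested as-is: the same field holds an int, a string, a null and an array.
docs = [{"zip": 94107}, {"zip": "94107"}, {"zip": None}, {"zip": [94107, 94110]}]
print(count_by_type(docs, "zip"))
print(select_where_int_equals(docs, "zip", 94107))
```

Nothing is rejected or cleaned when the documents are written; the query decides which values are comparable.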

Disaggregated, Cloud-Native Architecture
Aggregator Leaf Tailer Architecture

Scaling Query Compute
Aggregator Leaf Tailer Architecture

Continuous Ingestion and Indexing from Kafka
● Fast ingestion
○ New data is visible in query results in seconds
○ Complex ETL processes can add minutes to hours before the data is available to query
● Live sync
○ Continuous sync of new data from Kafka
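Continuous sync can be pictured as a tailer that remembers its offset and indexes only what arrived since the last poll. The sketch below is a simplified stand-in, not a Kafka client: a Python list plays the role of one append-only partition, and the `Tailer` class and its attributes are hypothetical.

```python
class Tailer:
    """Tails an append-only log and makes new messages queryable."""

    def __init__(self, log):
        self.log = log       # stand-in for a Kafka partition
        self.offset = 0      # next position to read from
        self.indexed = []    # stand-in for the queryable index

    def poll(self):
        """Index every message appended since the previous poll."""
        new_messages = self.log[self.offset:]
        self.indexed.extend(new_messages)
        self.offset += len(new_messages)
        return len(new_messages)


log = ["event-a", "event-b"]
tailer = Tailer(log)
print(tailer.poll())      # picks up the two existing messages
log.append("event-c")     # a producer writes a new message
print(tailer.poll())      # only the new message is indexed
print(tailer.indexed)
```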

Ingest Rollups
● SQL rollups and transformations
○ Pre-aggregate data at ingest time to increase performance and reduce size
○ Familiar SQL syntax
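Pre-aggregation at ingest time can be sketched as grouping raw events into time buckets before they are stored. The example below is an illustration under assumptions: the event fields (`ts`, `country`, `amount`), the 60-second bucket and the `rollup` function are hypothetical, and the logic mirrors what a SQL `GROUP BY` with `COUNT` and `SUM` would express.

```python
from collections import defaultdict


def rollup(events, bucket_seconds=60):
    """Aggregate events per (time bucket, country) instead of storing each one."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for event in events:
        # Truncate the timestamp to the start of its bucket.
        bucket = event["ts"] - event["ts"] % bucket_seconds
        key = (bucket, event["country"])
        totals[key]["count"] += 1
        totals[key]["amount"] += event["amount"]
    return dict(totals)


events = [
    {"ts": 5, "country": "US", "amount": 10.0},
    {"ts": 42, "country": "US", "amount": 5.0},
    {"ts": 61, "country": "US", "amount": 1.0},
    {"ts": 70, "country": "DE", "amount": 2.5},
]
rolled = rollup(events)
for key in sorted(rolled):
    print(key, rolled[key])
```

Four raw events become three stored rows here; with high-volume streams the reduction in size is what makes queries over the rolled-up data faster.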

Learn More
Request demo or get started for free at rockset.com
Reach out at dhruba@rockset.com

Thank you
Dhruba Borthakur / dhruba@rockset.com

Strong Dynamic Typing
● Fields are dynamically typed
● Queries are strongly typed
● Smart schemas
