
Bullet - Open Source Real-Time Data Query Engine, Michael Natkovich, Director Software Dev Engineering & Nate Speidel, Software Engineer, Oath


Bullet (https://github.com/bullet-db) is an open-sourced, lightweight, scalable, pluggable, multi-tenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet queries look forward in time - they are submitted first and operate on data flowing through the system from the point of submission and can run forever. Bullet addresses the challenges of supporting intractable Big Data aggregations like Top K, Counting Distincts, and Windowing efficiently without having a storage layer using Sketch-based algorithms.

Published in: Technology


  1. A REAL TIME DATA QUERY ENGINE Michael Natkovich & Nate Speidel
  2. Allow Myself to Introduce . . . Myself ■ Nate Speidel ● nspeidel@oath.com ● Software Engineer ● 2+ years of solving data problems at Yahoo
  3. Allow Myself to Introduce . . . Myself ■ Michael Natkovich ● mln@oath.com ● Director Engineer ● 10+ years of causing data problems at Yahoo
  4. Motivation: Cycle of Sadness ■ Instrumentation validation is unbearably slow ● Needs to be seconds not hours ● Needs to be easy to query ● Needs programmatic access
  5. Typical Query Engine [Diagram labels: Data Flow, Persistence, Queries]
  6. Look Forward Query Engine [Diagram labels: Data Flow, Query Engine, Current Queryable Data, Future Queryable Data, Old Un-Queryable Data, Query Results]
  7. Typical Streaming Query Cost [Diagram: Storm Query 1, Storm Query 2, Storm Query 3, Spark Query 1] ■ Input: 2MM events/sec ■ Throughput: 1K events/sec/core ■ Resources: 2K cores/query ■ Total: 8K cores
  8. Bullet Query Cost [Diagram: Bullet Query 1, Bullet Query 2, Bullet Query 3, Bullet Query 4] ■ Input: 2MM events/sec ■ Throughput: 1K events/sec/core ■ Resources: 2K cores ■ Total: 2K cores
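The totals on slides 7 and 8 follow from simple arithmetic, sketched below. It assumes that each conventional streaming query makes its own full pass over the stream, while Bullet reads the stream once for all queries. The function name is hypothetical; the rates are the ones quoted on the slides.

```python
import math


def cores_to_read_stream(input_events_per_sec: int, events_per_sec_per_core: int) -> int:
    """Cores needed for one full pass over the stream (hypothetical helper)."""
    return math.ceil(input_events_per_sec / events_per_sec_per_core)


INPUT_RATE = 2_000_000  # 2MM events/sec, from the slides
PER_CORE = 1_000        # 1K events/sec/core, from the slides
NUM_QUERIES = 4         # four queries are shown on each slide

per_pass = cores_to_read_stream(INPUT_RATE, PER_CORE)

# Assumption: every conventional query re-reads the whole stream.
typical_total = per_pass * NUM_QUERIES

# Assumption: Bullet reads the stream once, regardless of query count.
bullet_total = per_pass

print(per_pass, typical_total, bullet_total)  # 2000 8000 2000
```

Under these assumptions the conventional cost grows with the number of queries, while the Bullet cost depends only on the input rate.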
  9. Bullet ■ Retrieves data that arrives after query submission ● Look Forward! ■ No persistence layer ■ Light-weight, fast, and scalable ■ UI for Ad-Hoc queries ■ API for programmatic querying ■ Pluggable interface to integrate with streaming data
  10. What It’s For ■ Single stream, multiple consumers ■ Ad hoc interactive usage ■ Programmatic short-lived queries
  11. What It’s Not For ■ Repeatable queries ■ Currently no joins ■ Not meant for ETL
  12. Querying in Bullet ■ Supports filtering, logical operators on typed data ■ Supports aggregations ● Group By, Count Distincts, Top K, Distributions ● DataSketches based ■ Queries have life spans ● All queries run for a specified duration (or infinitely) ■ Results are Windowed ● Windows can be time or record based ● Raw record or aggregate based
  13. Streaming Aggregations ■ Motivation ● Calculating cardinality ● Getting live latency distributions ● Validate experimentation bucket sizes ■ Aggregations are Hard ● Data skew ● Intermediate results are large and expensive to move ● The longer you run, the more memory you need ● Incremental results can’t be combined
  14. Count Distinct: Naive (problem: overwhelms the single combiner) ■ Steps: 1) Read Input 2) Round Robin 3) Extract Field 4) Send to Combiner 5) Count Distincts
  15. Count Distinct: Typical (problem: vulnerable to data skew) ■ Steps: 1) Read Input 2) Round Robin 3) Extract Field 4) Hash Partition 5) Count Distincts 6) Send Count 7) Combine Counts
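The "typical" approach on slide 15 can be sketched as follows. Hash partitioning sends every occurrence of a value to the same worker, so the per-worker distinct counts can simply be added. The sample data, the worker count, and the use of `zlib.crc32` as the partitioning hash are all assumptions made for illustration; this is not Bullet code.

```python
import zlib
from collections import Counter


def partition_of(value: str, num_workers: int) -> int:
    # crc32 is only a deterministic stand-in for a partitioning hash.
    return zlib.crc32(value.encode("utf-8")) % num_workers


def count_distinct_hash_partitioned(values, num_workers=4):
    """Return (distinct count, records handled per worker)."""
    seen = [set() for _ in range(num_workers)]  # exact per-worker state
    load = Counter()
    for v in values:
        p = partition_of(v, num_workers)
        seen[p].add(v)
        load[p] += 1
    # A value only ever lands on one worker, so the counts are additive.
    return sum(len(s) for s in seen), load


# Hypothetical skewed input: one hot value dominates the stream.
values = [f"user{i}" for i in range(1000)] + ["hot_user"] * 9000
total, load = count_distinct_hash_partitioned(values)
print(total, dict(load))
```

The distinct count is exact, but every record for the hot value is routed to a single worker, which is the data skew problem the slide names. Each worker's set also grows with the number of distinct values it sees.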
  16. Count Distinct: Sketches ■ Steps: 1) Read Input 2) Round Robin 3) Build Sketch 4) Send to Combiner 5) Merge Sketches
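The sketch-based pipeline on slide 16 can be illustrated with a K-Minimum-Values (KMV) sketch. This is a deliberately simplified stand-in for the Theta Sketch named on slide 19, not the DataSketches implementation. The class name, the default `k`, the SHA-1 hash, and the sample data are assumptions for illustration.

```python
import hashlib
import heapq
from functools import reduce


class KMVSketch:
    """Keeps only the k smallest 64-bit hashes seen (simplified, illustrative)."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest kept
        self._members = set()  # the same hashes, for duplicate detection

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, value: str) -> None:
        h = self._hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the smaller new one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        merged = KMVSketch(min(self.k, other.k))
        smallest = sorted(self._members | other._members)[: merged.k]
        merged._members = set(smallest)
        merged._heap = [-h for h in smallest]
        heapq.heapify(merged._heap)
        return merged

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # below capacity: exact count
        kth_smallest = -self._heap[0] / 2**64  # scaled into (0, 1)
        return (self.k - 1) / kth_smallest


# Steps 1-2: read input and round robin it across workers.
events = [f"user{i % 20000}" for i in range(60000)]  # 20,000 distinct users
NUM_WORKERS = 3
workers = [KMVSketch() for _ in range(NUM_WORKERS)]
for i, user in enumerate(events):
    workers[i % NUM_WORKERS].update(user)  # step 3: build a sketch per worker

# Steps 4-5: send the fixed-size sketches to a combiner and merge them.
combined = reduce(lambda a, b: a.merge(b), workers)
print(round(combined.estimate()))
```

With this input every worker sees every user, yet the merged sketch does not over count, because duplicate hashes collapse in the union. Each worker ships at most `k` hashes to the combiner regardless of how many records it processed.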
  17. Data Sketches ■ Sketches are a class of stochastic streaming algorithms ■ Provide approximate results (if data is too large) ■ Provable error bounds ■ Fixed memory footprint ■ Mergeable, allowing for parallel processing
  18. Data Sketches in Streams ■ Accurate to a Point ● Sketches sized correctly will be 100% accurate ● Error rate is inversely proportional to size of a Sketch ■ Fixed Memory Ceiling ● Maximum Sketch size is configured in advance ● Memory cost of a query is thus known in advance ■ Allows Non-additive Operations to be Additive ● Sketches can be merged into a single Sketch without over counting ● Allows tasks to be parallelized and cheaply combined later ● Allows results to be combined across windows of execution
  19. Bullet’s Use of Data Sketches (Data Sketch: Query Type) ■ Theta Sketch: Count Distinct ■ Tuple Sketch: Group By ■ Quantile Sketch: Distributions ■ Frequent Items Sketch: Top K
  20. Windowing ■ A way of breaking up an endless stream into digestible components ■ Typically broken using time or records ■ Needed for incremental results ■ A window is the unit of incrementation
  21. Windowing ■ Tumbling Windows* ● Contiguous non-overlapping windows at regular intervals ■ Hopping Windows ● Contiguous (possibly) overlapping windows at regular intervals ■ Sliding Windows* ● Event based windows looking back at regular event intervals ■ Cascading Windows ● Sliding windows that reset at regular intervals too ■ Session Windows ● Sliding windows that reset if distance between events is exceeded
  22. Why Windowing ■ Example: Number of distinct users in the next 60 seconds ■ Option 1: Wait 60 secs to get results ● No feedback :( ■ Option 2: Every 5 secs, get current state until end ● Continuous feedback with same final results ● Stop queries early (sufficient information gleaned, query bad, etc.) ● Quickly iterate queries
  23. Tumbling Window (10 second window) [Figure: records 1 to 9 on a timeline from 0 to 30, grouped into windows {1 2 3 4 5}, {6 7}, {8 9}]
  24. Tumbling Window (3 record window) [Figure: records 1 to 9 on a timeline from 0 to 30, grouped into windows {1 2 3}, {4 5 6}, {7 8 9}]
  25. Sliding Window (3 record window, 1 record slide) [Figure: records 1 to 5 on a timeline from 0 to 10, with successive windows {1}, {1 2}, {1 2 3}, {2 3 4}, {3 4 5}]
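The window groupings on slides 23 to 25 can be reproduced with a few generator functions. This is a minimal sketch of the windowing definitions, not Bullet's implementation. The function names are hypothetical, and the timestamps used for the time-based window are invented so that the grouping matches slide 23.

```python
from collections import deque
from itertools import groupby


def tumbling_record_windows(records, size):
    """Contiguous, non-overlapping windows of `size` records."""
    window = []
    for record in records:
        window.append(record)
        if len(window) == size:
            yield window
            window = []
    if window:
        yield window  # trailing partial window


def sliding_record_windows(records, size, slide=1):
    """Emit the last `size` records after every `slide` records."""
    window = deque(maxlen=size)
    for count, record in enumerate(records, start=1):
        window.append(record)
        if count % slide == 0:
            yield list(window)


def tumbling_time_windows(events, width):
    """Group (timestamp, record) pairs, ordered by time, into windows of
    `width` time units. Windows with no events are skipped."""
    for _, group in groupby(events, key=lambda e: e[0] // width):
        yield [record for _, record in group]


print(list(tumbling_record_windows(range(1, 10), 3)))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(list(sliding_record_windows(range(1, 6), 3, 1)))
# [[1], [1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]

# Hypothetical timestamps in seconds, chosen to match the slide's grouping.
timed = [(1, 1), (3, 2), (5, 3), (7, 4), (9, 5), (12, 6), (17, 7), (22, 8), (27, 9)]
print(list(tumbling_time_windows(timed, 10)))
# [[1, 2, 3, 4, 5], [6, 7], [8, 9]]
```

Record-based windows always hold a fixed number of records, while time-based windows hold however many records arrived in the interval.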
  26. [Architecture diagram. Data sources: Performance Stats, Sensor Data, User Activity, IoT Data. Components: Bullet Data Stream, Bullet WS, Request Processor, Data Processor, Combiner. Edge labels: Query & ID, Data Records, Matching Events & ID, Results, Query Results]
  27. Core Design Principles ■ No persistence ● Tradeoff: Query Speed, Low Storage Cost > Repeatability ■ Scale for data and queries ● Each query cost is fixed and negligible, relative to data ingestion ■ Pluggable everything ● Run on top of any stream processor (Spark, Storm, etc.) ● Read from any data source (Kafka, Kinesis, etc.) ● Choose an implementation of the PubSub (Kafka, REST, etc.) ■ Tune everything ● Example: Sketch size vs Sketch accuracy
  28. Overall Architecture
  29. Backend Layer Detailed Architecture: Storm
  30. Backend Layer Detailed Architecture: Spark
  31. Performance: Linearly Scales for Data
  32. Performance: Linearly Scales for Queries
  33. Demos ■ Bullet Reddit ● https://youtu.be/p6rOy9F7K8U ■ Bullet Finance ● https://youtu.be/RMMT4Phdhr8
  34. In Summary ■ Bullet is a lightweight and cheap stream query engine ■ It offers raw record and OLAP style queries ■ Leverages the power of Data Sketches ■ Only need enough hardware to read data ● Queries are basically free! ■ Abstraction layer that can sit on any Stream Framework ● Implementations available for Storm and Spark ■ Pluggable, allowing for consumption from any data source ■ Fully open sourced!!
  35. Future Work ■ BQL: SQL-like interface support (already supported in WS) ■ More stream processor support (Flink) ■ All the Windows! ■ More aggregations (Group By Count Distinct)
  36. Links ■ Documentation: https://bullet-db.github.io/ ■ Github: https://github.com/bullet-db ■ Contact Us ● Developers: bullet-dev@googlegroups.com ● Users: bullet-users@googlegroups.com ■ Data Sketches: https://datasketches.github.io/ ■ Reddit API: https://www.reddit.com/dev/api/
  37. QUESTIONS
