theScore is one of the leading sports news and media platforms in North America. With the recent growth of sports betting, theScore is diving in head-first. Provincial and state regulations often impose strict data-locality requirements, which can prohibit the use of centralized hosted data storage solutions.
In this talk, we'll learn how theScore built Datadex, a geographically distributed system for low-latency queries and real-time updates. We'll touch on the architecture of the system (Aggregator Leaf Tailer), the underlying technologies (RocksDB, Kafka, and the JVM), as well as some nitty-gritty tips for optimizing the JVM and RocksDB for high throughput.
Aggregator Leaf Tailer: Bringing Data to Your Users with Ultra Low Latency
1. Brought to you by
Aggregator Leaf Tailer: Bringing Data to Your Users with Ultra Low Latency
Jeffery Utter
Staff Developer at theScore
2. Jeffery Utter
Staff Developer, theScore
■ Built half of a distributed database
■ P99.*s matter to 100% of users
■ Wrote my first line of Java at age 30-something
■ I lead a double-life as a double-bassist
3. Table of Contents
■ What is Aggregator, Leaf, Tailer?
■ Goals and Constraints
■ Why not < insert your favorite (distributed) database >?
■ Datadex
■ Performance Tips
● Java
● RocksDB
■ Overview
● Architecture
● Future
■ Conclusion
5. Aggregator, Leaf, Tailer
■ The term started gaining use in 2019
■ Largely promoted by Rockset
● Rockset Concepts, Design & Architecture [1]
● Aggregator Leaf Tailer: An Alternative to Lambda Architecture for Real-Time Analytics [2]
● Rockset’s Aggregator-Leaf-Tailer Architecture for SQL on semi structured data [3]
■ Prior Art
● Facebook: Science and the Social Graph (2008) [4]
● Serving Facebook Multifeed: Efficiency, performance gains through redesign (2015) [5]
● FollowFeed: LinkedIn's Feed Made Faster and Smarter (2016) [6]
6. Aggregator, Leaf, Tailer
■ Aggregator: Low-latency aggregation of data stored in one or more leaves
■ Leaf: All data is stored and indexed in one or more leaves
■ Tailer: Pulls new data from various sources and inserts it into the leaves
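The three roles can be sketched in a few lines of Java. This is a minimal in-memory illustration, not Datadex's implementation: the class names, the hash-based sharding, and the prefix-sum query are all assumptions made for the example, and a real tailer would consume from a source such as Kafka rather than being called directly.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Leaf: stores and indexes one shard of the data (here, an in-memory map). */
final class Leaf {
    // Primary index: key -> value. A real leaf would use an embedded store.
    private final Map<String, Integer> byKey = new ConcurrentHashMap<>();

    void upsert(String key, int value) {
        byKey.put(key, value);
    }

    /** Local query: sums the values of all keys that start with the prefix. */
    long sumByPrefix(String prefix) {
        long sum = 0;
        for (Map.Entry<String, Integer> e : byKey.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                sum += e.getValue();
            }
        }
        return sum;
    }
}

/** Tailer: takes records from a source and routes each one to a leaf. */
final class Tailer {
    private final List<Leaf> leaves;

    Tailer(List<Leaf> leaves) {
        this.leaves = leaves;
    }

    void ingest(String key, int value) {
        // Hypothetical routing rule: hash the key to pick a shard.
        int shard = Math.floorMod(key.hashCode(), leaves.size());
        leaves.get(shard).upsert(key, value);
    }
}

/** Aggregator: fans a query out to every leaf and merges the partial results. */
final class Aggregator {
    private final List<Leaf> leaves;

    Aggregator(List<Leaf> leaves) {
        this.leaves = leaves;
    }

    long sumByPrefix(String prefix) {
        long total = 0;
        for (Leaf leaf : leaves) {
            total += leaf.sumByPrefix(prefix);
        }
        return total;
    }
}

final class AltSketch {
    static long demo() {
        List<Leaf> leaves = List.of(new Leaf(), new Leaf(), new Leaf());
        Tailer tailer = new Tailer(leaves);
        tailer.ingest("game:1:home", 3);
        tailer.ingest("game:1:away", 2);
        tailer.ingest("game:2:home", 7);
        tailer.ingest("game:1:home", 4); // same key, same shard: overwrites 3
        return new Aggregator(leaves).sumByPrefix("game:1:");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 6
    }
}
```

Because writes and reads go through separate classes, the ingest path and the query path can in principle be tuned or scaled separately, which is the property the pattern is usually promoted for.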
10. Traditional RDBMS (Postgres)
■ Duplication of effort to populate databases
■ Operational overhead - database setup, maintenance, scaling
11. Cloud NoSQL (MongoDB)
■ Implicit shared “schema”
■ Good scalability via hosted offerings
■ Operational overhead for on-prem
12. Kafka-native (Rockset/kSQL)
■ Not an exact match with our querying needs (kSQL has no secondary indexes)
■ Both seem geared towards analytic workflows
■ Operational overhead for on-prem
14. Aggregator, Leaf, Tailer
■ Aggregator: Low-latency aggregation of data stored in one or more leaves
■ Leaf: All data is stored and indexed in one or more leaves
■ Tailer: Pulls new data from various sources and inserts it into the leaves
17. Low Operational Complexity
■ Single codebase
■ Single deployable unit (for now)
■ Instances managed by Kubernetes Operator
■ Deploy / Release / Upgrade cycle similar to other backend applications
18. Developer Ergonomics
■ Simple configuration through CRD
■ Elixir Client library
■ Simple query “language”
■ “Watch” feature to stream updates to downstream services
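The slides do not show how the “Watch” feature is exposed, so the following is only a rough sketch of the general idea: a subscriber registers interest in a key prefix and receives every later update that matches it. `WatchableStore`, its prefix matching, and the synchronous callbacks are all hypothetical; a real system would stream updates to clients over the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

/** Hypothetical store that notifies watchers whenever a matching key changes. */
final class WatchableStore {
    private static final class Watch {
        final String prefix;
        final BiConsumer<String, String> callback;

        Watch(String prefix, BiConsumer<String, String> callback) {
            this.prefix = prefix;
            this.callback = callback;
        }
    }

    private final Map<String, String> data = new ConcurrentHashMap<>();
    // Copy-on-write keeps iteration safe while watches are added or removed.
    private final List<Watch> watches = new CopyOnWriteArrayList<>();

    /** Registers a watch and returns a handle that cancels it when run. */
    Runnable watch(String prefix, BiConsumer<String, String> callback) {
        Watch w = new Watch(prefix, callback);
        watches.add(w);
        return () -> watches.remove(w);
    }

    /** Stores the value, then pushes it to every watch whose prefix matches. */
    void put(String key, String value) {
        data.put(key, value);
        for (Watch w : watches) {
            if (key.startsWith(w.prefix)) {
                w.callback.accept(key, value);
            }
        }
    }
}

final class WatchSketch {
    static List<String> demo() {
        WatchableStore store = new WatchableStore();
        List<String> seen = new ArrayList<>();
        Runnable cancel = store.watch("odds:", (k, v) -> seen.add(k + "=" + v));
        store.put("odds:42", "1.90");
        store.put("score:42", "3-1");  // different prefix, not delivered
        store.put("odds:42", "2.05");
        cancel.run();
        store.put("odds:42", "2.10");  // after cancel, not delivered
        return seen;                   // [odds:42=1.90, odds:42=2.05]
    }
}
```

Pushing changes to subscribers means downstream services do not need to poll, which fits the low-latency goal stated earlier in the talk.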
19. Scalability
■ Fast scale-out through snapshot/backup/restore mechanism
■ Future improvements to independently scale Aggregator/Leaf/Tailer
22. Minimize Allocations
■ Re-use buffers for key serialization/deserialization
■ Re-use buffer for reading values - up to a certain size (fastGet)
Throughput (~30% increase)
          Ops/sec      Error
Before    3,053.990    ± 742.316
After     3,964.574    ± 240.020
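The buffer re-use described above might look roughly like the sketch below. It does not reproduce the `fastGet` named on the slide, whose details are not given. `ValueStore` is a hypothetical stand-in for a store that copies a value into a caller-supplied array and returns the full value length (RocksJava offers `get` overloads of a similar shape), and the 4 KiB threshold and the key layout are assumptions.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical store API: fills dest and returns the full value length, or -1. */
interface ValueStore {
    int read(ByteBuffer key, byte[] dest);
}

final class BufferReuse {
    // Assumed cut-off: values up to this size are read into the shared scratch buffer.
    static final int MAX_REUSED_VALUE = 4096;

    // One scratch buffer per thread, so no locking and no per-call allocation.
    private static final ThreadLocal<ByteBuffer> KEY_BUF =
            ThreadLocal.withInitial(() -> ByteBuffer.allocate(64));
    private static final ThreadLocal<byte[]> VALUE_BUF =
            ThreadLocal.withInitial(() -> new byte[MAX_REUSED_VALUE]);

    /** Serializes (tableId, id) into the calling thread's reusable key buffer. */
    static ByteBuffer encodeKey(int tableId, long id) {
        ByteBuffer buf = KEY_BUF.get();
        buf.clear();
        buf.putInt(tableId).putLong(id);
        buf.flip();
        return buf;
    }

    /** Reads a value assumed to start with an 8-byte big-endian long. */
    static long readLong(ValueStore store, int tableId, long id) {
        byte[] dest = VALUE_BUF.get();
        int len = store.read(encodeKey(tableId, id), dest);
        if (len < 0) {
            throw new IllegalArgumentException("missing key");
        }
        if (len > dest.length) {
            // Slow path: the value does not fit, so allocate just for this read.
            dest = new byte[len];
            store.read(encodeKey(tableId, id), dest);
        }
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (dest[i] & 0xFF); // decode without wrapper objects
        }
        return v;
    }

    static long demo() {
        Map<Long, byte[]> backing = new HashMap<>();
        backing.put(7L, ByteBuffer.allocate(8).putLong(12345L).array());
        ValueStore store = (key, dest) -> {
            byte[] v = backing.get(key.getLong(4)); // skip the 4-byte table id
            if (v == null) {
                return -1;
            }
            System.arraycopy(v, 0, dest, 0, Math.min(v.length, dest.length));
            return v.length;
        };
        return readLong(store, 1, 7L); // 12345
    }
}
```

The buffer returned by `encodeKey` is only valid until the same thread calls it again, so callers must not hold on to it. That restriction is the usual price of this kind of allocation-free path.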
25. Resources
1. Rockset Concepts, Design & Architecture:
https://rockset.com/Rockset_Concepts_Design_Architecture.pdf
2. Aggregator Leaf Tailer: An Alternative to Lambda Architecture for Real-Time Analytics:
https://rockset.com/blog/aggregator-leaf-tailer-an-architecture-for-live-analytics-on-event-streams/
3. Rockset’s Aggregator-Leaf-Tailer Architecture for SQL on semi structured data:
http://www.hpts.ws/papers/2019/RocksetHPTS19.pdf
4. Facebook: Science and the Social Graph:
https://www.infoq.com/presentations/Facebook-Software-Stack/ (about 53 minutes in)
5. Serving Facebook Multifeed: Efficiency, performance gains through redesign:
https://engineering.fb.com/2015/03/10/production-engineering/serving-facebook-multifeed-efficiency-performance-gains-through-redesign/
6. FollowFeed: LinkedIn's Feed Made Faster and Smarter:
https://engineering.linkedin.com/blog/2016/03/followfeed--linkedin-s-feed-made-faster-and-smarter
7. Realtime Indexing for Fast Queries on Massive Semi-Structured Data:
https://www.p99conf.io/session/realtime-indexing-for-fast-queries-on-massive-semi-structured-data/
26. Brought to you by
Jeffery Utter
jeff@jeffutter.com
@jeffutter