Many organizations struggle to balance traditional big data infrastructure with NoSQL databases. Other organizations do the smart thing and consolidate the two. This presentation explores Numberly’s experience migrating an intensive and join-hungry production workload from MongoDB and Hive to Scylla.
Join Alexys Jacob, CTO of Numberly, to learn how they joined billions of rows in seconds and dramatically reduced operational and development complexity by using a single database for their hybrid analytical use case.
As a bonus, Alexys will also cover benchmarks for Dask (a flexible parallel computing library for analytic computing) and Spark, highlighting their differences and lessons learned along the way.
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla
1. Joining Billions of Rows in Seconds: Replacing MongoDB and Hive with Scylla
Alexys Jacob - CTO, Numberly
2. Moderator - Peter Corless, ScyllaDB
Peter has a 29-year career in Silicon Valley that threads through stints at e2f, Aerospike, Cisco and Apple. He is passionate about technology, customer success, engendering community, and social media. In his off hours he enjoys playing 4X strategy games.
Twitter: @petercorless
3. About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
+ Learn more at scylladb.com
5. whoami
@ultrabug
1 Eiffel Tower
2 Soccer World Cups
15 Years in the Data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly - living in Paris, France
6. Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)
7. Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all data sources and destinations.
➔ For this we use ID matching tables.
8. ID matching tables
JOIN
1. SELECT reference population
2. JOIN with the ID matching table
3. MATCHED population is usable by partner
Queried AND updated all the time!
➔ High read AND write workload
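The three steps can be sketched in plain Python. This is only an in-memory illustration: the email addresses, cookie IDs, the `id_matching` mapping and the `match_population` function are all made up for the example and say nothing about how the production tables are laid out.

```python
# Hypothetical in-memory stand-ins; the real data lives in database tables.
reference_population = ["a@example.com", "b@example.com", "c@example.com"]

# ID matching table: source identifier -> partner cookie ID (made-up values).
id_matching = {
    "a@example.com": 123,
    "b@example.com": 297,
}


def match_population(population, matching_table):
    """Return (source_id, partner_id) pairs for every ID that has a match."""
    return [
        (source_id, matching_table[source_id])
        for source_id in population
        if source_id in matching_table
    ]


# 1. SELECT the reference population, 2. JOIN with the ID matching table,
# 3. the MATCHED population is what the partner can use.
matched = match_population(reference_population, id_matching)
print(matched)
```

Identifiers without an entry in the matching table (here `c@example.com`) simply drop out of the matched population.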
9. Real life example: retargeting
From a database (email) to a web banner (cookie)
[Diagram in three stages: SELECT, MATCH, ACTIVATE. SELECT: previous donors (generous@coconut.fr, isupportu@lab.com, wiki4ever@wp.eu, openinternet@free.fr). MATCH: ID matching table (cookie id = 123, cookie id = 297, ?, cookie id = 896). ACTIVATE: ad exchanges (AppNexus, ..., Google), user cookie id 123, https://kitty.eu]
13. Future implementation using Scylla?
[Diagram labels: Events, Message queues, Real time Programs, Batch Calculation, Scylla, Batch pipeline, Real time pipeline]
14. Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ Compete with our production? Scylla is in!
15. Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated with the given partner ID over the last N months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions are also to be answered!
➔ Denormalization
➔ Prototype with your language of choice!
16. Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the sstable!
▪ Change “date” ordering to DESC
➔ Latest value at the top of the sstable
➔ Reduced read latency!
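As a rough illustration of the tip, the CQL below is held in Python strings so it can be passed to a driver session. The keyspace, table and column names (`matching.partner_to_cookie`, `partner_id`, `match_date`, `cookie_id`) are assumptions made for this sketch, not the schema used in the talk; `match_date` stands in for the slide's “date” column.

```python
# Hypothetical keyspace, table and column names, for illustration only.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS matching.partner_to_cookie (
    partner_id text,
    match_date timestamp,
    cookie_id text,
    PRIMARY KEY ((partner_id), match_date, cookie_id)
) WITH CLUSTERING ORDER BY (match_date DESC, cookie_id ASC)
"""

# Rows inside a partition are returned in clustering order. With DESC on
# match_date the newest row comes first, so "last cookie ID for this
# partner ID" becomes a LIMIT 1 read. The "?" is a prepared-statement marker.
LAST_COOKIE = """
SELECT cookie_id, match_date
FROM matching.partner_to_cookie
WHERE partner_id = ?
LIMIT 1
"""
```

The reverse question (last partner ID for a given cookie ID) would need its own denormalized table keyed by cookie ID, in line with the denormalization point on the previous slide.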
17. scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)
18. Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!
21. Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!
24. Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per second
▪ spark.cassandra.input.reads_per_sec=6666
25. Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/
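One way to keep the settings from the two tuning slides together is a small Python helper that renders them as `--conf key=value` arguments for `spark-submit`. The keys and values are copied from the slides and were tuned for that particular cluster and workload; the helper name `spark_submit_args` is made up for this sketch.

```python
# Settings copied from the two tuning slides; the values are workload-specific.
SPARK_SETTINGS = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "30",
    "spark.cassandra.input.split.size_in_mb": "1",
    "spark.cassandra.input.reads_per_sec": "6666",
    "spark.cassandra.connection.connections_per_executor_max": "100",
    "spark.cassandra.connection.timeout_ms": "150000",
    "spark.cassandra.read.timeout_ms": "150000",
}


def spark_submit_args(settings):
    """Render settings as repeated `--conf key=value` arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(spark_submit_args(SPARK_SETTINGS)))
```

The timeout values should be checked against the server-side timeouts in scylla.yaml, as the slide points out, rather than reused blindly.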
27. OK for Scala, what about Python?
No joinWithCassandraTable when using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row lookup the ID matching table from Scylla
3. Count the resulting number of matches
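The shape of that plain-Python approach can be sketched as follows. The `lookup` callable stands in for a query against the ID matching table in Scylla, and the population list stands in for the rows loaded from Hive; both, like the sample IDs, are placeholders invented for the example.

```python
def count_matches(rows, lookup):
    """Count rows whose ID has at least one entry in the ID matching table.

    `lookup` is any callable returning the matches for one ID. In a real
    setup it would query Scylla; here it is a stand-in.
    """
    return sum(1 for row_id in rows if lookup(row_id))


# Stand-ins: a tiny population and a dict in place of the Scylla table.
population = ["id-1", "id-2", "id-3", "id-4"]
fake_matching_table = {"id-1": ["cookie-a"], "id-3": ["cookie-b", "cookie-c"]}

total = count_matches(population, lambda row_id: fake_matching_table.get(row_id, []))
print(total)  # 2
```

Done one row at a time this loop is bound by per-query round trips, which is why issuing the lookups concurrently matters at the 10M-row scale.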
32. Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files:
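For the first three tips, a minimal sketch with the Python driver (the `cassandra-driver` package, which also works against Scylla) might look like the code below. The function names, the CQL passed in by the caller and the `concurrency=512` value are assumptions for illustration; the right concurrency depends on the cluster and should be found by testing. Driver imports are kept inside the functions so the counting helper can be tried without a cluster.

```python
def connect(contact_points, keyspace):
    """Open a session using the libev event loop instead of asyncore.

    Requires the driver's libev extension to be installed.
    """
    from cassandra.cluster import Cluster
    from cassandra.io.libevreactor import LibevConnection

    cluster = Cluster(contact_points, connection_class=LibevConnection)
    return cluster.connect(keyspace)


def count_results_with_rows(results):
    """Count successful lookups that returned at least one row."""
    return sum(
        1
        for success, rows in results
        if success and next(iter(rows), None) is not None
    )


def count_matches(session, select_cql, ids, concurrency=512):
    """Run one prepared lookup per ID concurrently and count the hits."""
    from cassandra.concurrent import execute_concurrent_with_args

    statement = session.prepare(select_cql)
    results = execute_concurrent_with_args(
        session,
        statement,
        [(one_id,) for one_id in ids],
        concurrency=concurrency,  # illustrative value; the default is 100
        raise_on_first_error=False,
    )
    return count_results_with_rows(results)
```

With `raise_on_first_error=False` each result is a `(success, result_or_exception)` pair, so failed lookups are skipped by the counter instead of aborting the whole run.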