At NMC (Nielsen Marketing Cloud) we need to present to our clients the number of unique users who meet given criteria. The criteria are typically a set-theoretic expression over a stream of events for a given time range. Historically, we used Elasticsearch to answer these types of questions; however, we encountered major scaling issues. In this presentation we will detail the journey of researching, benchmarking and productionizing a new technology, Druid with DataSketches, to overcome the limitations we were facing.
2. Introduction
Yakir Buskilla
● Software Architect
● Focusing on Big Data and Machine Learning problems
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years
3. Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 2 years ago
● A leader in the Ad Tech and Marketing Tech industry
● What do we do?
○ Data as a Service (DaaS)
○ Software as a Service (SaaS)
5. The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find the number of distinct elements, in real time, in a data stream that may contain repeated elements
8. Possible solutions
● Naive - store everything
● Bit vector - store only 1 bit per device (see the back-of-the-envelope calculation below)
○ 10B devices → 1.25 GB/day
○ 10B devices × 80K attributes → 100 TB/day
● Approximate
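How the bit-vector numbers above come about, as a quick sanity check (the 10B-devices and 80K-attributes figures are from the slide; the script itself is ours):

# Back-of-the-envelope sizing for the bit-vector approach
devices = 10_000_000_000                # 10B devices, 1 bit each
attributes = 80_000                     # 80K attributes

bytes_per_attribute = devices / 8       # bits -> bytes
print(bytes_per_attribute / 1e9)        # ~1.25 GB/day for a single attribute
print(bytes_per_attribute * attributes / 1e12)  # ~100 TB/day for all attributes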
9. Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data took 10 hours to index
■ Indexing affected query time
○ Querying
■ Low concurrency
■ Scans on all the shards of the corresponding index
10. What we tried
● Preprocessing
● Statistical algorithms (e.g., HyperLogLog)
11. ThetaSketch
● K Minimum Values (KMV)
○ Estimate set cardinality
○ Supports set-theoretic operations
[Venn diagram: sets X and Y]
● ThetaSketch mathematical framework - a generalization of KMV
[Venn diagram: sets X and Y, labeled "ThetaSketch"]
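To make the KMV/ThetaSketch idea concrete, here is a minimal sketch using the Apache DataSketches Python bindings (pip install datasketches); the device IDs, segment names and lg_k value are illustrative, not from the talk:

from datasketches import update_theta_sketch, theta_union, theta_intersection

# One sketch per attribute, built by streaming device IDs through it
porsche_intent = update_theta_sketch(12)      # lg_k=12 -> k=4096 retained hashes
us_region = update_theta_sketch(12)
for device_id in ("dev-1", "dev-2", "dev-3"):
    porsche_intent.update(device_id)
for device_id in ("dev-2", "dev-3", "dev-4"):
    us_region.update(device_id)

# Cardinality estimate for a single set
print(porsche_intent.get_estimate())          # 3.0 (exact at this small size)

# Set-theoretic operations work on the sketches, not the raw data
union = theta_union(12)
union.update(porsche_intent)
union.update(us_region)
print(union.get_result().get_estimate())      # |X ∪ Y| -> 4.0

intersection = theta_intersection()
intersection.update(porsche_intent)
intersection.update(us_region)
print(intersection.get_result().get_estimate())  # |X ∩ Y| -> 2.0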
20. Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases

Old data model:
Timestamp | Attribute | Count Distinct
2016-11-15 | Porsche Intent | XXXXXX
2016-11-15 | US | XXXXXX
... | ... | ...

New data model:
Timestamp | Attribute | Region | Count Distinct
2016-11-15 | Porsche Intent | US | XXXXXX
... | ... | ... | ...
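With the new model, "unique Porsche intenders in the US" no longer requires intersecting two sketches; it becomes a plain filtered query. A minimal sketch of such a Druid query (datasource, dimension and metric names are illustrative; 8082 is the default Broker port):

import requests

query = {
    "queryType": "timeseries",
    "dataSource": "segments_by_region",        # the remodeled datasource
    "granularity": "day",
    "intervals": ["2016-11-15/2016-11-16"],
    "filter": {"type": "and", "fields": [
        {"type": "selector", "dimension": "attribute", "value": "Porsche Intent"},
        {"type": "selector", "dimension": "region", "value": "US"},
    ]},
    # device_sketch is a thetaSketch metric built at ingestion time
    "aggregations": [
        {"type": "thetaSketch", "name": "unique_devices", "fieldName": "device_sketch"}
    ],
}
print(requests.post("http://broker:8082/druid/v2", json=query).json())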
21. Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into a single query (see the sketch below)
○ Use filters
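When an intersection can't be modeled away, the theta-sketch extension lets a single Druid query compute several sketches and intersect them server-side, instead of issuing one query per set. A hedged sketch of such a combined query (names are again illustrative):

query = {
    "queryType": "timeseries",
    "dataSource": "segments",
    "granularity": "day",
    "intervals": ["2016-11-15/2016-11-16"],
    # One pass over the data builds both sketches...
    "aggregations": [
        {"type": "filtered",
         "filter": {"type": "selector", "dimension": "attribute", "value": "Porsche Intent"},
         "aggregator": {"type": "thetaSketch", "name": "porsche", "fieldName": "device_sketch"}},
        {"type": "filtered",
         "filter": {"type": "selector", "dimension": "attribute", "value": "US"},
         "aggregator": {"type": "thetaSketch", "name": "us", "fieldName": "device_sketch"}},
    ],
    # ...and the intersection is estimated from the sketches, not the raw rows
    "postAggregations": [
        {"type": "thetaSketchEstimate", "name": "porsche_and_us",
         "field": {"type": "thetaSketchSetOp", "name": "intersect", "func": "INTERSECT",
                   "fields": [{"type": "fieldAccess", "fieldName": "porsche"},
                              {"type": "fieldAccess", "fieldName": "us"}]}}
    ],
}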
22. Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4x
■ Reduced used storage by 10x
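For context, the per-row sketches are defined at ingestion time. A trimmed, illustrative fragment of a Druid native batch ingestion spec (the S3 prefix and field names are placeholders; the talk itself used Hadoop indexing on EMR):

ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "s3", "prefixes": ["s3://bucket/events/2016-11-15/"]},
            "inputFormat": {"type": "parquet"},   # vs CSV: far less I/O per row
        },
        "dataSchema": {
            "dataSource": "segments_by_region",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["attribute", "region"]},
            # roll-up happens here: raw device IDs fold into one sketch per row
            "metricsSpec": [
                {"type": "thetaSketch", "name": "device_sketch", "fieldName": "device_id"}
            ],
            "granularitySpec": {"segmentGranularity": "day",
                                "queryGranularity": "day",
                                "rollup": True},
        },
    },
}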
DaaS = a marketplace for device-level data, connecting buyers and sellers
SaaS = the Nielsen Marketing Cloud platform, which helps brands connect with their customers using our big data sets and analytics tools
Our serving layer (front end) aggregates data from various online and offline sources
We aggregate around 10B events per day
Past…
Mention “cardinality” and “real-time dashboard”
Explain the need to union and intersect
Bit vector - Elasticsearch/Redis are examples of such systems
We tried to introduce a new cluster dedicated to indexing only, and then use backup and restore to the second cluster
This method was very expensive and only partially helpful
Tuning for better performance didn't help much either
Preprocessing - too many combinations; the formula length is not bounded (show some numbers)
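Some numbers to that point (illustrative arithmetic based on the 80K attributes mentioned earlier, not a figure from the talk): even precomputing only the pairwise intersections is already infeasible, and real expressions are not limited to pairs.

from math import comb

attributes = 80_000
print(comb(attributes, 2))   # 3,199,960,000 pairwise intersections alone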
HyperLogLog
- Implementation in Elasticsearch was too slow (done at query time)
- Set operations increase the error dramatically
Unions and intersections increase the error
The problematic case is intersecting a very small set with a very big set
The larger the K, the smaller the error
However, a larger K means more memory and storage are needed
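Quantitatively, the relative standard error of a KMV/Theta sketch shrinks roughly as 1/√k (a standard property of these sketches; the concrete k below is Druid's default thetaSketch size, not a number from the talk):

k = 16_384               # Druid's default thetaSketch size
rse = 1 / k ** 0.5
print(f"{rse:.2%}")      # ~0.78% relative standard error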
So we talked about statistical algorithms, which is nice, but we needed a practical solution…
Druid supports the ThetaSketch algorithm out of the box (OOTB)
Timeseries database - the first thing you need to know about Druid
Column types:
Timestamp
Dimensions
Metrics
Together they comprise a Datasource
There are different types of roll-ups (sum, count, etc.)
Aggregation is done at ingestion time (the outcome is much smaller in size)
At query time, it's closer to a key-value search
We have 3 types of processes - ingestion, querying and management. All processes are decoupled and scalable
Ingestion (real-time, e.g. from Kafka; batch - talk about deep storage and how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion)
Lambda architecture
Explain the tuple and what is happening during the aggregation
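A tiny made-up illustration of that roll-up (day granularity; the device_sketch column is a theta sketch, shown here only by its estimate):

Raw events:
2016-11-15T10:01 | Porsche Intent | US | device-1
2016-11-15T11:42 | Porsche Intent | US | device-2
2016-11-15T13:07 | Porsche Intent | US | device-1

After ingestion-time roll-up (one row per day/dimension combination):
Timestamp | Attribute | Region | count | device_sketch
2016-11-15 | Porsche Intent | US | 3 | ~2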
Setup is not easy - separate configs, servers and tuning per process
This caused the deployment to take a few months
Use Druid's recommendations for production configuration
Monitoring Your System
Druid has built-in support for Graphite (exports many metrics)
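For reference, a minimal sketch of the relevant runtime properties, assuming the graphite-emitter extension is loaded (hostname, port and prefix are placeholders):

druid.extensions.loadList=["graphite-emitter"]
druid.emitter=graphite
druid.emitter.graphite.hostname=graphite.example.com
druid.emitter.graphite.port=2003
druid.emitter.graphite.eventConverter={"type":"all", "namespacePrefix":"druid"}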
Data Modeling
If using theta sketches - reduce the number of intersections (show a slide of the old and new data model). It didn't solve all use cases, but it gives you an idea of how you can approach the problem
Different datasources - e.g. lower accuracy for faster queries vs. higher accuracy with slightly slower queries
Combine multiple queries over the REST API
There can be billions of rows, so filter the data as part of the query (as early as possible)
Ingestion doesn't affect querying; sub-second response even for 100s or 1000s of concurrent queries
Cost is for the entire solution (Druid cluster, EMR, etc.)
With Druid and ThetaSketch, we've improved our ingestion volume, query performance and concurrency by an order of magnitude, at a lower cost compared to our old solution
(We’ve achieved a more performant, scalable, cost-effective solution)