At NMC (Nielsen Marketing Cloud) we need to present to our clients the number of unique users who meet given criteria. The criteria are typically a set-theoretic expression over a stream of events for a given time range. Historically, we used Elasticsearch to answer these types of questions; however, we encountered major scaling issues. In this presentation we will detail the journey of researching, benchmarking and productionizing a new technology, Druid with DataSketches, to overcome the limitations we were facing.
2. Introduction
Yakir Buskilla
● Software Architect
● Focusing on Big Data and Machine Learning problems
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years
3. Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 2 years ago
● A leader in the Ad Tech and Marketing Tech industry
● What do we do?
○ Data as a Service (DaaS)
○ Software as a Service (SaaS)
5. The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find the number of distinct elements, in real time, in a data stream that may contain repeated elements
8. Possible solutions
● Naive - store everything
● Bit vector - store only 1 bit per device (see the back-of-the-envelope calculation below)
○ 10B devices → 1.25 GB/day
○ 10B devices × 80K attributes → 100 TB/day
● Approximate
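How the bit-vector numbers above come about, as a quick sanity check (the 10B-devices and 80K-attributes figures are from the slide; the script itself is ours):

# Back-of-the-envelope sizing for the bit-vector approach
devices = 10_000_000_000                # 10B devices, 1 bit each
attributes = 80_000                     # 80K attributes

bytes_per_attribute = devices / 8       # bits -> bytes
print(bytes_per_attribute / 1e9)        # ~1.25 GB/day for a single attribute
print(bytes_per_attribute * attributes / 1e12)  # ~100 TB/day for all attributes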
9. Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data took 10 hours to index
■ Indexing affected query time
○ Querying
■ Low concurrency
■ Scans on all the shards of the corresponding index
10. What we tried
● Preprocessing
● Statistical algorithms (e.g., HyperLogLog)
11. ThetaSketch
● K Minimum Values (KMV)
○ Estimate set cardinality
○ Supports set-theoretic operations
[Venn diagram: sets X and Y]
● ThetaSketch mathematical framework - a generalization of KMV
[Venn diagram: sets X and Y, labeled "ThetaSketch"]
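To make the KMV/ThetaSketch idea concrete, here is a minimal sketch using the Apache DataSketches Python bindings (pip install datasketches); the device IDs, segment names and lg_k value are illustrative, not from the talk:

from datasketches import update_theta_sketch, theta_union, theta_intersection

# One sketch per attribute, built by streaming device IDs through it
porsche_intent = update_theta_sketch(12)      # lg_k=12 -> k=4096 retained hashes
us_region = update_theta_sketch(12)
for device_id in ("dev-1", "dev-2", "dev-3"):
    porsche_intent.update(device_id)
for device_id in ("dev-2", "dev-3", "dev-4"):
    us_region.update(device_id)

# Cardinality estimate for a single set
print(porsche_intent.get_estimate())          # 3.0 (exact at this small size)

# Set-theoretic operations work on the sketches, not the raw data
union = theta_union(12)
union.update(porsche_intent)
union.update(us_region)
print(union.get_result().get_estimate())      # |X ∪ Y| -> 4.0

intersection = theta_intersection()
intersection.update(porsche_intent)
intersection.update(us_region)
print(intersection.get_result().get_estimate())  # |X ∩ Y| -> 2.0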
20. Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases

Old data model:
Timestamp | Attribute | Count Distinct
2016-11-15 | Porsche Intent | XXXXXX
2016-11-15 | US | XXXXXX
... | ... | ...

New data model:
Timestamp | Attribute | Region | Count Distinct
2016-11-15 | Porsche Intent | US | XXXXXX
... | ... | ... | ...
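With the new model, "unique Porsche intenders in the US" no longer requires intersecting two sketches; it becomes a plain filtered query. A minimal sketch of such a Druid query (datasource, dimension and metric names are illustrative; 8082 is the default Broker port):

import requests

query = {
    "queryType": "timeseries",
    "dataSource": "segments_by_region",        # the remodeled datasource
    "granularity": "day",
    "intervals": ["2016-11-15/2016-11-16"],
    "filter": {"type": "and", "fields": [
        {"type": "selector", "dimension": "attribute", "value": "Porsche Intent"},
        {"type": "selector", "dimension": "region", "value": "US"},
    ]},
    # device_sketch is a thetaSketch metric built at ingestion time
    "aggregations": [
        {"type": "thetaSketch", "name": "unique_devices", "fieldName": "device_sketch"}
    ],
}
print(requests.post("http://broker:8082/druid/v2", json=query).json())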
21. Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into a single query (see the sketch below)
○ Use filters
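When an intersection can't be modeled away, the theta-sketch extension lets a single Druid query compute several sketches and intersect them server-side, instead of issuing one query per set. A hedged sketch of such a combined query (names are again illustrative):

query = {
    "queryType": "timeseries",
    "dataSource": "segments",
    "granularity": "day",
    "intervals": ["2016-11-15/2016-11-16"],
    # One pass over the data builds both sketches...
    "aggregations": [
        {"type": "filtered",
         "filter": {"type": "selector", "dimension": "attribute", "value": "Porsche Intent"},
         "aggregator": {"type": "thetaSketch", "name": "porsche", "fieldName": "device_sketch"}},
        {"type": "filtered",
         "filter": {"type": "selector", "dimension": "attribute", "value": "US"},
         "aggregator": {"type": "thetaSketch", "name": "us", "fieldName": "device_sketch"}},
    ],
    # ...and the intersection is estimated from the sketches, not the raw rows
    "postAggregations": [
        {"type": "thetaSketchEstimate", "name": "porsche_and_us",
         "field": {"type": "thetaSketchSetOp", "name": "intersect", "func": "INTERSECT",
                   "fields": [{"type": "fieldAccess", "fieldName": "porsche"},
                              {"type": "fieldAccess", "fieldName": "us"}]}}
    ],
}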
22. Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4x
■ Reduced used storage by 10x
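For context, the per-row sketches are defined at ingestion time. A trimmed, illustrative fragment of a Druid native batch ingestion spec (the S3 prefix and field names are placeholders; the talk itself used Hadoop indexing on EMR):

ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "s3", "prefixes": ["s3://bucket/events/2016-11-15/"]},
            "inputFormat": {"type": "parquet"},   # vs CSV: far less I/O per row
        },
        "dataSchema": {
            "dataSource": "segments_by_region",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["attribute", "region"]},
            # roll-up happens here: raw device IDs fold into one sketch per row
            "metricsSpec": [
                {"type": "thetaSketch", "name": "device_sketch", "fieldName": "device_id"}
            ],
            "granularitySpec": {"segmentGranularity": "day",
                                "queryGranularity": "day",
                                "rollup": True},
        },
    },
}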
DaaS = a marketplace for device-level data, connecting buyers and sellers
SaaS = the Nielsen Marketing Cloud platform, which helps brands connect with their customers using our big data sets and analytics tools
Our serving layer (front end) aggregates data from various online and offline sources
We aggregate around 10B events per day
Past…
Mention “cardinality” and “real-time dashboard”
Explain the need to union and intersect
Bit vector - Elasticsearch/Redis are examples of such systems
We tried to introduce a new cluster dedicated to indexing only, and then use backup and restore to the second cluster
This method was very expensive and only partially helpful
Tuning for better performance didn't help much either
Preprocessing - too many combinations; the formula length is not bounded (show some numbers)
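Some numbers to that point (illustrative arithmetic based on the 80K attributes mentioned earlier, not a figure from the talk): even precomputing only the pairwise intersections is already infeasible, and real expressions are not limited to pairs.

from math import comb

attributes = 80_000
print(comb(attributes, 2))   # 3,199,960,000 pairwise intersections alone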
HyperLogLog
- Implementation in Elasticsearch was too slow (done at query time)
- Set operations increase the error dramatically
Unions and intersections increase the error
The problematic case is intersecting a very small set with a very big set
The larger the K, the smaller the error
However, a larger K means more memory and storage are needed
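Quantitatively, the relative standard error of a KMV/Theta sketch shrinks roughly as 1/√k (a standard property of these sketches; the concrete k below is Druid's default thetaSketch size, not a number from the talk):

k = 16_384               # Druid's default thetaSketch size
rse = 1 / k ** 0.5
print(f"{rse:.2%}")      # ~0.78% relative standard error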
So we talked about statistical algorithms, which is nice, but we needed a practical solution…
Druid supports the ThetaSketch algorithm out of the box (OOTB)
Timeseries database - the first thing you need to know about Druid
Column types:
Timestamp
Dimensions
Metrics
Together they comprise a Datasource
There are different types of roll-ups (sum, count, etc.)
Aggregation is done at ingestion time (the outcome is much smaller in size)
At query time, it's closer to a key-value search
We have 3 types of processes - ingestion, querying and management. All processes are decoupled and scalable
Ingestion (real-time, e.g. from Kafka; batch - talk about deep storage and how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion)
Lambda architecture
Explain the tuple and what is happening during the aggregation
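A tiny made-up illustration of that roll-up (day granularity; the device_sketch column is a theta sketch, shown here only by its estimate):

Raw events:
2016-11-15T10:01 | Porsche Intent | US | device-1
2016-11-15T11:42 | Porsche Intent | US | device-2
2016-11-15T13:07 | Porsche Intent | US | device-1

After ingestion-time roll-up (one row per day/dimension combination):
Timestamp | Attribute | Region | count | device_sketch
2016-11-15 | Porsche Intent | US | 3 | ~2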
Setup is not easy - separate configs, servers and tuning per process
This caused the deployment to take a few months
Use Druid's recommendations for production configuration
Monitoring Your System
Druid has built-in support for Graphite (exports many metrics)
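For reference, a minimal sketch of the relevant runtime properties, assuming the graphite-emitter extension is loaded (hostname, port and prefix are placeholders):

druid.extensions.loadList=["graphite-emitter"]
druid.emitter=graphite
druid.emitter.graphite.hostname=graphite.example.com
druid.emitter.graphite.port=2003
druid.emitter.graphite.eventConverter={"type":"all", "namespacePrefix":"druid"}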
Data Modeling
If using theta sketches - reduce the number of intersections (show a slide of the old and new data model). It didn't solve all use cases, but it gives you an idea of how you can approach the problem
Different datasources - e.g. lower accuracy for faster queries vs. higher accuracy with slightly slower queries
Combine multiple queries over the REST API
There can be billions of rows, so filter the data as part of the query (as early as possible)
Ingestion doesn't affect querying; sub-second response even for 100s or 1000s of concurrent queries
Cost is for the entire solution (Druid cluster, EMR, etc.)
With Druid and ThetaSketch, we've improved our ingestion volume, query performance and concurrency by an order of magnitude, at a lower cost compared to our old solution
(We’ve achieved a more performant, scalable, cost-effective solution)