20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, real-time, analytical datastore

20 years of enterprise architecture experience
CRM, EDRM, ERP, EIP, Digital Services, Security, BI,
Analytics, and MDM. TOGAF certiﬁed and a BA (hons)
degree in Theology (!) and Computer Studies from the
University of Birmingham in the United Kingdom.
Peter Marshall
EMEA Field Engineering
peter.marshall@imply.io
https://www.linkedin.com/in/amillionbytes
+44 781 361 8895

Agenda
What inspired Druid?
How is it built?
What does it do to the data?
The Druid Story
A Druid Cluster
Data in Druid
Q&A

Agenda
Why was Druid developed?
Who developed it?
The Druid Story
A Druid Cluster
Data in Druid
Q&A

unscalable poor query and ingestion scalability
disjointed event data and batch data separately
limiting expensive, product-constraining precalculation
slow high latencies from production to presentation

high performance low-latency, distributed query execution and high
throughput ingestion
real-time
event data (clickstream, network ﬂows, user
behavior, programmatic advertising, server
metrics, IoT…)
analytics counting, ranking, statistics...
database
highly-available, time-sharded, partitioned,
columnar, indexed, compressed, versioned
materialized view

2012 Druid open sourced
Druid 0.5 - Query prioritization, spatial indexing, group limits and ordering, indexing improvements
PyDruid promoted
Druid 0.6 - Union, topN, alpha sorting, smarter brokers
Apache 2.0 License applied
Druid 0.7.1 - Case sensitivity, pluggable metadata DB, LZ4 compression, improved balance, GROUP BY hashing
Druid 0.7.2
Community Project formed
Druid 0.8.3 - Broker performance improvements, Azure Blob storage support, TopN improvements, IPv6, more metrics
Druid 0.9.0 - Extensions overhaul, doubleMax / Min aggregators, Avro support, Graphite emitter, regex search, dimension ordering
Druid 0.9.1 - Query-time lookups, native Kafka indexing, statsD emitter, authorization, cost-based distribution, filter optimization
Druid 0.9.2 - New GROUP BY engine, roll-up made optional, Long filtering and encoding, datasketches, ORC support
Druid 010.0 - Apache Calcite SQL, Kerberos, LIKE, 2GB+ dims, GROUP BY improved, Ambari emitter, more DB support
First Powered By list includes Yahoo!, Walmart, TripleLift, PayPal, Netflix, Hulu, eBay, and AirBnB
Druid 0.10.1
Druid 0.11.0 - Double columns, TLS encryption, auth extensions, Redis cache, GROUP BY improvements, MORE SQL
Druid becomes an incubating Apache project
Druid 0.12.0 - Incremental Kafka hand-off, compaction task, Quantiles sketch, basic auth, query queue improvements, MORE SQL
Druid 0.12.1 - Kerberos and Kafka improvements
Druid 0.12.2 - Ingestion improvements and better query caching
Druid 0.12.3 - Even more ingestion and query improvements
Druid 0.13 - Parallel batch, auto-compaction, SYSTEM, HyperLogLog, NULLs, blooms, new aggregators, OpenTSDB emitter
Druid 0.14 - New web console, Kinesis indexer, Parquet support, DogStatsD emitter, improved data balance
Druid 0.14.2 - Improved datasketch support
Druid 0.15 - Dataloader UI, moving averages, Moments datasketch, Orc core extensions, MORE SQL
Druid 0.16 - Server / SQL / Data UI, vectorization, minor compaction, GROUP BY arrays, Batch Shuffle, Docker imageexcel
Druid 0.17 - Superbatch, parallel auto-compaction, optimistic result merging, SQL NULLs, LDAP, Tasks sys table, lazy loading
Apache Druid Summit, San Francisco
2013
2014
2015
2016
2017
2018
2019
2020

“It can do very large, OLAP-style processing
on the ﬂy in hundreds of milliseconds instead
of precalculating everything…
“The performance is great ... some of the
tables that we have internally in Druid have
billions and billions of events in them, and
we’re scanning them in under a second.”
https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuﬀ.html
TIM TULLY
VP Engineering

Message Buses & Queues
Relational Databases
NoSql Databases
Applications & APIs
Specialised Collectors
Consolidation
Enrichment
Transformation & Stripping
Veriﬁcation & Validation
Filtering & Sorting
Logical Reasoning
Feature & Structure Discovery
Segmentation & Classiﬁcation
Recommendation
Forecasting & Prediction
Anomaly Detection
Automated Decision Making
Statistical Calculations
Real-time Analytics
BI Reporting & Dashboards
Search
Buses, Queues & Databases
Applications & APIs
Data Procurement Data Preparation Data Processing Data Presentation
Data Pipeline

Complete Scalable Focused
parallelised ingestion and
distributed storage
broad, detailed, current and
historical data sets
sub-second ad-hoc
statistical calculation

Complete Scalable Focused
parallelised ingestion and
distributed storage
broad, detailed, current and
historical data sets
sub-second ad-hoc
statistical calculation
search
real-time ingest, flexible
schema, text search, combined
view
columnar
efficient storage,
fast analytic queries
timeseries
low latency, time-based
datasets and functions

Agenda
The Druid Story
A Druid Cluster
Data in Druid
Q&A

Agenda
What are the key processes?
What do they do?
How scalable is it?
How do I keep an eye on health?
The Druid Story
A Druid Cluster
Data in Druid
Q&A

Zookeeper
Data
INGEST DATA ⬩ STORE DATA ⬩ RESPOND TO QUERIES
Query
LOCATE DATA ⬩ ISSUE QUERY ⬩ MERGE RESULTS
Deep Store
Master
ISSUE TASKS ⬩ CATALOGUE DATA ⬩ MONITOR STATE
Data Distribution Task Co-ordination
Segment Tracking
Retention, Load Balancing, Health Goals
Ingestion Task State Recording
Ingestion Task Management
overlordcoordinator
Metadata DB

Data
Zookeeper
Query
Deep Store
Master
EVENT INGEST
BATCH INGEST
CloudStorage
Web
Disk
Hadoop
EventHubs
middlemanager
Stream & Batch Data Ingestion Tasks
Data Segment Creation
Deep Storage Writing
Fresh Data Querying (Memory / Disk Cache)
Data Compaction Processing
REGISTER
Segment Tracking
overlordcoordinator
Metadata DB

Zookeeper
Data
Query
Deep Store
Master
middlemanager
EVENT INGEST
BATCH INGEST
CloudStorage
Web
Disk
Hadoop
EventHubs
REGISTER
Segment Tracking
overlordcoordinator
Metadata DB
historical
historical
LOAD
historical
historical
historical
historical
Deep Data Storage & Retrieval
Deep Data Disposal
Deep Data Replication & Transfer
Master Data Lookups
historical

Zookeeper
Data
Deep Store
historical
historical
historical
historical
historical
historical
Master
Segment Tracking
overlordcoordinator
Metadata DB
Query
middlemanager
middlemanager
middlemanager
middlemanager
middlemanager
middlemanager

Zookeeper
Query
Master
Data
In Memory Disk
Query Cache
Deep Store
8091
Last Known Good Query Distribution
Results Consolidation & Aggregation
Druid Metadata Views
Apache Calcite SQL-to-Druid JSON
SQL, REST, JDBC interfaces
broker
API
HISTORICAL RESULTS
Segment Tracking
overlordcoordinator
Metadata DB
REAL-TIME RESULTS
In Memory
historical
historical
historical
historical
historical
historical
middlemanager
middlemanager
middlemanager
middlemanager
middlemanager
middlemanager
LOCATION

Zookeeper
Data
Query
Last Known Good Query Distribution
Results Consolidation & Aggregation
Druid Metadata Views
Apache Calcite SQL-to-Druid JSON
SQL, REST, JDBC interfaces
broker
Deep Data Storage & Retrieval
Deep Data Disposal
Deep Data Replication & Transfer
Master Data Lookups
historical
Deep Store
Master
Segment Tracking
overlordcoordinator
Metadata DB
Query Cache
In Memory Disk
In Memory
middlemanager
8082
8083
8081
8090
8091

Agenda
What are Druid’s data optimisations?
Why are those optimisations so cool?
What types of data work best?
The Druid Story
A Druid Cluster
Data in Druid
Q&A

Source Format
Segment Period
Dimensions
Row Filters
Transforms

Build your ingestion spec in a GUI... Monitor tasks, check segments, run SQL...

Columnar
Dictionary Encoded
Bitmap Indexed
Compressed
Source Format
Segment Period
Dimensions
Row Filters
Transforms

when who what stockChange
12:30 peter purchased 5
12:32 paul purchased 6
12:40 peter returned -2
12:50 ahmed purchased 2
satisfaction
5
3
4
2
1
2
3
4
2
1

when what stockChange
12:30 1 5
12:32 6
12:33 9
12:35 2
12:40 2 -2
12:42 3
12:45 8
12:50 2
12:52 4
12:55 -5
satisfactionwho
1
2
1
3
1
1
2
3
peter
paul
ahmed
1
2
purchased
returned
2
2
2
5
3
4
2
1
2
3
4
2
1
2
2
1
1
1
1 1
1
1
1

12:30
1
5
12:32 6
12:33 9
12:35 2
12:40 2 -2
12:42
1
3
12:45 8
12:50 2
12:52 4
12:55 -5
satisfactionwho
1
2
1
3
1
1
2
3
peter
paul
ahmed
1
2
purchased
returned
2
2
2
5
3
4
2
1
2
3
4
2
1
1
2
3
1000110001
0111001010
0000000100
1
2
1111011110
0000100001

12:30
1
5
12:32 6
12:33 9
12:35 2
12:40 2 -2
12:42
1
3
12:45 8
12:50 2
12:52 4
12:55 -5
satisfactionwho
1
2
1
3
1 2
2
2
5
3
4
2
1
2
3
4
2
1
Dimensions
Filters ~ Sorting ~ Functions ~ Aggregate ~ Top N ~ Group By ~ Bucketing
Windowed Aggregates
Metrics ~ Sets ~ Approx Count / Quantiles

when what
1
2
1
who
1
2
1
3
1 2
2
2
stockChange
5
6
9
2
-2
3
8
2
4
-5
satisfaction
5
3
4
2
1
2
3
4
2
1
12:30
12:40
12:50
Dimensions
Windowed Aggregates

satisfaction
5
3
4
2
1
2
3
4
2
1
when
12:30
12:40
12:50
what
1
2
1
2
stockChange
5
6
9
2
-2
3
8
2
4
-5
Dimensions
sumStkChg
22
9
1
maxSat
5
3
4
Windowed Aggregates
who
1
2
1
3
1
2
2

when
12:30
12:40
12:50
what
1
2
1
who
1
2
1
3
1 2
2
2
Dimensions
whoHLL
fgndnbe
bsdsfdb
bdfbdsf
Windowed Aggregates
sumStkChg
22
9
1
maxSat
5
3
4

1
2
3
peter
paul
ahmed
1
2
purchased
returned
wn
12:30
12:40
12:50
wt
1
2
1
2
ssc
22
9
1
ms
5
3
4
wH
fgndnbe
bsdsfdb
bdfbdsf

when who what stockChange
12:50 ahmed purchased 2
satisfaction
5
3
4
2
1
2
3
4
2
1
1
2
3
peter
paul
ahmed
1
2
purchased
returned
wn
12:30
12:40
12:50
wt
1
2
1
2
ssc
22
9
1
ms
5
3
4
wH
fgndnbe
bsdsfdb
bdfbdsf

https://www.researchgate.net/publication/333831332_Challenging_SQL-on-Hadoop_Performance_with_Apache_Druid

http://druid.apache.org/
Druid Forum
https://groups.google.com/forum/#!forum/druid-user
Apache Distribution
https://github.com/apache/druid
Druid Community
https://druid.apache.org/community/
Imply Distribution
https://imply.io/get-started

20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, real-time, analytical datastore

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à 20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, real-time, analytical datastore

Similaire à 20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, real-time, analytical datastore (20)

Plus de Athens Big Data

Plus de Athens Big Data (20)

Dernier

Dernier (20)

20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, real-time, analytical datastore