Apache Drill Architecture – High-Performance SQL with a JSON Data Model

© 2015 MapR Technologies
How Drill achieves Flexibility with Performance

Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data

Drill’s Data Model is Flexible
[Figure: data formats (JSON, BSON, HBase, Parquet, Avro, CSV, TSV) placed on two flexibility axes: fixed schema to dynamic schema, and flat to complex]

RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
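As a rough illustration of what discovering a schema on the fly can mean for records like the two above, the following Python sketch unions the field paths it sees while reading. It is not Drill's implementation; `discover_schema` and the dotted-path convention are hypothetical names chosen for this example.

```python
import json

# Hypothetical records modeled on the two documents shown in the slide.
RECORDS = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]


def discover_schema(lines):
    """Map each dotted field path to the set of Python type names observed."""
    schema = {}

    def walk(prefix, value):
        if isinstance(value, dict):
            # Nested maps contribute one path per leaf field.
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child)
        else:
            schema.setdefault(prefix, set()).add(type(value).__name__)

    for line in lines:
        walk("", json.loads(line))
    return schema


schema = discover_schema(RECORDS)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Neither record declares a schema up front, and the second record adds a field (`preschool`) that the first lacks; the reader simply widens its view of the data as new fields appear.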

Drill enables ‘SQL on Everything’
SELECT * FROM dfs.yelp.`business.json`
Storage plugin instance (dfs)
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive Metastore/HCatalog
- Easy API to go beyond Hadoop
Workspace (yelp)
- Sub-directory
- HBase namespace
- Hive database
Table (`business.json`)
- Pathnames
- Hive table
- HBase table
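The three-part name in the query above can be read as storage plugin instance, workspace, and table. The short Python sketch below splits such a reference while honoring backtick quoting; `split_table_reference` is a hypothetical helper written for illustration and is not part of Drill.

```python
def split_table_reference(ref):
    """Split plugin.workspace.`table` on dots that are outside backticks."""
    parts, current, quoted = [], [], False
    for ch in ref:
        if ch == "`":
            quoted = not quoted          # toggle quoting, drop the backtick
        elif ch == "." and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


plugin, workspace, table = split_table_reference("dfs.yelp.`business.json`")
print(plugin, workspace, table)
```

The backticks matter because the table here is a file name that itself contains a dot.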

Drill is a Distributed SQL query engine
[Diagram: a drillbit runs next to the DataNode/RegionServer on each node; a ZooKeeper ensemble sits alongside the cluster]
• Scale out
• Columnar and vectorized execution
• Optimistic and pipelined execution (no MR, Spark, Tez)
• Late binding
• Extensible

Drill allows reuse of existing SQL Tools and Skills
• Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support
• Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Drill is Designed For A Wide Set Of Use Cases
Use cases: Raw Data Exploration, JSON Analytics, DWH Offload, …
Data sources: Files, Directories, Hive, HBase, …
File formats: {JSON}, Parquet, Text Files, …

MapR Optimized Data Architecture
[Diagram: MapR Optimized Data Architecture]
Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data; data warehouse
Data Movement, Data Access
MapR Distribution for Hadoop
– Batch (MapReduce, Spark, Hive, Pig)
– Streaming (Spark Streaming, Storm)
– Interactive (Drill, Impala)
– MapR Data Platform: MapR-DB, MapR-FS
Analytics: search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics
Data Transformation, Enrichment and Integration
Operational Apps: recommendations, fraud detection, logistics
Machine Learning

Architecture – Under the hood

High Level Architecture
Cluster of commodity servers
– Daemon (drillbit) on each node
ZooKeeper maintains ephemeral cluster membership information
– Drillbit uses ZooKeeper to find other drillbits in the cluster
– Client uses ZooKeeper to find drillbits
Built-in, optimistic query execution engine. Doesn’t require a
particular storage or execution system (MapReduce, Spark, Tez)
– Better performance and manageability
Data processing unit is columnar record batches
– Enables schema flexibility with negligible performance impact

Basic Process
[Diagram: a query arrives at one of three Drillbits, each running next to DFS/HBase/Hive storage, with ZooKeeper alongside]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, REST)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed out to individual nodes
4. Result is returned to driving node
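Step 1 lists REST as one way in. The Python sketch below builds such a request using only the standard library. The host, the default web port 8047, and the `/query.json` endpoint with a `queryType`/`query` body are taken from Drill's REST documentation rather than from these slides, so treat them as assumptions to check against the version you run.

```python
import json
import urllib.request


def build_drill_query_request(sql, host="localhost", port=8047):
    """Build (but do not send) a POST request for a Drillbit's REST endpoint.

    host and port are placeholders; any Drillbit in the cluster can accept
    the query and act as the driving node.
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_query_request(
    "SELECT * FROM dfs.yelp.`business.json` LIMIT 5"
)
print(request.get_method(), request.full_url)

# Sending it requires a running Drillbit, for example:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

The request is only constructed here so the example runs without a cluster; the commented lines show where the network call would go.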

Core Modules within drillbit
[Diagram: RPC Endpoint; SQL Parser → Logical Plan → Optimizer → Physical Plan → Execution; Storage Plugins: DFS, Hive, HBase, MongoDB]

A Query engine that is…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible

Columnar representation
[Figure: a table with columns A to E; on disk, the values of each column are stored together, column by column]

Columnar Encoding
• Values in a column are stored next to one another
– Better compression
– Range-map: save min-max, can skip a block if the value cannot be present
• Only retrieve columns participating in query
• Drill optimizes for BOTH columnar storage and columnar execution
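The range-map idea can be sketched in a few lines of Python: each chunk of a column keeps its minimum and maximum, and a scan skips any chunk whose range cannot contain the value it is looking for. `ColumnChunk` and `scan_equal` are hypothetical names for this illustration, not Drill or Parquet APIs.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """A run of values from one column, with min/max kept as a range-map."""
    values: list

    def __post_init__(self):
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equal(chunks, target):
    """Return matching values and how many chunks actually had to be read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if target < chunk.min_value or target > chunk.max_value:
            continue  # the range-map proves the value is absent: skip the chunk
        chunks_read += 1
        matches.extend(v for v in chunk.values if v == target)
    return matches, chunks_read


chunks = [ColumnChunk([1, 3, 5]), ColumnChunk([10, 12, 12]), ColumnChunk([20, 25])]
matches, chunks_read = scan_equal(chunks, 12)
print(matches, chunks_read)
```

Only one of the three chunks is read for this lookup; the other two are ruled out by comparing two numbers each.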

Vectorization
Drill operates on more than one record at a time
– Word-sized manipulations
– SIMD instructions (GCC, LLVM and JVM all do various optimizations
automatically)
– Manually code algorithms
Logical Vectorization
– Bitmaps allow lightning fast null-checks
– Avoid branching to speed CPU pipeline
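To make the bitmap point concrete, here is a small Python sketch that packs one validity bit per value into 64-bit words, so a single comparison answers "is this whole block null?" or "is this whole block non-null?". It only illustrates the idea; Python integers are not machine words, and the names (`build_validity_words`, `sum_non_null`) are hypothetical.

```python
WORD_BITS = 64
ALL_VALID = (1 << WORD_BITS) - 1


def build_validity_words(values):
    """Pack one validity bit per value (1 = not null) into 64-bit words."""
    words = []
    for start in range(0, len(values), WORD_BITS):
        word = 0
        for offset, value in enumerate(values[start:start + WORD_BITS]):
            if value is not None:
                word |= 1 << offset
        words.append(word)
    return words


def sum_non_null(values, validity_words):
    """Sum the non-null values, checking nulls one word (64 values) at a time."""
    total = 0
    for index, word in enumerate(validity_words):
        block = values[index * WORD_BITS:(index + 1) * WORD_BITS]
        if word == 0:
            continue  # every value in the block is null: one comparison, no loop
        if len(block) == WORD_BITS and word == ALL_VALID:
            total += sum(block)  # no nulls at all: no per-value check needed
            continue
        for offset, value in enumerate(block):
            if (word >> offset) & 1:  # mixed block: test individual bits
                total += value
    return total


values = list(range(64)) + [None] * 64 + [5, None, 7]
validity = build_validity_words(values)
total = sum_non_null(values, validity)
print(total)
```

Two of the three blocks are handled without looking at individual values, which is the branch-avoiding effect the slide describes.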

Optimistic Execution
With a short time horizon, failures are infrequent
– Don’t spend energy and time creating boundaries and checkpoints to
minimize recovery time
– Rerun entire query in face of failure
No barriers
No persistence unless memory overflow
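The trade-off above can be sketched as a retry wrapper that keeps no checkpoints and simply reruns the whole query when something fails. This is a toy model, not Drill code: `run_optimistically`, `max_attempts`, and the simulated failure are all assumptions made for the example, and in practice the rerun may be triggered by the client.

```python
def run_optimistically(run_query, max_attempts=3):
    """Run the whole query; on failure, discard all progress and start over."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # No checkpoints are written, so there is nothing to resume from.
            return run_query(), attempt
        except RuntimeError as error:
            last_error = error
    raise last_error


calls = {"count": 0}


def flaky_query():
    """Hypothetical query that fails once, as if a node dropped out."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("node lost")
    return [("Michael", 6), ("Jennifer", 3)]


rows, attempts = run_optimistically(flaky_query)
print(rows, attempts)
```

The happy path pays nothing for recovery; the cost is that a failure repeats all of the work, which is acceptable when queries are short.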

Pipelining
Record batch is the unit of work for Drill
– Operators work on a record batch
Record batches are pipelined between nodes
– ~256 KB usually
Operator reconfiguration happens at batch boundaries
[Diagram: record batches flowing between three Drillbits]
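A minimal way to picture pipelined record batches is a chain of Python generators: each operator pulls one batch from its input, processes it, and hands it on before the scan has finished. The operators and sample rows below are hypothetical, and batches are sized by row count here rather than by bytes.

```python
def scan(rows, batch_size):
    """Source operator: emit the input as small record batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def filter_batches(batches, predicate):
    """Filter operator: works on one batch at a time, drops empty batches."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def project_batches(batches, column):
    """Project operator: keep a single column from each batch."""
    for batch in batches:
        yield [row[column] for row in batch]


rows = [
    {"name": "Michael", "age": 6},
    {"name": "Jennifer", "age": 3},
    {"name": "Sam", "age": 8},  # hypothetical extra row
]
pipeline = project_batches(
    filter_batches(scan(rows, batch_size=2), lambda row: row["age"] > 4),
    "name",
)
result = list(pipeline)
print(result)
```

Nothing is materialized between operators: the first batch reaches the projection before the second batch has been scanned.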

Runtime Compilation is Faster
[Chart: time for 1 million evaluations (ms), on a 0 to 500 scale, for trivial, simple and moderate expressions; legend: Janino, interpreted]
Source: http://bit.ly/16Xk32x

Drill compiler
[Diagram: CodeModel generates code; Janino compiles it to runtime byte-code; that byte-code is merged with precompiled byte-code templates; the merged class is loaded]
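The difference between interpreting an expression and compiling it at runtime can be shown with a Python analogy. Drill itself generates Java and compiles it with Janino; the sketch below swaps in Python's own `ast`, `compile` and `exec` purely to illustrate the idea, and `interpret` and `compile_expression` are hypothetical helpers.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Mult: operator.mul}


def interpret(node, row):
    """Interpreted evaluation: walk the expression tree again for every row."""
    if isinstance(node, ast.Expression):
        return interpret(node.body, row)
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](interpret(node.left, row), interpret(node.right, row))
    if isinstance(node, ast.Name):
        return row[node.id]
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported node: {type(node).__name__}")


def compile_expression(source, columns):
    """Runtime compilation: generate a function for this expression, compile once."""
    code = f"def evaluate({', '.join(columns)}):\n    return {source}\n"
    namespace = {}
    exec(compile(code, "<generated>", "exec"), namespace)
    return namespace["evaluate"]


source = "a * 2 + b"
tree = ast.parse(source, mode="eval")
evaluate = compile_expression(source, ["a", "b"])

interpreted = interpret(tree, {"a": 3, "b": 4})
compiled = evaluate(a=3, b=4)
print(interpreted, compiled)
```

Both paths give the same answer; the compiled function pays the parsing and dispatch cost once instead of on every row, which is where the speedup in the chart comes from.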

Cost-based Optimization
Pluggable rules, and cost model
Rules for distributed plan generation
- Insert Exchange operator into physical plan
- Parallel query plans
Pluggable cost model
- CPU, IO, memory, network cost (data locality)
- Storage engine features (HDFS vs Hive vs HBase)
[Diagram: a Query Optimizer with pluggable rules]
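A pluggable cost model can be pictured as a function the optimizer calls to score candidate plans. The Python sketch below is only an illustration under assumed numbers: `PlanCost`, `weighted_cost`, `choose_plan`, the two candidate plans and the weights are all hypothetical and do not reflect Drill's actual optimizer.

```python
from dataclasses import dataclass


@dataclass
class PlanCost:
    """Estimated resource use of one candidate physical plan."""
    cpu: float
    io: float
    memory: float
    network: float


def weighted_cost(cost, weights):
    """One possible cost model: a weighted sum of the four resources."""
    return (
        cost.cpu * weights["cpu"]
        + cost.io * weights["io"]
        + cost.memory * weights["memory"]
        + cost.network * weights["network"]
    )


def choose_plan(candidates, cost_model):
    """Pick the cheapest plan under whichever cost model is plugged in."""
    return min(candidates, key=lambda name: cost_model(candidates[name]))


candidates = {
    "local_scan": PlanCost(cpu=10, io=40, memory=5, network=0),
    "remote_scan": PlanCost(cpu=10, io=40, memory=5, network=30),
}
weights = {"cpu": 1.0, "io": 1.0, "memory": 1.0, "network": 2.0}
best = choose_plan(candidates, lambda cost: weighted_cost(cost, weights))
print(best)
```

Because the scoring function is passed in, swapping the cost model changes which plan wins without touching the search itself; here the network term is what makes the data-local plan cheaper.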

Integration and extensibility points
Support UDFs
– UDFs/UDAFs using high performance Java API
Not Hadoop centric
– Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.
– Build one distributed query engine together rather than one per technology
Built-in classpath scanning and plugin concept to add additional storage
engines, functions and operators with zero configuration
Support direct execution of strongly specified JSON based logical and physical
plans
– Simplifies testing
– Enables integration of alternative query languages

Additional Resources
Download Apache Drill
Tutorial: Apache Drill in 10 Minutes
Whiteboard Video with Tomer Shiran
