This document provides an introduction and overview of Apache Drill, an open source distributed SQL query engine designed for interactive analysis of large-scale datasets. It describes Drill's architecture as being inspired by Google's Dremel, with support for standard SQL queries, pluggable data sources, and schema flexibility. Drill distributes query execution across multiple nodes to maximize data locality and parallelism. Key features highlighted include full ANSI SQL support, support for nested data, optional schemas, and extensibility points.
5. Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources
6. Use Case II
• Logistics – supplier status
• Queries
– How many shipments from supplier X?
– How many shipments in region Y?

SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia

{
"shipment": 100123,
"supplier": "ACM",
"timestamp": "2013-02-01",
"description": "first delivery today"
},
{
"shipment": 100124,
"supplier": "BAP",
"timestamp": "2013-02-02",
"description": "hope you enjoy it"
}
…
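
The two queries on this slide can be sketched in plain Python to show what combining the supplier table with the JSON shipment records involves. This is only an illustration of the join the use case calls for, not Drill code; the function names and the choice to wrap the two records in a JSON array are assumptions made for the example.

```python
import json
from collections import Counter

# Supplier table from the slide, keyed by SUPPLIER_ID.
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# The two shipment records from the slide, wrapped in a JSON array
# (the wrapping is an assumption made so the text parses as one document).
SHIPMENTS_JSON = """
[
  {"shipment": 100123, "supplier": "ACM",
   "timestamp": "2013-02-01", "description": "first delivery today"},
  {"shipment": 100124, "supplier": "BAP",
   "timestamp": "2013-02-02", "description": "hope you enjoy it"}
]
"""


def shipments_per_supplier(shipments):
    """Count shipments grouped by supplier id."""
    return Counter(s["supplier"] for s in shipments)


def shipments_per_region(shipments, suppliers):
    """Join each shipment to its supplier row, then count by region."""
    return Counter(suppliers[s["supplier"]]["region"] for s in shipments)


shipments = json.loads(SHIPMENTS_JSON)
by_supplier = shipments_per_supplier(shipments)
by_region = shipments_per_region(shipments, SUPPLIERS)
```

With only the two sample records, `by_supplier["ACM"]` and `by_region["Europe"]` are both 1. The point of the use case is that the supplier table and the shipment documents live in different systems, so a join like this normally requires moving one of them first.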
7. Today’s Solutions
• RDBMS-focused
– ETL data from MongoDB and Hadoop
– Query data using SQL
• MapReduce-focused
– ETL from RDBMS and MongoDB
– Use Hive, etc.
8. Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
10. Apache Drill Overview
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Other QL possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, hundreds involved
13. High-level Architecture
• Each node: Drillbit - maximize data locality
• Co-ordination, query planning, execution, etc., are distributed
• By default Drillbits hold all roles
• Any node can act as endpoint for a query
[Diagram: four nodes, each running a Drillbit alongside a storage process]
14. High-level Architecture
• Zookeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Zookeeper and four nodes, each with a Drillbit, a distributed cache, and a storage process]
15. High-level Architecture
• Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
[Diagram: Zookeeper and four nodes, each with a Drillbit, a distributed cache, and a storage process]
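
To make "maximize data locality" concrete, here is a minimal sketch of how a foreman might assign scan work to Drillbits. It is not taken from Drill's code base: the function `assign_scans`, the block and node names, and the tie-breaking rule (least-loaded node, then name) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def assign_scans(block_locations, drillbits):
    """Assign each data block to one Drillbit (hypothetical scheduler).

    block_locations: dict mapping block id -> list of hosts storing the block
    drillbits: set of hosts that currently run a Drillbit
    A host that stores the block locally is preferred; if none runs a
    Drillbit, the block goes to the least-loaded Drillbit overall.
    """
    load = defaultdict(int)
    assignment = {}
    for block, hosts in sorted(block_locations.items()):
        local = [h for h in hosts if h in drillbits]
        candidates = local if local else sorted(drillbits)
        # Least-loaded candidate wins; ties are broken by host name.
        chosen = min(candidates, key=lambda h: (load[h], h))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster: three Drillbits, one block stored on a node without one.
drillbits = {"node1", "node2", "node3"}
block_locations = {
    "b1": ["node1", "node2"],
    "b2": ["node1"],
    "b3": ["node9"],
}
plan = assign_scans(block_locations, drillbits)
```

In this example `b1` and `b2` are read where they are stored, while `b3` has no local Drillbit and falls back to a remote read on a lightly loaded node.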
18. Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points
19. Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
– Datameer, Tableau, Excel, SAP Crystal Reports
– Use standard ODBC/JDBC driver
20. Nested Data
• Nested data becoming prevalent
– JSON/BSON, XML, ProtoBuf, Avro
– Some data sources support it natively (MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
21. Optional Schema
• Many data sources don’t have rigid schemas
– Schema changes rapidly
– Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or obtained via discovery
Extensibility
Points
• Source
query
–
parser
API
• Custom
operators,
UDF
–
logical
plan
• OpAmizer
• Data
sources
and
formats
–
scanner
API
Source
Query
Parser
Logical
Plan
OpAmizer
Physical
Plan
ExecuAon
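
The idea of extension points for data formats and user-defined functions can be sketched as a pair of registries. Nothing below is Drill's scanner or UDF API; the decorators `scanner` and `udf`, the plan dictionary, and the format names are hypothetical and only show how an engine can stay unchanged while formats and functions are plugged in.

```python
import json

SCANNERS = {}   # format name -> function turning raw text into a list of records
FUNCTIONS = {}  # UDF name -> Python callable applied to a single value


def scanner(fmt):
    """Register a scanner for a data format (hypothetical extension point)."""
    def register(fn):
        SCANNERS[fmt] = fn
        return fn
    return register


def udf(name):
    """Register a user-defined function (hypothetical extension point)."""
    def register(fn):
        FUNCTIONS[name] = fn
        return fn
    return register


@scanner("jsonl")
def scan_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]


@scanner("csv")
def scan_csv(text):
    header, *rows = [line.split(",") for line in text.splitlines() if line.strip()]
    return [dict(zip(header, row)) for row in rows]


@udf("upper")
def upper(value):
    return str(value).upper()


def execute(plan, text):
    """Run a tiny 'physical plan': scan the input, then apply one UDF to one column."""
    records = SCANNERS[plan["format"]](text)
    fn = FUNCTIONS[plan["function"]]
    return [fn(r[plan["column"]]) for r in records]


csv_result = execute(
    {"format": "csv", "function": "upper", "column": "name"},
    "id,name\n1,acme\n2,bits",
)
jsonl_result = execute(
    {"format": "jsonl", "function": "upper", "column": "name"},
    '{"name": "zu"}\n',
)
```

The same `execute` function handles both inputs; supporting another format or function means registering one more callable rather than changing the engine.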
23. … and Hadoop?
• HDFS can be a data source
• Complementary use cases …
• … use Apache Drill
– Find record with specified condition
– Aggregation under dynamic conditions
• … use MapReduce
– Data mining with multiple iterations
– ETL
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
25. Status
• Heavy development by multiple organizations
• Available
– Logical plan (ADSP)
– Reference interpreter
– Basic SQL parser
– Basic demo
– Basic HBase back-end
26. Status March/April
• Larger SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution focused on large cluster high performance sort, aggregation and join
28. Contributing
• DRILL-48 RPC interface for query submission and physical plan execution
• DRILL-53 Setup cluster configuration and membership mgmt system
– ZK for coordination
– Helix for partition and resource assignment (?)
• Further schedule
– Alpha Q2
– Beta Q3
29. Kudos to …
• Julian Hyde, Pentaho
• Timothy Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR
30. Engage!
• Follow @ApacheDrill on Twitter
• Sign up for the mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html
• Learn where and how to contribute https://cwiki.apache.org/confluence/display/DRILL/Contributing
• Keep an eye on http://drill-user.org/