Swiss Big Data User Group - Introduction to Apache Drill

1

Introduction to Apache Drill

Michael Hausenblas, Chief Data Engineer EMEA, MapR

6th Swiss Big Data User Group Meeting, Zurich, 2013-03-25

2

Kudos to http://cmx.io/

3

Workloads

•  Batch processing (MapReduce)
•  Light-weight OLTP (HBase, Cassandra, etc.)
•  Stream processing (Storm, S4)
•  Search (Solr, Elasticsearch)
•  Interactive, ad-hoc query and analysis (?)

4

Impala
Interactive Query at Scale

low-latency

5

Use Case I

•  Jane, a marketing analyst
•  Determine target segments
•  Data from different sources

6

Use Case II

•  Logistics – supplier status
•  Queries
    –  How many shipments from supplier X?
    –  How many shipments in region Y?

SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia

{
  "shipment": 100123,
  "supplier": "ACM",
  "timestamp": "2013-02-01",
  "description": "first delivery today"
},
{
  "shipment": 100124,
  "supplier": "BAP",
  "timestamp": "2013-02-02",
  "description": "hope you enjoy it"
}
…
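
Answering the two queries above means joining the supplier table with the shipment documents. A minimal Python sketch of that join follows, assuming both sources are already loaded in memory. The variable and function names are illustrative only and are not part of Drill.

```python
from collections import Counter

# Supplier reference data, as in the table above (e.g. from an RDBMS).
suppliers = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# Shipment documents, as in the JSON above (e.g. from a document store).
shipments = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_per_supplier(shipments):
    """Count shipments grouped by supplier ID."""
    return Counter(s["supplier"] for s in shipments)


def shipments_per_region(shipments, suppliers):
    """Count shipments grouped by the supplier's region (a join on supplier ID)."""
    return Counter(suppliers[s["supplier"]]["region"] for s in shipments)


# How many shipments from supplier ACM?
print(shipments_per_supplier(shipments)["ACM"])
# How many shipments in region Europe?
print(shipments_per_region(shipments, suppliers)["Europe"])
```

The point of the use case is that today this join requires moving one data set into the other system first; the sketch only shows the logic of the query, not how a distributed engine would run it.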

7

Today’s Solutions

•  RDBMS-focused
    –  ETL data from MongoDB and Hadoop
    –  Query data using SQL
•  MapReduce-focused
    –  ETL from RDBMS and MongoDB
    –  Use Hive, etc.

8

Requirements

•  Support for different data sources
•  Support for different query interfaces
•  Low-latency/real-time
•  Ad-hoc queries
•  Scalable, reliable

9

Google’s Dremel

http://research.google.com/pubs/pub36632.html

10

Apache Drill Overview

•  Inspired by Google’s Dremel
•  Standard SQL 2003 support
•  Other QL possible
•  Pluggable data sources
•  Support for nested data
•  Schema is optional
•  Community driven, open, 100s involved

11

Apache Drill Overview

12

High-level Architecture

13

High-level Architecture

•  Each node: Drillbit - maximize data locality
•  Co-ordination, query planning, execution, etc. are distributed
•  By default Drillbits hold all roles
•  Any node can act as endpoint for a query

[Diagram: four nodes, each hosting a storage process and a Drillbit]

14

High-level Architecture

•  Zookeeper for ephemeral cluster membership info
•  Distributed cache (Hazelcast) for metadata, locality information, etc.

[Diagram: Zookeeper; four nodes, each hosting a storage process, a Drillbit and a distributed cache]

15

High-level Architecture

•  Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
•  Streaming data communication avoiding SerDe

[Diagram: Zookeeper; four nodes, each hosting a storage process, a Drillbit and a distributed cache]

16

Principled Query Execution

[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution. Labels: SQL 2003, DrQL, MongoQL, DSL; parser API; scanner API; topology]

query: [
  {
    @id: "log",
    op: "sequence",
    do: [
      {
        op: "scan",
        source: "logs"
      },
      {
        op: "filter",
        condition: "x > 3"
      },
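
To make the logical plan above concrete, here is a minimal Python sketch of an interpreter for the two operators shown (scan and filter) inside a sequence. It is not Drill's reference interpreter: the in-memory `SOURCES` registry, the sample records, and the tiny `"field > number"` condition parser are all assumptions made for illustration.

```python
import operator

# Hypothetical in-memory data sources, keyed by the name used in "source".
SOURCES = {
    "logs": [{"x": 1}, {"x": 3}, {"x": 4}, {"x": 10}],
}

# Comparison operators supported by this toy condition parser.
COMPARATORS = {">": operator.gt, "<": operator.lt}


def parse_condition(condition):
    """Turn a string such as "x > 3" into a predicate over a record."""
    field, symbol, literal = condition.split()
    compare = COMPARATORS[symbol]
    threshold = float(literal)
    return lambda record: compare(record[field], threshold)


def run_sequence(steps):
    """Run the steps of a "sequence" operator, feeding each step's output to the next."""
    records = iter(())
    for step in steps:
        if step["op"] == "scan":
            records = iter(SOURCES[step["source"]])
        elif step["op"] == "filter":
            records = filter(parse_condition(step["condition"]), records)
        else:
            raise ValueError(f"unsupported operator: {step['op']}")
    return list(records)


# The plan fragment from the slide, written as a Python dictionary.
plan = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}

result = run_sequence(plan["do"])
print(result)
```

Each operator consumes an iterator and produces one, so records stream through the sequence instead of being materialized between steps.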

17

Drillbit Modules

[Diagram labels: DFS Engine, HBase Engine, Mongo, RPC Endpoint, SQL, HiveQL, Pig, Parser, Distributed Cache, Logical Plan, Physical Plan, Optimizer, Storage Engine Interface, Scheduler, Foreman, Operators]

18

Key Features

•  Full SQL 2003
•  Nested data
•  Optional schema
•  Extensibility points

19

Full SQL – ANSI SQL 2003

•  SQL-like is often not enough
•  Integration with existing tools
    –  Datameer, Tableau, Excel, SAP Crystal Reports
    –  Use standard ODBC/JDBC driver

20

Nested Data

•  Nested data becoming prevalent
    –  JSON/BSON, XML, ProtoBuf, Avro
    –  Some data sources support it natively (MongoDB, etc.)
•  Flattening nested data is error-prone
•  Extension to ANSI SQL 2003
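
Why flattening is error-prone can be seen in a small Python sketch. The record below is hypothetical (an order with a nested list of items). Flattening it into rows repeats the parent fields once per nested element, so a naive aggregate over the flat rows double-counts.

```python
# A hypothetical nested record: one order with two line items.
order = {
    "order_id": 1,
    "total": 30.0,
    "items": [
        {"sku": "A", "qty": 1},
        {"sku": "B", "qty": 2},
    ],
}


def flatten(record, list_field):
    """Emit one flat row per element of record[list_field], copying the parent fields."""
    parent = {k: v for k, v in record.items() if k != list_field}
    return [{**parent, **child} for child in record[list_field]]


rows = flatten(order, "items")

# The parent-level "total" now appears on every row, so summing it is wrong.
naive_total = sum(row["total"] for row in rows)
print(rows)
print(naive_total)  # 60.0, although the order total is 30.0
```

Querying the nested structure directly avoids this class of mistake, which is the motivation for extending SQL rather than flattening the data first.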

21

Optional Schema

•  Many data sources don’t have rigid schemas
    –  Schema changes rapidly
    –  Different schema per record (e.g. HBase)
•  Supports queries against unknown schema
•  Schema can be defined by the user or obtained via discovery
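
Schema discovery can be illustrated with a short Python sketch that scans records with differing fields and collects, per field, the value types observed. The sample records and the `discover_schema` helper are hypothetical; this is not a description of how Drill implements discovery.

```python
from collections import defaultdict

# Hypothetical records whose fields differ from one record to the next.
records = [
    {"id": "0001", "type": "donut", "ppu": 0.55},
    {"id": "0002", "type": "donut", "topping": "sugar"},
    {"id": 3, "ppu": 1.0},
]


def discover_schema(records):
    """Map each field name to the sorted list of Python type names observed for it."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


schema = discover_schema(records)
print(schema)
```

A field that maps to more than one type (here `id`) is exactly the situation a rigid, up-front schema cannot express.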

22

Extensibility Points

•  Source query – parser API
•  Custom operators, UDF – logical plan
•  Optimizer
•  Data sources and formats – scanner API

[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]

23

… and Hadoop?

•  HDFS can be a data source
•  Complementary use cases …
•  … use Apache Drill
    –  Find record with specified condition
    –  Aggregation under dynamic conditions
•  … use MapReduce
    –  Data mining with multiple iterations
    –  ETL

https://cloud.google.com/files/BigQueryTechnicalWP.pdf

24

Example

https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

data source: donuts.json

{
  "id": "0001",
  "type": "donut",
  "ppu": 0.55,
  "batters":
    {
      "batter":
        [
          { "id": "1001", "type": "Regular" },
          { "id": "1002", "type": "Chocolate" },
          …

logical plan: simple_plan.json

query:[ {
  op:"sequence",
  do:[
    {
      op: "scan",
      ref: "donuts",
      source: "local-logs",
      selection: {data: "activity"}
    },
    {
      op: "filter",
      expr: "donuts.ppu < 2.00"
    },
    …

result: out.json

{
"sales" : 700.0,
"typeCount" : 1,
"quantity" : 700,
"ppu" : 1.0
}
{
"sales" : 109.71,
"typeCount" : 2,
"quantity" : 159,
"ppu" : 0.69
}
{
"sales" : 184.25,
"typeCount" : 2,
"quantity" : 335,
"ppu" : 0.55
}
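
As a quick sanity check on the output above, every result row passes the plan's filter (ppu < 2.00), and each row is consistent with sales = quantity × ppu. The Python sketch below verifies both for the three rows shown. The sales relationship is an observation from the data only, since the part of the plan that computes these columns is elided.

```python
import math

# The three result rows from out.json, as shown above.
rows = [
    {"sales": 700.0, "typeCount": 1, "quantity": 700, "ppu": 1.0},
    {"sales": 109.71, "typeCount": 2, "quantity": 159, "ppu": 0.69},
    {"sales": 184.25, "typeCount": 2, "quantity": 335, "ppu": 0.55},
]


def passes_filter(row):
    """Apply the filter expression from the plan: ppu < 2.00."""
    return row["ppu"] < 2.00


def sales_matches(row, tolerance=0.005):
    """Check that sales equals quantity * ppu, within rounding to cents."""
    return math.isclose(row["sales"], row["quantity"] * row["ppu"], abs_tol=tolerance)


print(all(passes_filter(row) for row in rows))
print(all(sales_matches(row) for row in rows))
```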

25

Status

•  Heavy development by multiple organizations
•  Available
    –  Logical plan (ADSP)
    –  Reference interpreter
    –  Basic SQL parser
    –  Basic demo
    –  Basic HBase back-end

26

Status

March/April

•  Larger SQL syntax
•  Physical plan
•  In-memory compressed data interfaces
•  Distributed execution focused on large cluster high performance sort, aggregation and join

27

Contributing

•  Dremel-inspired columnar format: Twitter’s Parquet and Hive’s ORC file
•  Integration with Hive metastore (?)
•  DRILL-13 Storage Engine: Define Java Interface
•  DRILL-15 Build HBase storage engine implementation

28

Contributing

•  DRILL-48 RPC interface for query submission and physical plan execution
•  DRILL-53 Setup cluster configuration and membership mgmt system
    –  ZK for coordination
    –  Helix for partition and resource assignment (?)
•  Further schedule
    –  Alpha Q2
    –  Beta Q3

29

Kudos to …

•  Julian Hyde, Pentaho
•  Timothy Chen, Microsoft
•  Chris Merrick, RJMetrics
•  David Alves, UT Austin
•  Sree Vaadi, SSS/NGData
•  Jacques Nadeau, MapR
•  Ted Dunning, MapR

30

Engage!

•  Follow @ApacheDrill on Twitter
•  Sign up at mailing lists (user | dev)
   http://incubator.apache.org/drill/mailing-lists.html
•  Learn where and how to contribute
   https://cwiki.apache.org/confluence/display/DRILL/Contributing
•  Keep an eye on http://drill-user.org/
