Using Apache Drill: Interactive SQL for Hadoop and Beyond

© 2014 MapR Technologies

Agenda
• About Apache Drill
• Query Execution
• Demonstration
• Q and A

About Apache Drill

Community
• Mentors
– MapR, Lucid Works, Elasticsearch, University members
• Notable Committers
– MapR, Microsoft, Hortonworks, Concurrent, Oracle, Ohm Data

Apache Drill
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
• 40+ contributors
• 150+ years of experience building databases and distributed systems

Rethink SQL for Big Data
Preserve
• ANSI SQL
– Ubiquitous
• Familiar
– No context switch for BI/analytics
• One technology
– Painful to manage different technologies
• Enterprise ready
– System-of-record, HA, DR, security, multi-tenancy, …
Invent
• Flexible data model
– Allow schemas to evolve rapidly
– Support semi-structured data types
• Agility
– Self-service possible when developer and DBA are the same person
• Scalability
– In all dimensions: schemas, processes, management

Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (schema on write, schema before read)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (schema on the fly)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data

SQL
select * from A
where A.a in (
select B.b from B where B.b = A.c);
• Did you know Apache Hive cannot compute this query?
– The same goes for Hive-based engines, e.g. Impala and Spark SQL
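
To make the semantics of the correlated subquery above concrete, here is a minimal sketch that runs it from Python against an in-memory SQLite database. SQLite only stands in for any engine that accepts correlated subqueries; nothing here uses Drill, and the table contents are made up for illustration.

```python
import sqlite3


def correlated_in_demo():
    """Run the slide's correlated IN subquery on two tiny sample tables."""
    conn = sqlite3.connect(":memory:")
    # Sample data is an assumption; the slide only names tables A and B.
    conn.executescript("""
        CREATE TABLE A (a INTEGER, c INTEGER);
        CREATE TABLE B (b INTEGER);
        INSERT INTO A VALUES (1, 1), (2, 3), (4, 4), (5, 5);
        INSERT INTO B VALUES (1), (4), (7);
    """)
    # The inner query refers to A.c, so it is re-evaluated per outer row.
    rows = conn.execute("""
        SELECT * FROM A
        WHERE A.a IN (SELECT B.b FROM B WHERE B.b = A.c)
        ORDER BY A.a
    """).fetchall()
    conn.close()
    return rows
```

A row of A survives only when B contains a value equal to both A.a and A.c, which is why the inner query cannot be computed once and reused.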

YOU CAN’T HANDLE REAL SQL!

Semi-structured Data
select cf.month, cf.year
from hbase.table1;
• Of course, an RDBMS cannot handle this query
– Nor can Hive and its variants such as Impala and Spark SQL
• There’s no metastore definition available

YOU CAN’T HANDLE AN HBASE API!

Interactive SQL-on-Hadoop options
| | Drill 1.0 | Hive 0.13 w/ Tez | Impala 1.x | Shark 0.9 | Presto 0.56 |
|---|---|---|---|---|---|
| Latency | Low | Medium | Low | Medium | Low |
| Files | Yes (all Hive file formats, plus JSON, Text, …) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (all Hive file formats) | Yes (RC, Sequence, Text) |
| HBase/MapR-DB | Yes | Yes | Various issues | Yes | No |
| Schema | Hive or schema-less | Hive | Hive | Hive | Hive |
| SQL support | ANSI SQL | HiveQL | HiveQL (subset) | HiveQL | ANSI SQL |
| Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | JDBC |
| Hive compat | High | High | Low | High | High |
| Large joins | Yes | Yes | No | No | No |
| Nested data | Yes | Limited | No | Limited | Limited |
| Concurrency | High | Limited | Medium | Limited | Medium |

Data is Stored in Many Forms
• Flat files in DFS
– Complex data (Thrift, Avro, protobuf)
– Columnar data (Parquet, ORC)
– Loosely defined (JSON)
– Traditional files (CSV, TSV)
• Data stored in NoSQL stores
– Relational-like (rows, columns)
– Sparse data (NoSQL maps)
– Embedded blobs (JSON)
– Document stores (nested objects)
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["skiing", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["singing"],
  "preschool": "CCLC"
}
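
The two records above do not share the same fields, which is the case that schema discovery on the fly has to cope with. The sketch below is one simple way to picture it, not Drill's implementation: it scans self-describing JSON records and collects every field path it meets along with the value types seen. The function name `discover_schema` and the dotted-path notation are assumptions made for this example.

```python
import json

# The two sample records from the slide, one JSON document per line.
RECORDS = """
{"name": {"first": "Michael", "last": "Smith"}, "hobbies": ["skiing", "soccer"], "district": "Los Altos"}
{"name": {"first": "Jennifer", "last": "Gates"}, "hobbies": ["singing"], "preschool": "CCLC"}
"""


def discover_schema(lines):
    """Return {dotted.path: set of Python type names} seen across all records."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for line in lines.splitlines():
        if line.strip():
            walk(json.loads(line), "")
    return schema
```

Calling `discover_schema(RECORDS)` yields entries for `district` and `preschool` even though each appears in only one record; no schema was declared before reading.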

Drill’s Data Model is Flexible
[Figure: data formats (HBase, JSON, BSON, CSV, TSV, Parquet, Avro) placed along two flexibility axes: fixed schema to schema-less, and flat to complex]
RDBMS/SQL-on-Hadoop table
Name Gender Age
Michael M 6
Jennifer F 3
Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["skiing", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["singing"],
  "preschool": "CCLC"
}

Query Execution

Data Source is in the Query
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2
• dfs1: a storage engine instance
– DFS
– HBase
– Hive Metastore/HCatalog
• logs: a workspace
– Sub-directory
– Hive database
– HBase namespace
• `AppServerLogs/2014/Jan/p001.parquet`: a table
– Pathnames
– HBase table
– Hive table
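
The three-part name in the FROM clause above can be pulled apart mechanically. The following is a minimal sketch of that split, assuming backticks quote a part that may itself contain dots; `split_table_reference` and the dictionary keys are hypothetical names for this example and are not part of Drill.

```python
def split_table_reference(ref):
    """Split `storage.workspace.table`, honouring backtick-quoted parts."""
    parts, current, quoted = [], [], False
    for ch in ref:
        if ch == "`":
            # Toggle quoting; the backticks themselves are dropped.
            quoted = not quoted
        elif ch == "." and not quoted:
            # An unquoted dot ends the current part.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    if quoted:
        raise ValueError("unbalanced backtick in table reference")
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts, got {len(parts)}")
    return dict(zip(("storage", "workspace", "table"), parts))
```

For the query above this gives `dfs1` as the storage engine instance, `logs` as the workspace, and the Parquet path as the table; the dot in `p001.parquet` is kept because it sits inside backticks.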

Runtime Compilation is Faster
• The JIT is smart, but runtime compilation yields further gains
• Janino: a Java-based Java compiler
From http://bit.ly/16Xk32x

Drill Compiler
1. CodeModel generates code
2. Janino compiles it into runtime byte-code
3. The runtime byte-code is merged with the precompiled byte-code templates
4. The merged class is loaded
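
The idea behind runtime compilation can be shown with a loose analogy. Drill works with Java byte-code through CodeModel and Janino; the sketch below uses none of that and only contrasts two ways of evaluating a filter in Python: walking an expression tree for every row, versus generating source for that one expression and compiling it once. The tuple-based expression format and all function names are assumptions made for this example.

```python
def interpret(expr, row):
    """Generic path: walk the expression tree again for every row."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    if op == ">":
        return interpret(expr[1], row) > interpret(expr[2], row)
    if op == "and":
        return interpret(expr[1], row) and interpret(expr[2], row)
    raise ValueError(f"unknown operator: {op}")


def to_source(expr):
    """Translate the expression tree into a Python source fragment."""
    op = expr[0]
    if op == "col":
        return f"row[{expr[1]!r}]"
    if op == "const":
        return repr(expr[1])
    if op in (">", "and"):
        return f"({to_source(expr[1])} {op} {to_source(expr[2])})"
    raise ValueError(f"unknown operator: {op}")


def compile_filter(expr):
    """Specialized path: generate and compile a function for this query only."""
    source = f"def generated_filter(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated_filter"]


# WHERE errorLevel > 2, expressed as a tree.
EXPR = (">", ("col", "errorLevel"), ("const", 2))
ROWS = [{"errorLevel": 1}, {"errorLevel": 3}, {"errorLevel": 5}]
```

The compiled function returns the same answers as `interpret` but no longer dispatches on operator names for each row, which is the kind of overhead that query-specific code generation removes.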

Basic query flow
[Figure: Zookeeper; three nodes, each running a Drillbit with a distributed cache on top of DFS/HBase; a query arriving at one Drillbit]
1. Query comes to any Drillbit (JDBC, ODBC, CLI)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Data is returned to driving node
*Curator/Zookeeper for ephemeral cluster membership info
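
The four steps above can be mimicked in a few lines. This is a toy simulation under stated assumptions, not Drill code: the cluster is a plain dictionary, threads stand in for remote nodes, and the names `CLUSTER`, `plan`, `run_fragment` and `run_query` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each node name maps to the rows stored locally.
CLUSTER = {
    "node1": [{"errorLevel": 1, "message": "ok"},
              {"errorLevel": 3, "message": "disk"}],
    "node2": [{"errorLevel": 4, "message": "net"}],
    "node3": [{"errorLevel": 0, "message": "ok"}],
}


def plan(cluster):
    """Step 2: one fragment per node, so each fragment reads local data."""
    return [(node, rows) for node, rows in sorted(cluster.items())]


def run_fragment(fragment, min_level):
    """Step 3: a fragment filters the rows that live on its own node."""
    node, rows = fragment
    return [(node, r["message"]) for r in rows if r["errorLevel"] > min_level]


def run_query(cluster, min_level):
    """Step 1 and 4: accept the query, then gather the partial results."""
    fragments = plan(cluster)
    with ThreadPoolExecutor(max_workers=max(1, len(fragments))) as pool:
        partials = pool.map(lambda f: run_fragment(f, min_level), fragments)
        # The driving node concatenates partial results in plan order.
        return [row for part in partials for row in part]
```

`run_query(CLUSTER, 2)` plays the role of the driving node: it never touches the rows itself, it only collects what each fragment returns.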

Demonstration

Download and try Drill!
http://incubator.apache.org/drill/

Q&A
jscott@mapr.com
Engage with us!
@mapr, maprtech, MapR, mapr-technologies
