Apache Drill (ver. 0.1, check ver. 0.2)
1. Apache Drill
Design proposal from
OpenDremel team
Camuel Gilyadov & Constantine Peresypkin,
Email: Camuel@BigDataCraft.com
2. OpenDremel Story: 2010
• Camuel Gilyadov started a Dremel implementation,
named OpenDremel, in summer 2010.
• David Gruzman joined the effort a few months later
followed by Constantine Peresypkin.
• There wasn’t a comprehensive design or architecture.
The goal was to get the hierarchical-columnar transformation
working smoothly and in strict accordance with the
Dremel paper. We have published several working
implementations under the Apache License.
• Hong San was hired as the first full-timer to speed up
development. The Metaxa milestone was set.
3. OpenDremel Story: 2011
• The early OpenDremel design was found too naive, mainly due to
Java underperformance in inner number-crunching loops.
• After fierce brainstorming, the project was restarted from scratch
under the new name Dazo. With Dazo, a query plan is an arbitrary
piece of executable native code, paired with a Java frontend.
• From then on we took inspiration from BigQuery rather than
from the Dremel paper.
• We decided to use Google NaCl as the sandboxing technology to
isolate queries as well as to meter resource consumption. The new
sandbox was named ZeroVM.
• For storage we decided to use OpenStack Swift.
4. OpenDremel Story: 2012
• Four people work full-time and several others part-time. We still
don’t have a fully integrated version, but we are satisfied
with what we have achieved and convinced that the
decisions behind Dazo were correct.
• We believe ZeroVM could be a disruptive technology in
itself revolutionizing BigData@Cloud space.
• We are excited by the Apache Drill initiative and hope to be
useful to it.
5. Design Tenet #1
• Apache Drill must support multi-tenant semantics
internally rather than being run inside guest VMs.
• It should be inspired by BigQuery and not only by
Dremel/PowerDrill/Tenzing papers.
• It is not practical to set up a dedicated cloud (billed
hourly) just to be able to run a query for a few seconds.
• The codebase must be clearly divided into a trusted part
and an untrusted part. The trusted part must be kept to an
absolute minimum and must be peer-reviewed, secured,
audited and metered.
6. Design Tenet #2
• Apache Drill must be extremely flexible and
customizable.
• The schema-on-read concept must be supported.
It must be possible to embed imperative
high-performance parser code into the query.
• SQL is no longer enough. It must be easy to add new
query languages as plug-ins or as user-defined functions
(UDFs).
• Additionally, various data formats must be supported,
such as column-stores, row-stores, PAX, RCFile, etc.
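To illustrate the schema-on-read bullet above, here is a minimal sketch in which the query carries its own imperative parser and the schema is applied only when the raw bytes are read. The names `Parser`, `csv_like_parser` and `run_query` are hypothetical and are not part of any Drill or OpenDremel API.

```python
from typing import Callable, Iterator, List

# Hypothetical type: a user-supplied parser that turns raw bytes into records.
Parser = Callable[[bytes], Iterator[dict]]


def csv_like_parser(raw: bytes) -> Iterator[dict]:
    """Example parser shipped together with the query: 'name,value' lines."""
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, value = line.split(",", 1)
        yield {"name": name, "value": int(value)}


def run_query(raw: bytes, parser: Parser,
              predicate: Callable[[dict], bool]) -> List[dict]:
    """Apply the schema at read time: parse the raw bytes, then filter."""
    return [rec for rec in parser(raw) if predicate(rec)]


# The stored data has no declared schema; the query brings the parser along.
raw_blob = b"a,1\nb,20\nc,300\n"
result = run_query(raw_blob, csv_like_parser, lambda r: r["value"] > 10)
```

Swapping in a different parser function is enough to read a different data format; the stored bytes themselves are never changed.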
7. Design Tenet #2 (cont.)
• We suggest relaxing the query plan format to
arbitrary distributed executable code and the data
format to an arbitrary opaque BLOB.
• This way new query languages and new data formats
could be easily supported without changing the backend.
• As an added benefit, the backend becomes a generic, lightweight,
homogeneous compute-storage cloud.
• Such an approach exhibits good separation of control.
The cloud operator controls and bills for generic
infrastructure, and the query engine is left completely in
the control of the tenant/user.
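The separation described above can be sketched as a backend that stores opaque BLOBs and runs opaque jobs next to them without interpreting either. This is only an in-memory illustration under assumed names (`Job`, `GenericBackend`, `count_lines_job`); it is not the Zwift or ZeroVM interface.

```python
from typing import Callable, Dict, List

# Hypothetical job type: opaque code that maps input bytes to output bytes.
Job = Callable[[bytes], bytes]


class GenericBackend:
    """Stores opaque BLOBs and runs opaque jobs over them."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        self._blobs[name] = data

    def run(self, job: Job, names: List[str]) -> List[bytes]:
        # The backend never inspects the data layout or the job's logic.
        return [job(self._blobs[n]) for n in names]


def count_lines_job(blob: bytes) -> bytes:
    """A tenant-supplied job; the 'query engine' lives entirely in here."""
    return str(blob.count(b"\n")).encode()


backend = GenericBackend()
backend.put("part-0", b"x\ny\n")
backend.put("part-1", b"z\n")

# One partial result per BLOB, merged on the tenant's side.
partials = backend.run(count_lines_job, ["part-0", "part-1"])
total = sum(int(p) for p in partials)
```

Any frontend that can emit a `Job` can target this backend, which is the point of the proposal: adding a query language or data format changes the job, not the backend.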
8. Design Tenet #3
• Apache Drill requests/queries must be hyper-elastic,
meaning they can exploit the compute capacity of
thousands of servers for a short duration of just a few
seconds. No resources may be kept spinning per user
between queries or when idle.
• Traditional VMs are too heavyweight for that.
Container approaches such as OpenVZ/LXC are
not secure enough in a multi-tenancy context.
• We suggest making sandboxing pluggable and
supporting ZeroVM (developed for OpenDremel) and
LXC (fine for private clouds) to begin with.
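One possible shape for pluggable sandboxing is a small interface plus a registry, sketched below. The `Sandbox` interface, the `register` decorator and `SubprocessSandbox` are assumptions made for illustration; the subprocess backend is only a stand-in that starts a fresh process per request and provides none of the isolation expected of ZeroVM or LXC.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class Sandbox(ABC):
    """Hypothetical plug-in interface: run untrusted code on some input."""

    @abstractmethod
    def run(self, code: str, stdin: bytes, timeout_s: float) -> bytes:
        ...


SANDBOXES: Dict[str, Type[Sandbox]] = {}


def register(name: str) -> Callable[[Type[Sandbox]], Type[Sandbox]]:
    """Class decorator that adds a sandbox implementation to the registry."""
    def deco(cls: Type[Sandbox]) -> Type[Sandbox]:
        SANDBOXES[name] = cls
        return cls
    return deco


@register("subprocess")
class SubprocessSandbox(Sandbox):
    """Stand-in only: a fresh child process per request, NOT a secure sandbox."""

    def run(self, code: str, stdin: bytes, timeout_s: float) -> bytes:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin, capture_output=True, timeout=timeout_s, check=True,
        )
        return proc.stdout


def execute(kind: str, code: str, stdin: bytes, timeout_s: float = 10.0) -> bytes:
    # A new sandbox instance per request; nothing is kept alive between queries.
    return SANDBOXES[kind]().run(code, stdin, timeout_s)


out = execute(
    "subprocess",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
    b"drill",
)
```

A ZeroVM-backed or LXC-backed class would be registered under its own name and selected through the same `execute` call, so the choice of sandbox stays a deployment decision.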
9. Design Tenet #4
• Apache Drill must be efficient.
• Value-per-byte is extremely low with BigData.
• Overhead in the inner loop must be kept to a minimum.
• Java was found inefficient for general number
crunching (such as data compression). The main
problem with Java is that GC overhead is unavoidable
for the whole data corpus being scanned. We went so
far as to keep all data in byte arrays and auto-generate
transformation code, yet it still underperformed and
code complexity went through the roof.
10. Suggested Architecture
• Browser / Client: sends the query to the frontend.
• Single-Tenant Frontend: runs inside a traditional guest VM; a JVM hosting the Query Compiler.
• Multi-Tenant Backend: scale-out object store and in-situ compute.
• Flow: Query → Query Compiler → Custom executable job.
11. OpenDremel/Dazo
• Client: two separate unfinished jQuery apps & cmdline app, with no particular codenames.
• Frontend: we call it Metaxa (historic reasons). BQL Parser, unfinished compiler based on Apache Velocity, running in a JVM.
• Backend: we call it Zwift (Swift + ZeroVM). Alpha quality.
• Flow: Query → Query Compiler → Custom executable job.
12. What is Swift?
“Swift is a highly available, distributed,
eventually consistent object/blob store.
Organizations can use Swift to store
lots of data efficiently, safely, and
cheaply.”
14. What is ZeroVM?
Highly secure, low-overhead, low-latency container-style
virtualization based on the Google Native Client project. The
critical security code is transferred verbatim from the Chrome
Browser project and is therefore as secure as the Chrome
Browser. More info: http://ZeroVM.org and
http://news.ycombinator.com/item?id=3746222
15. ZeroVM highlights
1. Disposable VM per request
2. HyperElasticity per request
3. Embeddable into everything
4. High-performance (x86/ARM)
5. Erlang-inspired clustering
6. Written in pure C, no deps