2014 08-20-pit-hug

®
© 2014 MapR Technologies 1
Apache Drill
Andy Pernsteiner
2014-08-20 : Pittsburgh HUG

®
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
APACHE DRILL
40+ contributors
150+ years of experience building
databases and distributed systems

®
Active Drill Community
•  Large community, growing rapidly
–  35-40 contributors, 16 committers
–  Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Hortonworks, Concurrent, many universities
•  In 2014
–  over 20 meet-ups, many more coming soon
–  2 hackathons, with 40+ participants
•  We encourage you to join, learn, contribute, and have fun …

®
Why Drill?

®
Hadoop as an augmentation for the EDW: why?

®
Consolidating multiple schemas is very hard.
Why? With schema-on-write, the ways data can be retrieved are fixed in advance, when the data is written.

®
Silos make analysis very difficult
•  How do I identify a unique {customer, trade} across data sets?
•  How can I guarantee the lack of anomalous behavior if I can’t see all data?

®
Why Hadoop

®
SQL is here to stay

®
YOU CAN’T HANDLE REAL SQL

®
SQL
select * from A
where exists (
select 1 from B where B.b < 100 );
•  Did you know Apache Hive cannot compute it?
–  Affected engines: e.g., Hive, Impala, Spark/Shark
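To make the semantics of the query above concrete, here is a minimal sketch that runs it through Python's built-in sqlite3 module. SQLite is only a stand-in for an engine that accepts this subquery form; the table contents and the helper name are invented for illustration, and an ORDER BY is added so the result order is deterministic.

```python
import sqlite3

def rows_of_a_if_b_has_small_value(a_values, b_values):
    """Run the slide's uncorrelated EXISTS query on in-memory tables A and B."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (a INTEGER)")
    conn.execute("CREATE TABLE B (b INTEGER)")
    conn.executemany("INSERT INTO A VALUES (?)", [(v,) for v in a_values])
    conn.executemany("INSERT INTO B VALUES (?)", [(v,) for v in b_values])
    # Every row of A is returned if at least one row of B has b < 100,
    # otherwise no rows are returned at all.
    rows = conn.execute(
        "SELECT * FROM A WHERE EXISTS (SELECT 1 FROM B WHERE B.b < 100) ORDER BY a"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Because the subquery does not reference A, the EXISTS test acts as an all-or-nothing switch for the outer query.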

®
Self-described Data
select cf.month, cf.year
from hbase.table1;
•  Did you know standard SQL cannot handle the above?
•  Neither can Hive or related engines such as Impala and Shark
•  Reason: no metastore definition is available for the table

®
Rethink SQL for Big Data
Preserve
•  ANSI SQL
–  Familiar and ubiquitous
•  Performance
–  Interactive nature crucial for BI/Analytics
•  One technology
–  Painful to manage different technologies
•  Enterprise ready
–  System-of-record, HA, DR, Security, Multi-tenancy, …
Invent
•  Flexible data-model
–  Allow schemas to evolve rapidly
–  Support semi-structured data types
•  Agility
–  Self-service possible when developer and DBA is same
•  Scalability
–  In all dimensions: data, speed, schemas, processes, management

®
Distance to Data
[Figure: layers between the business (analysts, developers) and the data]
•  MapReduce: “plumbing” development
•  Hive and other SQL-on-Hadoop: modeling and transformations
Existing approaches require a middleman (IT)

®
Real-World Data Modeling and Transformations

®

®
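Distance to Data
[Figure: layers between the business (analysts, developers) and the data, with a third path labeled “Data Agility”]
•  MapReduce: “plumbing” development
•  Hive and other SQL-on-Hadoop: modeling and transformations
•  Data Agility: business (analysts, developers) and data, with no layer in between
Existing approaches require a middleman (IT)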

®
Why Improve Distance to Data?
•  Enable rapid data exploration and application development
•  IT should provide a valuable service without “getting in the way”
•  Can’t add DBAs to keep up with the exponential data growth
•  Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
Goals: (1) Improve time to value; (2) Reduce the burden on IT

®
Self-Service Data Exploration

®
Evolution Towards Self-Service Data Exploration
                                 Data Modeling and Transformation   Data Visualization
Traditional BI w/ RDBMS          IT-driven                          IT-driven
Self-Service BI w/ RDBMS         IT-driven                          Self-service
SQL-on-Hadoop                    IT-driven                          Self-service
Self-Service Data Exploration    Optional                           Self-service
Self-Service Data Exploration: zero-day analytics

®
(1) Self-Describing Data is Ubiquitous
Flat files in DFS
•  Complex data (Thrift, Avro, protobuf)
•  Columnar data (Parquet, ORC)
•  Loosely defined (JSON)
•  Traditional files (CSV, TSV)
Data stored in NoSQL stores
•  Relational-like (rows, columns)
•  Sparse data (NoSQL maps)
•  Embedded blobs (JSON)
•  Document stores (nested objects)
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}

®
(2) Drill’s Data Model is Flexible
HBase
JSON
BSON
CSV
TSV
Parquet
Avro
[Figure: the formats above placed on two flexibility axes: fixed schema to schema-less, and flat to complex]
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}

®
(3) Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
•  Fixed schema
•  Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
•  Fixed schema, evolving schema or schema-less
•  Leverage schema in centralized repository or self-describing data
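As a rough illustration of schema discovery over self-describing data, the sketch below reads JSON records and builds up the set of fields and value types as it goes. This is not Drill's implementation; the function name is hypothetical and the two records are quoted versions of the example documents shown earlier in the deck.

```python
import json

def discover_schema(json_lines):
    """Union the top-level fields seen across records, with the types observed."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            # A field may appear in only some records, or with several types.
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

records = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]
schema = discover_schema(records)
```

No schema is declared up front: `district` and `preschool` each show up in only one record, and both end up in the discovered schema.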

®
Seamless integration with Apache Hive
•  Low latency queries on Hive tables
•  Support for 100s of Hive file formats
•  Ability to reuse Hive UDFs
•  Support for multiple Hive Metastores in a single query

®
Apache Drill: Self-Service SQL for Big Data
AGILITY: INSTANT INSIGHTS TO BIG DATA
•  Direct queries on self-describing data
•  No schemas or ETL required
FLEXIBILITY: ONE INTERFACE FOR HADOOP & NOSQL
•  Query HBase and other NoSQL stores
•  Use SQL to natively operate on complex data types (such as JSON)
FAMILIARITY: EXISTING SKILLS & TECHNOLOGIES
•  Leverage ANSI SQL skills and BI tools
•  Plug-n-play with Hive schema, file formats, UDFs

®
Enterprise Hadoop from MapR
[Figure: MapR distribution stack]
APACHE HADOOP ECOSYSTEM: Storm, Shark, Accumulo, Sentry, Spark, Impala, HBase, MapReduce, Hue, Solr, YARN, Flume, Cascading, Pig, Sqoop, Hive/Stinger/Tez, Whirr, Oozie, Mahout, Zookeeper, Drill
Management
MapR Data Platform: Enterprise-grade, Inter-operability, Multi-tenancy, Security, Operational

®
Interactive SQL-on-Hadoop options
               | Drill 1.0 | Hive 0.13 w/ Tez | Impala 1.x | Shark 0.9
Latency        | Low | Medium | Low | Medium
Files          | Yes (all Hive file formats, plus JSON, Text, …) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (all Hive file formats)
HBase/M7       | Yes | Yes, perf issues | Yes, with issues | Yes, perf issues
Schema         | Hive or schema-less | Hive | Hive | Hive
SQL support    | ANSI SQL | HiveQL | HiveQL (subset) | HiveQL
Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC
Hive compat    | High | High | Low | High
Large datasets | Yes | Yes | Limited | Limited
Nested data    | Yes | Limited | No | Limited
Concurrency    | High | Limited | Medium | Limited

®
Underneath the Covers

®
Storage config

®
Basic Process
[Figure: Zookeeper, and three nodes each with a Drillbit, a Distributed Cache, and DFS/HBase]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node

®
Stages of Query Planning
[Flow: SQL Query → Parser → Logical Planner (heuristic and cost based) → Physical Planner (cost based) → Query Foreman → plan fragments sent to Drillbits]
®
Quick Tour
Self-Service Data Exploration with Apache Drill

®
Zero to Results in 2 Minutes (3 Commands)
Install:
$ tar xzf apache-drill.tar.gz

Launch shell (embedded mode):
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local

Query:
0: jdbc:drill:zk=local>
SELECT count(*) AS incidents, columns[1] AS category
FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
GROUP BY columns[1]
ORDER BY incidents DESC;

Results:
+------------+------------+
| incidents | category |
+------------+------------+
| 8372 | LARCENY/THEFT |
| 4247 | OTHER OFFENSES |
| 3765 | NON-CRIMINAL |
| 2502 | ASSAULT |
...
35 rows selected (0.847 seconds)
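To show what the query computes, here is a small Python sketch of the same logic: each line of a header-less CSV is treated as an array of columns, rows are grouped by the value at index 1, and the groups are ordered by count, largest first. The sample text and the function name are made up for illustration; this is not how Drill executes the query.

```python
import csv
import io
from collections import Counter

def count_by_column(csv_text, index):
    """Group CSV rows by the value at `index` and count them, largest group first."""
    reader = csv.reader(io.StringIO(csv_text))
    counts = Counter(row[index] for row in reader if row)
    return counts.most_common()

# Hypothetical three-row sample: (incident id, category)
sample = "1001,ASSAULT\n1002,LARCENY/THEFT\n1003,LARCENY/THEFT\n"
result = count_by_column(sample, 1)
```

The positional access `row[index]` plays the role of `columns[1]` in the query: no column names or types are declared before reading the file.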

®
Data Sources
select timestamp, message
from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
where errorLevel > 2

dfs1: a cluster in Apache Drill
-  DFS
-  HBase
-  Hive meta-store
logs: a work-space
-  Typically a sub-directory
`AppServerLogs/2014/Jan/p001.parquet`: a table
-  pathnames
-  HBase table
-  Hive table

®
Data Source is in the Query
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2

dfs1: a storage engine instance
-  DFS
-  HBase
-  Hive Metastore/HCatalog
logs: a workspace
-  Sub-directory
-  Hive database
-  HBase namespace
`AppServerLogs/2014/Jan/p001.parquet`: a table
-  pathnames
-  HBase table
-  Hive table

®
Data Sources
•  JSON
•  CSV
•  ORC (i.e., all Hive types)
•  Parquet
•  HBase tables
•  … can combine them
SELECT USERS.name, USERS.emails.work
FROM dfs.logs.`/data/logs` LOGS,
     dfs.users.`/profiles.json` USERS
WHERE LOGS.uid = USERS.uid
  AND LOGS.errorLevel > 5;

®
Files and trees
-- Dynamic queries on files
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;

-- Dynamic queries on entire directory tree
SELECT errorLevel, count(*) AS TotalErrors
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;

®
More with Trees
Use pathname elements as variables in your query…
-- Query some partitions: How many errors per level by month from 2012?

SELECT errorLevel, dir1, count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dir0 >= 2012
GROUP BY errorLevel, dir1;

-- Even more control: How many sales by month in Q4 from 2012 on?

SELECT count(*) AS sales, dir0, dir1
FROM dfs.logs.`/transactionlogs`
WHERE dir0 >= 2012 AND dir1 >= 10 AND purch_flag = true
GROUP BY dir0, dir1;
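The `dir0`, `dir1` columns used above can be thought of as the directory names between the queried root and each file. The sketch below derives them from a path in plain Python; the function name and the example file path are hypothetical, and the real engine does this internally during the scan.

```python
from pathlib import PurePosixPath

def directory_columns(query_root, file_path):
    """Map each directory level below `query_root` to dir0, dir1, ..."""
    relative = PurePosixPath(file_path).relative_to(query_root)
    # The last path element is the file itself, so it is left out.
    return {f"dir{i}": part for i, part in enumerate(relative.parts[:-1])}

cols = directory_columns(
    "/transactionlogs", "/transactionlogs/2012/10/part0001.parquet"
)
```

With a year/month layout, `dir0` carries the year and `dir1` the month, which is what lets a WHERE clause select partitions by path.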

®
Works with HBase and Embedded Blobs
-- Query an HBase table directly (no schemas)

SELECT cf1.month, cf1.year
FROM hbase.table1;

-- Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users

SELECT profile.name, count(profile.children)
FROM (
  SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
  FROM hbase.users
);
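Conceptually, the inner query turns an opaque byte value into a structured record whose fields can then be addressed by name. A minimal Python analogy, assuming the blob is UTF-8 encoded JSON (the row content and the helper name are invented for illustration):

```python
import json

def convert_from_json(blob: bytes):
    """Decode a byte blob holding UTF-8 JSON into nested Python objects."""
    return json.loads(blob.decode("utf-8"))

# Hypothetical stored rows: one column holding an embedded JSON document.
rows = [{"profileBlob": b'{"name": "Michael", "children": ["Ann", "Bob"]}'}]
profiles = [convert_from_json(row["profileBlob"]) for row in rows]
```

After decoding, `profiles[0]["name"]` and `profiles[0]["children"]` correspond to `profile.name` and `profile.children` in the outer query.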

®
Combine Data Sources on the Fly
-- Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.

SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs,
     dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND
      logs.errorLevel > 5;

-- Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user

SELECT users.name, count(*) AS tweetCount
FROM hive.social.tweets tweets,
     hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId, users.name;

®
Use ANSI SQL with no modifications
-- TPC-H standard query 4

SELECT
  o.o_orderpriority, count(*) AS order_count
FROM orders o
WHERE o.o_orderdate >= date '1996-10-01'
  AND o.o_orderdate < date '1996-10-01' + interval '3' month
  AND EXISTS(
    SELECT * FROM lineitem l
    WHERE l.l_orderkey = o.o_orderkey
      AND l.l_commitdate < l.l_receiptdate
  )
GROUP BY o.o_orderpriority
ORDER BY o.o_orderpriority;

®
Demo

®
Drill resources
Wiki: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki
Drill in 10 minutes: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes
Apache page: http://incubator.apache.org/drill/

®
Thank You
@mapr maprtech
tshiran@mapr.com
MapRTechnologies
maprtech
mapr-technologies

®
Underneath the Covers

®
Query Execution
[Figure: Drillbit components]
•  Parsers: SQL Parser, Pig Parser, HiveQL Parser
•  Endpoints: RPC Endpoint, JDBC Endpoint, ODBC Endpoint
•  Planning: Foreman, LogicalPlan, Optimizer, PhysicalPlan, Scheduler
•  Execution: Operators, Distributed Cache
•  StorageEngine Interface: HDFS, HBase, Mongo, Cassandra

®
A Query engine that is…
•  Columnar/Vectorized
•  Optimistic/pipelined
•  Runtime compilation
•  Late binding
•  Extensible

®
Columnar representation
[Figure: a record with columns A B C D E; on disk, the columns A, B, C, D, E are stored separately]

®
Columnar Encoding
•  Values in a column are stored next to one another
–  Better compression
–  Range-map: save min-max, can skip if not present
•  Only retrieve columns participating in query
•  Aggregations can be performed without decoding
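The range-map idea can be sketched as follows: each chunk of a column stores its minimum and maximum, and a scan skips any chunk whose range cannot contain a match. The chunk layout, function names, and sample values are assumptions made for this illustration, not a real file format.

```python
def make_chunk(values):
    """Store a chunk of column values together with its min/max range-map."""
    return {"min": min(values), "max": max(values), "values": values}

def scan_with_range_map(chunks, lo, hi):
    """Return values in [lo, hi], reading only chunks whose range overlaps it."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] < lo or chunk["min"] > hi:
            continue  # the range-map proves nothing in this chunk can match
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if lo <= v <= hi)
    return matches, chunks_read

chunks = [make_chunk([1, 3, 5]), make_chunk([40, 42, 47]), make_chunk([90, 95])]
matches, chunks_read = scan_with_range_map(chunks, 41, 50)
```

Here only the middle chunk is read; the other two are skipped on the basis of their stored min/max alone.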

®
Run-length-encoding & Sum
•  Dataset encoded as <val> <run-length>:
–  2, 4 (4 2’s)
–  8, 10 (10 8’s)
•  Goal: sum all the records
•  Normally:
–  Decompress: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
–  Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8
•  Optimized work: 2 * 4 + 8 * 10
–  Less memory, fewer operations
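The arithmetic above can be checked with a few lines of Python. The sketch compares summing directly on the run-length pairs with decoding first; the pair representation `(value, run_length)` follows the slide, and the function names are illustrative.

```python
def rle_sum(runs):
    """Sum a run-length-encoded dataset without decoding: one multiply per run."""
    return sum(value * length for value, length in runs)

def rle_decode(runs):
    """Expand (value, run_length) pairs into the full list of values."""
    return [value for value, length in runs for _ in range(length)]

runs = [(2, 4), (8, 10)]        # four 2's followed by ten 8's
optimized = rle_sum(runs)       # 2 * 4 + 8 * 10
naive = sum(rle_decode(runs))   # 14 additions over the decoded values
```

Both paths give 88, but the encoded path touches two pairs instead of fourteen values.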

®
Bit-packed Dictionary Sort
•  Dataset encoded with a dictionary and bit-positions:
–  Dictionary: [Rupert, Bill, Larry] {0, 1, 2}
–  Values: [1,0,1,2,1,2,1,0]
•  Normal work
–  Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert
–  Sort: ~24 comparisons of variable width strings
•  Optimized work
–  Sort dictionary: {Bill: 1, Larry: 2, Rupert: 0}
–  Sort bit-packed values
–  Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits
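A minimal sketch of the optimized path, using the dictionary and values from the slide. It is a variant of what the slide describes: instead of keeping the original codes, it re-numbers the codes in sorted dictionary order so that sorting the small integers is enough. The function name is hypothetical.

```python
def sort_dictionary_encoded(dictionary, codes):
    """Sort dictionary-encoded values by comparing strings only within the dictionary."""
    sorted_dictionary = sorted(dictionary)            # the only string comparisons
    rank = {word: i for i, word in enumerate(sorted_dictionary)}
    remap = [rank[word] for word in dictionary]       # old code -> new code
    sorted_codes = sorted(remap[code] for code in codes)  # fixed-width integer sort
    return sorted_dictionary, sorted_codes

dictionary = ["Rupert", "Bill", "Larry"]   # codes 0, 1, 2
codes = [1, 0, 1, 2, 1, 2, 1, 0]
sorted_dictionary, sorted_codes = sort_dictionary_encoded(dictionary, codes)
decoded = [sorted_dictionary[code] for code in sorted_codes]
```

The decoded result matches a direct sort of the eight strings, while the variable-width comparisons were limited to the three dictionary entries.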

®
Drill 4-value semantics
•  SQL’s 3-valued semantics
–  True
–  False
–  Unknown
•  Drill adds a fourth
–  Repeated

®
Vectorization
•  Drill operates on more than one record at a time
–  Word-sized manipulations
–  SIMD-like instructions
•  GCC, LLVM and JVM all do various optimizations automatically
–  Manually code algorithms
•  Logical Vectorization
–  Bitmaps allow lightning fast null-checks
–  Avoid branching to speed CPU pipeline
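The bitmap idea can be illustrated in Python, with the caveat that Python itself does not vectorize anything; the sketch only shows the logic. It assumes a validity bitmap held in an integer, where bit i set means values[i] is non-null. Both function names are made up for this example.

```python
def sum_non_null(values, validity_bitmap):
    """Sum the non-null values; a null contributes 0 without an if/else branch."""
    total = 0
    for i, value in enumerate(values):
        total += value * ((validity_bitmap >> i) & 1)
    return total

def count_nulls(num_values, validity_bitmap):
    """Count nulls for the whole vector from the bitmap alone."""
    mask = (1 << num_values) - 1
    return num_values - bin(validity_bitmap & mask).count("1")

values = [10, 20, 30, 40]
bitmap = 0b1011                 # values[2] is null
total = sum_non_null(values, bitmap)
nulls = count_nulls(len(values), bitmap)
```

The null count is answered by a single word-sized operation on the bitmap, which is the kind of check the slide calls lightning fast.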

®
Runtime Compilation is Faster
•  JIT is smart, but more gains with runtime compilation
•  Janino: Java-based Java compiler
From http://bit.ly/16Xk32x

®
Drill compiler
[Figure: CodeModel generates code → Janino compiles runtime byte-code; that byte-code is combined with precompiled byte-code templates (merge byte-code of the two classes) → loaded class]
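The general idea of generating and compiling specialized code at run time can be shown with a loose Python analogy. This is not Janino or CodeModel, and it involves no byte-code merging; it only shows source text being generated for one specific predicate and compiled once before being applied to many rows. All names are hypothetical.

```python
def compile_filter(column, op, literal):
    """Generate, compile, and return a function specialized for one predicate."""
    allowed_ops = {"<", "<=", ">", ">=", "==", "!="}
    if op not in allowed_ops:
        raise ValueError(f"unsupported operator: {op}")
    # The column name and literal are baked into the generated source text.
    source = (
        "def generated_filter(row):\n"
        f"    return row[{column!r}] {op} {literal!r}\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated_filter"]

keep = compile_filter("errorLevel", ">", 2)
rows = [{"errorLevel": 1}, {"errorLevel": 3}, {"errorLevel": 5}]
kept = [row for row in rows if keep(row)]
```

The compiled function contains no generic expression interpretation; the comparison is fixed at generation time.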

®
Optimistic
[Chart: Speed vs. check-pointing. Labels: “No need to checkpoint”, “Checkpoint frequently”, “Apache Drill”. Axis ticks from 0 to 160.]

®
Optimistic Execution
•  Recovery code trivial
–  Running instances discard the failed query’s intermediate state
•  Pipelining possible
–  Send results as soon as batch is large enough
–  Requires barrier-less decomposition of query

®
Batches of Values
•  Value vectors
–  List of values, with same schema
–  With the 4-value semantics for each value
•  Shipped around in batches
–  max 256k bytes in a batch
–  max 64K rows in a batch
•  RPC designed for multiple replies to a request
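The two limits above can be combined into a simple batching rule. The sketch assumes fixed-width values and treats the slide's numbers as configurable defaults; the constant and function names are illustrative and do not come from Drill's code.

```python
MAX_BATCH_ROWS = 64 * 1024     # "max 64K rows in a batch"
MAX_BATCH_BYTES = 256 * 1024   # "max 256k bytes in a batch"

def batches(values, value_width,
            max_rows=MAX_BATCH_ROWS, max_bytes=MAX_BATCH_BYTES):
    """Yield slices of `values` that respect both the row and the byte limit."""
    rows_per_batch = max(1, min(max_rows, max_bytes // value_width))
    for start in range(0, len(values), rows_per_batch):
        yield values[start:start + rows_per_batch]

small = list(batches(list(range(10)), value_width=4, max_rows=4, max_bytes=12))
first = next(batches(list(range(100_000)), value_width=8))
```

With 8-byte values the byte limit is the binding one: 256 KB holds 32,768 values, fewer than the 64K row cap.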

®
Pipelining
•  Record batches are pipelined between nodes
–  ~256kB usually
•  Unit of work for Drill
–  Operators work on a batch
•  Operator reconfiguration happens at batch boundaries
[Figure: record batches flowing between three Drillbits]

®
Pipelining Record Batches
[Figure: Drillbit components]
•  Parsers: SQL Parser, Pig Parser, HiveQL Parser
•  Endpoints: RPC Endpoint, JDBC Endpoint, ODBC Endpoint
•  Planning: Foreman, LogicalPlan, Optimizer, PhysicalPlan, Scheduler
•  Execution: Operators, Distributed Cache
•  StorageEngine Interface: HDFS, HBase, Mongo, Cassandra

®
Pipelining
•  Random access: sort without copy or restructuring
•  Avoids serialization/deserialization
•  Off-heap (no GC woes when lots of memory)
•  Full specification + off-heap + batch
–  Enables C/C++ operators (fast!)
•  Read/write to disk
–  when data larger than memory
[Figure: Drill Bit memory; overflow uses disk]

®
Cost-based Optimization
•  Using Optiq, an extensible framework
–  Pluggable rules, and cost model
•  Rules for distributed plan generation
–  Insert Exchange operator into physical plan
–  Optiq enhanced to explore parallel query plans
•  Pluggable cost model
–  CPU, IO, memory, network cost (data locality)
–  Storage engine features (HDFS vs HIVE vs HBase)
[Figure: query optimizer with pluggable rules and a pluggable cost model]

®
Distributed Plan Cost
•  Operators have distribution property
–  Hash, Broadcast, Singleton, …
•  Exchange operator to enforce distributions
–  Hash: HashToRandomExchange
–  Broadcast: BroadcastExchange
–  Singleton: UnionExchange, SingleMergeExchange
•  Enumerate all, use cost to pick best
–  Merge Join vs Hash Join
–  Partition-based join vs Broadcast-based join
–  Streaming Aggregation vs Hash Aggregation
–  Aggregation in one phase or two phases: partial local aggregation followed by final aggregation
[Figure: example plan with HashToRandomExchange, Sort, and Streaming-Aggregation over three data sources]
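Two-phase aggregation with a hash exchange in between can be sketched in Python as below. Lists stand in for nodes, the aggregate is a simple count per key, and every function name is hypothetical; the sketch shows the shape of the plan, not Drill's operators.

```python
from collections import Counter

def partial_aggregate(rows):
    """Phase 1: count keys locally on one node."""
    return Counter(rows)

def hash_exchange(partials, num_receivers):
    """Route every key's partial count to the single receiver chosen by its hash."""
    buckets = [[] for _ in range(num_receivers)]
    for partial in partials:
        for key, count in partial.items():
            buckets[hash(key) % num_receivers].append((key, count))
    return buckets

def final_aggregate(bucket):
    """Phase 2: add up the partial counts that arrived at one receiver."""
    totals = Counter()
    for key, count in bucket:
        totals[key] += count
    return totals

def two_phase_count(node_rows, num_receivers=2):
    partials = [partial_aggregate(rows) for rows in node_rows]
    result = {}
    for bucket in hash_exchange(partials, num_receivers):
        result.update(final_aggregate(bucket))  # keys never overlap across buckets
    return result

counts = two_phase_count([["a", "b", "a"], ["b", "c"], ["a"]])
```

The local phase shrinks what has to cross the network: each node sends one pair per distinct key rather than one record per row.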
