Drilling into Data with Apache Drill

Tomer Shiran, Apache Drill Founder and PMC Member
Jacques Nadeau, Apache Drill PMC Chair

Tomer Shiran: tshiran@apache.org, @tshiran, Drill founder and PMC Member
Jacques Nadeau: jnadeau@apache.org, @intjesus, Drill PMC Chair (VP, Apache Drill)

Apache Drill
• Open source SQL query engine for non-relational datastores
– JSON document model
– Columnar
• Key advantages:
– Query any non-relational datastore
– No overhead (creating and maintaining schemas, transforming data, …)
– Treat your data like a table even when it’s not
– Keep using the BI tools you love
– Scales from one laptop to 1000s of servers
– Great performance and scalability

Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational database
environment on top of Hadoop, Drill instead enables a SQL language interface to
data in numerous formats, without requiring a formal schema to be declared. This
enables plug-and-play discovery over a huge universe of data without
prerequisites and preparation. So while Drill uses SQL, and can connect to
Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might
be SQL-on-Everything, with very low setup requirements.”
Andrew Brust,

Drill Integrates With What You Have
Any Non-Relational Datastore
• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores
Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
– Tableau
– Qlik
– MicroStrategy
– TIBCO Spotfire
– Excel
• Command line (Drill shell)
• Web and mobile apps
– Many JSON-powered chart libraries (see D3.js)
• SAS, R, …

Achieving “End-to-End Performance”
Execute fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up
Iterate fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle

JSON Model, Columnar Speed
[Figure: data formats arranged along two axes, fixed schema vs. schema-less and flat vs. complex. Formats shown: JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV.]
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{
"name": {
"first": "Michael",
"last": "Smith"
},
"hobbies": ["ski", "soccer"],
"district": "Los Altos"
}
{
"name": {
"first": "Jennifer",
"last": "Gates"
},
"hobbies": ["sing"],
"preschool": "CCLC"
}
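The two records above share some fields and differ in others, yet can still be read as one table. A minimal Python sketch of that idea, assuming columns are discovered while reading and absent fields become nulls (the helper names are hypothetical and this is not Drill's implementation):

```python
# Two self-describing records, as on the slide: they share "name" and
# "hobbies" but differ in the remaining field.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def discover_columns(rows):
    """Collect top-level field names in first-seen order."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return columns


def as_table(rows):
    """Return (columns, tuples), filling absent fields with None."""
    columns = discover_columns(rows)
    return columns, [tuple(row.get(c) for c in columns) for row in rows]


columns, table = as_table(records)
```

No schema is declared up front: the column list is simply whatever the records turned out to contain.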

Apache Drill Provides the Best of Both Worlds
Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore
Even When Your Data Doesn’t
• Path based queries and wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex Functions and Relational Operators
– FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON Sensor analytics
• Complex data analysis
• Alternative DSLs
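Two of the complex functions listed above, kvgen and repeated_count, can be illustrated in a few lines of Python. This is only a sketch of their semantics as commonly described (a map becomes a list of key/value entries; an array yields its element count), not Drill's implementation:

```python
def kvgen(mapping):
    """Turn a map into a list of {"key": ..., "value": ...} entries."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def repeated_count(values):
    """Number of elements in a repeated (array) value."""
    return len(values)


# The "votes" map from the Yelp user record used elsewhere in this deck.
votes = {"funny": 1, "useful": 5, "cool": 0}
pairs = kvgen(votes)
hobby_count = repeated_count(["ski", "soccer"])
```

Turning a map into key/value rows is useful when the keys are themselves data (for example, dynamic attribute names) and cannot be listed in advance.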

Why? To Support the Changing Data Organization
Data Dev Circa 2000
1. Developer comes up with requirements
2. DBA defines tables
3. DBA defines indices
4. DBA defines FK relationships
5. Developer stores data
6. BI builds reports
7. Analyst views reports
8. DBA adds materialized views
Data Today
1. Developer builds app, defines schema, stores data
2. Analyst queries data
3. Data engineer fixes performance problems or fills functionality gaps

Everything Starts With a Drillbit…
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible
[Figure: a Drillbit is a single process (daemon or CLI).]

Data Lake, More Like Data Maelstrom
[Figure: data spread across clustered servers (an HBase & HDFS cluster, an HDFS cluster, a MongoDB cluster, a Cassandra cluster) and desktops (Windows, Mac).]

Run Drillbits Wherever; Whatever Your Data
[Figure: a Drillbit runs alongside each HDFS, HBase, mongod, and Cassandra node in the clusters, and on the Windows and Mac desktops.]

Connect to Any Drillbit with ODBC, JDBC, C, Java, REST
1. User connects to Drillbit
2. That Drillbit becomes Foreman
– Foreman generates execution plan
– Cost-based query optimization & locality
3. Execution fragments are farmed to other Drillbits
4. Drillbits exchange data as necessary to guarantee relational algebra
5. Results are returned to user through Foreman Drillbit
[Figure: the user connects to one Drillbit (the foreman), which coordinates the other Drillbits.]
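The flow in steps 3 to 5 can be pictured with a toy model in Python. It assumes a simple count-by-stars query where each Drillbit computes a partial aggregate over its local rows and the foreman merges the partials. The function names and sample rows are hypothetical, and real planning, data exchange, and networking are left out:

```python
from collections import Counter


def run_fragment(rows):
    """Work done by one Drillbit: a partial count over its local rows."""
    return Counter(row["stars"] for row in rows)


def foreman_execute(partitions):
    """Toy foreman: one fragment per partition, then merge the partials."""
    partials = [run_fragment(rows) for rows in partitions]  # steps 3-4
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the counts together
    return sorted(total.items())  # step 5: final result for the user


# Hypothetical data, split across two Drillbits.
partitions = [
    [{"stars": 5}, {"stars": 4}, {"stars": 5}],
    [{"stars": 1}, {"stars": 5}],
]
result = foreman_execute(partitions)
```

Because counts can be merged by addition, each node only ships a small partial result rather than its raw rows.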

Run Drill in Embedded Mode (drill-embedded)
$ tar xf apache-drill-1.0.0.tar.gz
$ cd apache-drill-1.0.0
$ bin/drill-embedded
> SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1;
+----------------+----------------------------------+---------------+-------+
| yelping_since | votes | review_count | name |
+----------------+----------------------------------+---------------+-------+
| 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee |
+----------------+----------------------------------+---------------+-------+
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode
• Web UI is available at localhost:8047
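The same port also serves Drill's REST interface. The Python sketch below builds, without sending, a request that submits SQL as a JSON POST to /query.json. The endpoint path and payload shape should be checked against the documentation for your Drill version, and the helper name is hypothetical:

```python
import json
import urllib.request


def build_query_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request carrying one SQL query."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1"
)

# To run it against a live Drillbit:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Separating request construction from sending keeps the example usable without a running Drillbit.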

Review the Query Profile in the Web UI (localhost:8047)

Run Drill in Distributed Mode
$ zkServer start # ZooKeeper maintains the list of drillbits in the cluster
$ bin/drillbit.sh start # conf/drill-override.conf includes cluster name and ZK nodes
$ bin/drill-conf # or bin/drill-localhost to skip ZK lookup
> SELECT stars, count(*)
FROM dfs.root.`/Users/tshiran/yelp/review.json`
GROUP BY stars ORDER BY stars;
+--------+---------+
| stars | EXPR$1 |
+--------+---------+
| 1 | 110772 |
| 2 | 102737 |
| 3 | 163761 |
| 4 | 342143 |
| 5 | 406045 |
+--------+---------+
5 rows selected (3.739 seconds)

2. CONFIGURE DATASTORES (STORAGE PLUGINS)

Define Workspaces in the File Storage Plugin

The Data: Files
{
"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything ...",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

The Data: MongoDB Collections
$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin (empty)
local 0.078GB
yelp 0.453GB
> use yelp
> db.users.findOne()
{
"_id" : ObjectId("54566cdf3237149de181a92a"),
"yelping_since" : "2012-02",
"votes" : {
"funny" : 1,
"useful" : 5,
"cool" : 0
},
"review_count" : 6,
"name" : "Lee",
"user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
"friends" : [ ]
}

Are There More 5-Star or 1-Star Reviews?
> SELECT stars, count(*)
FROM dfs.root.`/Users/tshiran/yelp/review.json`
GROUP BY stars ORDER BY stars;
+--------+---------+
| stars | EXPR$1 |
+--------+---------+
| 1 | 110772 |
| 2 | 102737 |
| 3 | 163761 |
| 4 | 342143 |
| 5 | 406045 |
+--------+---------+

Using Storage Plugins and Workspaces
> SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json`
LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;
In dfs.demo.`yelp/review.json`, dfs is the storage plugin, demo is the workspace, and yelp/review.json is the path relative to the workspace.
Storage Plugin   Workspace   Table
dfs              Path        Path relative to workspace
mongo            Database    Collection
hive             Database    Table
hbase            Namespace   Table
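For the dfs plugin, a workspace is effectively a named base directory. The Python sketch below resolves a workspace-qualified table name to a file path. The workspace locations are assumptions inferred from the two equivalent queries on this slide (root at / and demo at /Users/tshiran/data); a real deployment defines them in the storage plugin configuration:

```python
import posixpath

# Hypothetical workspace locations, inferred from the slide's queries.
WORKSPACES = {
    ("dfs", "root"): "/",
    ("dfs", "demo"): "/Users/tshiran/data",
}


def resolve(plugin, workspace, relative_path):
    """Join the workspace location with a path relative to it."""
    location = WORKSPACES[(plugin, workspace)]
    # Drop a leading slash so the path is always treated as relative.
    return posixpath.join(location, relative_path.lstrip("/"))


via_demo = resolve("dfs", "demo", "yelp/review.json")
via_root = resolve("dfs", "root", "/Users/tshiran/data/yelp/review.json")
```

Under these assumptions both names point at the same file, which is why the first two queries on the slide return the same row.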

Most Common User Names (MongoDB)
> SELECT name, count(*) AS users
FROM mongo.yelp.users
GROUP BY name
ORDER BY users DESC LIMIT 10;
+------------+------------+
| name | users |
+------------+------------+
| David | 2453 |
| John | 2378 |
| Michael | 2322 |
| Chris | 2202 |
| Mike | 2037 |
| Jennifer | 1867 |
| Jessica | 1463 |
| Jason | 1457 |
| Michelle | 1439 |
| Brian | 1436 |
+------------+------------+

Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses
FROM dfs.demo.`/yelp/business.json`
GROUP BY state, city
ORDER BY businesses DESC LIMIT 10;
+------------+------------+-------------+
| state | city | businesses |
+------------+------------+-------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+-------------+

business.json (1)
{
"business_id": "4bEjOyTaDG24SY5TxsaUNQ",
"full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
"hours": {
"Monday": {"close": "23:00", "open": "07:00"},
"Tuesday": {"close": "23:00", "open": "07:00"},
"Friday": {"close": "00:00", "open": "07:00"},
"Wednesday": {"close": "23:00", "open": "07:00"},
"Thursday": {"close": "23:00", "open": "07:00"},
"Sunday": {"close": "23:00", "open": "07:00"},
"Saturday": {"close": "00:00", "open": "07:00"}
},
"open": true,
"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
"city": "Las Vegas",
"review_count": 4084,
"name": "Mon Ami Gabi",
"neighborhoods": ["The Strip"],
"longitude": -115.172588519464,

business.json (2)
"state": "NV",
"stars": 4.0,
"attributes": {
"Alcohol": "full_bar",
"Noise Level": "average",
"Has TV": false,
"Attire": "casual",
"Ambience": {
"romantic": true,
"intimate": false,
"touristy": false,
"hipster": false,
"classy": true,
"trendy": false,
"casual": false
},
"Good For": {"dessert": false, "latenight": false, "lunch": false,
"dinner": true, "breakfast": false, "brunch": false}
}
}

Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Saturday.`open` < '22:00' AND
b.hours.Saturday.`close` > '22:00'
LIMIT 2;
+------------------------------+------------------------------------------------+
| name | hours |
+------------------------------+------------------------------------------------+
| Chang Jiang Chinese Kitchen | {"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Saturday":{"close":"23:00","open":"11:00"}} |
+------------------------------+------------------------------------------------+
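The filter above reaches into a nested map and compares zero-padded HH:MM strings, which sort the same way as the times they represent. A minimal Python sketch of the same predicate, using records taken from this deck (the function name is hypothetical):

```python
def is_open_at(business, day, time):
    """Mirror the WHERE clause: open < time AND close > time, as strings."""
    hours = business.get("hours", {}).get(day)
    if hours is None:
        return False  # no hours listed for that day
    return hours["open"] < time < hours["close"]


chang_jiang = {
    "name": "Chang Jiang Chinese Kitchen",
    "hours": {"Saturday": {"close": "22:30", "open": "11:00"}},
}
mon_ami_gabi = {
    "name": "Mon Ami Gabi",
    "hours": {"Saturday": {"close": "00:00", "open": "07:00"}},
}

open_now = [b["name"] for b in (chang_jiang, mon_ami_gabi)
            if is_open_at(b, "Saturday", "22:00")]
```

Note one consequence of comparing strings: a place that closes at "00:00" (midnight) does not satisfy close > '22:00', so Mon Ami Gabi is excluded here even though it is still open at 22:00.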

It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, b.hours.Friday AS friday, categories
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Friday.`open` < '22:00' AND
b.hours.Friday.`close` > '22:00' AND
REPEATED_CONTAINS(categories, 'Mediterranean') AND
city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 2;
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| name | friday | categories |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| Olives | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
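The array filter, ordering, and limit in this query can be sketched in Python as follows. REPEATED_CONTAINS is modeled as exact membership, the opening-hours conditions are omitted for brevity, and the star ratings for Olives and Marrakech are hypothetical values chosen for illustration (the slide does not show them):

```python
def repeated_contains(values, keyword):
    """Simplified model: true if the array holds the exact keyword."""
    return keyword in values


businesses = [
    {"name": "Marrakech Moroccan Restaurant", "city": "Las Vegas",
     "stars": 3.5,  # hypothetical rating
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan",
                    "Restaurants"]},
    {"name": "Mon Ami Gabi", "city": "Las Vegas", "stars": 4.0,
     "categories": ["Breakfast & Brunch", "Steakhouses", "French",
                    "Restaurants"]},
    {"name": "Olives", "city": "Las Vegas",
     "stars": 4.0,  # hypothetical rating
     "categories": ["Mediterranean", "Restaurants"]},
]

matches = sorted(
    (b for b in businesses
     if b["city"] == "Las Vegas"
     and repeated_contains(b["categories"], "Mediterranean")),
    key=lambda b: b["stars"],
    reverse=True,          # ORDER BY stars DESC
)[:2]                      # LIMIT 2
names = [b["name"] for b in matches]
```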

Flatten Repeated Values
> SELECT name, categories
FROM dfs.demo.`yelp/business.json` LIMIT 3;
+-----------------------------+-------------------------------------------+
| name | categories |
+-----------------------------+-------------------------------------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+-----------------------------+-------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
FROM dfs.demo.`yelp/business.json` LIMIT 5;
+-----------------------------+-------------------------+
| name | categories |
+-----------------------------+-------------------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+-----------------------------+-------------------------+
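The effect of FLATTEN on the rows above can be sketched in Python: each input row is emitted once per array element, with the other columns repeated. This illustrates the semantics only and is not Drill's implementation; in this sketch a row with an empty array produces no output rows:

```python
def flatten(rows, column):
    """Yield one copy of each row per element of row[column]."""
    for row in rows:
        for value in row[column]:
            # Copy the row, replacing the array with a single element.
            yield {**row, column: value}


rows = [
    {"name": "Eric Goldberg, MD",
     "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]
flat = list(flatten(rows, "categories"))
```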

Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
FROM (SELECT name, FLATTEN(categories) AS category
FROM dfs.demo.`yelp/business.json`) c
GROUP BY category ORDER BY businesses DESC;
+-----------------------------------+-------------+
| category | businesses |
+-----------------------------------+-------------+
| Restaurants | 14303 |
| Shopping | 6428 |
…
| Australian | 1 |
| Boat Dealers | 1 |
| Firewood | 1 |
+-----------------------------------+-------------+
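> SELECT name, categories FROM dfs.demo.`yelp/business.json`
WHERE REPEATED_CONTAINS(categories, 'Australian');
+--------------------+--------------------------------------------------------------------------+
| name | categories |
+--------------------+--------------------------------------------------------------------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+--------------------+--------------------------------------------------------------------------+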
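The category ranking combines a flatten step with a grouped count. A compact Python sketch of that pipeline, run over three sample rows from the earlier FLATTEN slide rather than the full dataset (the function name is hypothetical):

```python
from collections import Counter


def category_counts(businesses):
    """Count businesses per category, most common first."""
    counts = Counter(
        category
        for business in businesses
        for category in business["categories"]  # the flatten step
    )
    return counts.most_common()  # GROUP BY ... ORDER BY count DESC


sample = [
    {"name": "Eric Goldberg, MD",
     "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]
ranking = category_counts(sample)
```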

Create a View for Name-Gender Mapping
> CREATE VIEW dfs.tmp.`names` AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------------+------------+
| name | gender |
+------------+------------+
| John | Male |
+------------+------------+
[Figure: names.csv, with columns[0] (name) and columns[4] (gender) highlighted.]
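The view treats each line of a header-less CSV file as an array named columns and gives two of its positions friendly names. A Python sketch of the same projection follows; the sample rows are hypothetical, since the actual contents of names.csv are not shown, and only positions 0 and 4 matter:

```python
import csv
import io

# Hypothetical sample in the spirit of names.csv (no header row).
SAMPLE = "John,x,y,z,Male\nJennifer,x,y,z,Female\n"


def names_view(text):
    """Project columns[0] as name and columns[4] as gender."""
    reader = csv.reader(io.StringIO(text))
    return [{"name": columns[0], "gender": columns[4]}
            for columns in reader]


view = names_view(SAMPLE)
john = [row for row in view if row["name"] == "John"]
```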

Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY n.gender;
+------------+------------+------------+
| gender | users | stars |
+------------+------------+------------+
| Female | 103684 | 3.77 |
| Male | 97430 | 3.696 |
| Unknown | 18409 | 3.727 |
+------------+------------+------------+
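This query joins a MongoDB collection with a view over a CSV file and then aggregates. The same logic can be sketched in Python as a hash join followed by a grouped average. The sample data is hypothetical, and the sketch assumes each name appears at most once in the names view (a SQL join would otherwise multiply rows):

```python
from collections import defaultdict


def average_stars_by_gender(users, names):
    """Inner join on name, then count and average per gender."""
    gender_of = {n["name"]: n["gender"] for n in names}  # build side
    totals = defaultdict(lambda: [0, 0.0])  # gender -> [count, sum]
    for user in users:  # probe side
        gender = gender_of.get(user["name"])
        if gender is None:
            continue  # no match: dropped, as in an inner join
        totals[gender][0] += 1
        totals[gender][1] += user["average_stars"]
    return {g: (count, round(total / count, 2))
            for g, (count, total) in totals.items()}


# Hypothetical rows standing in for mongo.yelp.users and dfs.tmp.names.
users = [
    {"name": "John", "average_stars": 4.0},
    {"name": "Jennifer", "average_stars": 3.5},
    {"name": "John", "average_stars": 3.0},
    {"name": "Zed", "average_stars": 5.0},
]
names = [
    {"name": "John", "gender": "Male"},
    {"name": "Jennifer", "gender": "Female"},
]
summary = average_stars_by_gender(users, names)
```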

Thank You!
• Download at drill.apache.org
• Get in touch:
• tshiran@apache.org
• jnadeau@apache.org
• Ask questions:
• user@drill.apache.org
• Tweet: @ApacheDrill
