SlideShare une entreprise Scribd logo
1  sur  41
Building Highly Flexible, High Performance
query engines -
Highlights from Apache Drill project
Neeraja Rentachintala
Director, Product Management
MapR Technologies
Agenda
• Apache Drill overview
• Using Drill
• Under the Hood
• Status and progress
• Demo
APACHE DRILL OVERVIEW
Hadoop workloads and APIs
Use case ETL and
aggregation
(batch)
Predictive
modeling and
analytics
(batch)
Interactive SQL –
Data exploration,
Adhoc queries &
reporting
Search Operational
(user facing
applications,
point queries)
API MapReduce
Hive
Pig
Cascading
Mahout
MLLib
Spark
Drill
Shark
Impala
Hive on Tez
Presto
Solr
Elasticsear
ch
HBase API
Phoenix
Interactive SQL and Hadoop
• Opens up Hadoop data to broader
audience
– Existing SQL skill sets
– Broad eco system of tools
• New and improved BI/Analytics
use cases
– Analysis on more raw data, new
types of data and real time data
• Cost savings
Enterprise users
Data landscape is changing
New types of applications
• Social, mobile, Web, “Internet of
Things”, Cloud…
• Iterative/Agile in nature
• More users, more data
New data models & data types
• Flexible (schema-less) data
• Rapidly changing
• Semi-structured/Nested data
{
"data": [
"id": "X999_Y999",
"from": {
"name": "Tom Brady", "id": "X12"
},
"message": "Looking forward to 2014!",
"actions": [
{
"name": "Comment",
"link": "http://www.facebook.com/X99/posts Y999"
},
{
"name": "Like",
"link": "http://www.facebook.com/X99/posts Y999"
}
],
"type": "status",
"created_time": "2013-08-02T21:27:44+0000",
"updated_time": "2013-08-02T21:27:44+0000"
}
}
JSON
Traditional datasets
• Comes from transactional applications
• Stored for historical purposes and/or
for large scale ETL/Analytics
• Well defined schemas
• Managed centrally by DBAs
• No frequent changes to schema
• Flat datasets
New datasets
• Comes from new applications (Ex: Social
feeds, clickstream, logs, sensor data)
• Enable new use cases such as Customer
Satisfaction, Product/Service
optimization
• Flexible data models/managed within
applications
• Schemas evolving rapidly
• Semi-structured/Nested data
Hadoop evolving as central hub for analysis
ProvidesCosteffective,flexiblewaytostoreandandprocessdataat scale
Existing SQL approaches will not always work for
big data needs
• New data models/types don’t map well to the relational models
– Many data sources do not have rigid schemas (HBase, Mongo etc)
• Each record has a separate schema
• Sparse and wide rows
– Flattening nested data is error-prone and often impossible
• Think about repeated and optional fields at every level…
• A single HBase value could be a JSON document (compound nested type)
• Centralized schemas are hard to manage for big data
• Rapidly evolving data source schemas
• Lots of new data sources
• Third party data
• Unknown questions
Model data
Move data into
traditional systems
New questions
/requirements
Schema changes or
new data sources
DBA/DWH teams
Analyze Big data
Enterprise Users
Apache Drill
Open Source SQL on Hadoop for Agility with Big Data exploration
FLEXIBLE SCHEMA
MANAGEMENT
ANALYTICS ON
NOSQL DATA
PLUG AND PLAY
WITH EXISTING
TOOLS
Analyze data with
or without
centralized
schemas
Analyze data using
familiar BI/Analytics
and SQL based tools
Analyze semi
structured &
nested data with
no modeling/ETL
Flexible schema management
{
“ID”: 1,
“NAME”: “Fairmont San Francisco”,
“DESCRIPTION”: “Historic grandeur…”,
“AVG_REVIEWER_SCORE”: “4.3”,
“AMENITY”: {“TYPE”: “gym”,
DESCRIPTION: “fitness center”
},
{“TYPE”: “wifi”,
“DESCRIPTION”: “free wifi”},
“RATE_TYPE”: “nightly”,
“PRICE”: “$199”,
“REVIEWS”: [“review_1”, “review_2”],
“ATTRACTIONS”: “Chinatown”,
}
JSON
Existing SQL
solutions X
HotelID AmenityID
1 1
1 2
ID Type Descript
ion
1 Gym Fitness
center
2 Wifi Free wifi
Drill
{
“ID”: 1,
“NAME”: “Fairmont San Francisco”,
“DESCRIPTION”: “Historic grandeur…”,
“AVG_REVIEWER_SCORE”: “4.3”,
“AMENITY”: {“TYPE”: “gym”,
DESCRIPTION: “fitness
center”
},
{“TYPE”: “wifi”,
“DESCRIPTION”: “free wifi”},
“RATE_TYPE”: “nightly”,
“PRICE”: “$199”,
“REVIEWS”: [“review_1”, “review_2”],
“ATTRACTIONS”: “Chinatown”,
}
JSON
Drill
Flexible schema management
HotelID AmenityID
1 1
1 2
ID Type Descript
ion
1 Gym Fitness
center
2 Wifi Free wifi
Drill doesn’t require any schema definitions to query data making it faster to get
insights from data for users. Drill leverages schema definitions if exists.
Key features
• Dynamic/schema-less queries
• Nested data
• Apache Hive integration
• ANSI SQL/BI tool integration
Querying files
• Direct queries on a local or a distributed file system (HDFS, S3
etc)
• Configure one or more directories in file system as “Workspaces”
– Think of this as similar to schemas in databases
– Default workspace points to “root” location
• Specify a single file or a directory as ‘Table’ within query
• Specify schema in query or let Drill discover it
• Example:
• SELECT * FROM dfs.users.`/home/mapr/sample-data/profiles.json`
dfs File system as data source
users Workspace (corresponds to a directory)
/home/mapr/sample-
data/profiles.json
Table
More examples
• Query on single file
SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan/part0001.txt`
• Query on directory
SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan` where errorLevel=1;
• Joins on files
SELECT c.c_custkey,sum(o.o_totalprice)
FROM
dfs.`/home/mapr/tpch/customer.parquet` c
JOIN
dfs.`/home/mapr/tpch/orders.parquet` o
ON c.c_custkey = o.o_custkey
GROUP BY c.c_custkey
LIMIT 10
Querying HBase
• Direct queries on HBase tables
– SELECT row_key, cf1.month, cf1.year FROM hbase.table1;
– SELECT CONVERT_FROM(row_key, UTF-8) as HotelName from FROM
HotelData
• No need to define a parallel/overlay schema in Hive
• Encode and Decode data from HBase using Convert functions
– Convert_To and Convert_From
Nested data
• Nested data as first class entity: Extensions to SQL for nested
data types, similar to BigQuery
• No upfront flattening/modeling required
• Generic architecture for a broad variety of nested data types
(eg:JSON, BSON, XML, AVRO, Protocol Buffers)
• Performance with ground up design for nested data
• Example:
SELECT
c.name, c.address, REPEATED_COUNT(c.children)
FROM(
SELECT
CONVERT_FROM(cf1.user-json-blob, JSON) AS c
FROM
hbase.table1
)
Apache Hive integration
• Plug and Play integration in existing
Hive deployments
• Use Drill to query data in Hive
tables/views
• Support to work with more than
one Hive metastore
• Support for all Hive file formats
• Ability to use Hive UDFs as part of
Drill queries
Hive
meta
store
Files HBase
Hive
SQL layer
Drill
SQL layer +
execution
engine
MapReduce
execution
framework
Cross data source queries
• Combine data from Files, HBase, Hive in one query
• No central metadata definitions necessary
• Example:
– USE HiveTest.CustomersDB
– SELECT Customers.customer_name, SocialData.Tweets.Count
FROM Customers
JOIN HBaseCatalog.SocialData SocialData
ON Customers.Customer_id = Convert_From(SocialData.rowkey, UTF-8)
BI tool integration
• Standard JDBC/ODBC drivers
• Integration Tableau, Excel, Microstrategy, Toad,
SQuirreL...
SQL support
• ANSI SQL compatibility
– “SQL Like” not enough
• SQL data types
– SMALLINT, BIGINT, TINYINT, INT, FLOAT, DOUBLE,DATE, TIMESTAMP, DECIMAL, VARCHAR,
VARBINARY ….
• All common SQL constructs
• SELECT, GROUP BY, ORDER BY, LIMIT, JOIN, HAVING, UNION, UNION ALL, IN/NOT IN,
EXISTS/NOT EXISTS,DISTINCT, BETWEEN, CREATE TABLE/VIEW AS ….
• Scalar and correlated sub queries
• Metadata discovery using INFORMATION_SCHEMA
• Support for datasets that do not fit in memory
Packaging/install
• Works on all Hadoop distributions
• Easy ramp up with embedded/standalone
mode
– Try out Drill easily on your machine
– No Hadoop requirement
© MapR Technologies, confidential
Under the Hood
High Level Architecture
• Drillbits run on each node, designed to maximize data locality
• Drill includes a distributed execution environment built specifically for
distributed query processing
• Any Drillbit can act as endpoint for particular query.
• Zookeeper maintains ephemeral cluster membership information only
• Small distributed cache utilizing embedded Hazelcast maintains information
about individual queue depth, cached query plans, metadata, locality
information, etc.
Zookeeper
Storage
Process
Storage
Process
Storage
Process
Drillbit
Distributed Cache
Drillbit
Distributed Cache
Drillbit
Distributed Cache
Basic query flow
Zookeeper
DFS/HBase DFS/HBase DFS/HBase
Drillbit
Distributed Cache
Drillbit
Distributed Cache
Drillbit
Distributed Cache
Query
1. Query comes to any Drillbit (JDBC, ODBC, CLI)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Data is returned to driving node
Core Modules within a Drillbit
SQL Parser
Optimizer
PhysicalPlan
DFS
HBase
RPC Endpoint
Distributed Cache
StorageEngineInterface
LogicalPlan
Execution
Hive
Query Execution
• Source query—what we want to do (analyst
friendly)
• Logical Plan— what we want to do (language
agnostic, computer friendly)
• Physical Plan—how we want to do it (the best
way we can tell)
• Execution Plan—where we want to do it
A Query engine that is…
• Optimistic/pipelined
• Columnar/Vectorized
• Runtime compiled
• Late binding
• Extensible
Optimistic Execution
• With a short time horizon, failures infrequent
– Don’t spend energy and time creating boundaries
and checkpoints to minimize recovery time
– Rerun entire query in face of failure
• No barriers
• No persistence unless memory overflow
Runtime Compilation
• Give JIT help
• Avoid virtual method invocation
• Avoid heap allocation and object overhead
• Minimize memory overhead
Record versus Columnar
Representation
Record Column
Data Format Example
Donut Price Icing
Bacon Maple Bar 2.19 [Maple Frosting, Bacon]
Portland Cream 1.79 [Chocolate]
The Loop 2.29 [Vanilla, Fruitloops]
Triple Chocolate
Penetration
2.79 [Chocolate, Cocoa Puffs]
Record Encoding
Bacon Maple Bar, 2.19, Maple Frosting, Bacon, Portland Cream, 1.79, Chocolate
The Loop, 2.29, Vanilla, Fruitloops, Triple Chocolate Penetration, 2.79, Chocolate,
Cocoa Puffs
Columnar Encoding
Bacon Maple Bar, Portland Cream, The Loop, Triple Chocolate Penetration
2.19, 1.79, 2.29, 2.79
Maple Frosting, Bacon, Chocolate, Vanilla, Fruitloops, Chocolate, Cocoa Puffs
Example: RLE and Sum
• Dataset
– 2, 4
– 8, 10
• Goal
– Sum all the records
• Normal Work
– Decompress & store: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
– Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8
• Optimized Work
– 2 * 4 + 8 * 10
– Less Memory, less operations
Record Batch
• Drill optimizes for BOTH columnar
STORAGE and Execution
• Record Batch is unit of work for the
query system
– Operators always work on a batch of
records
• All values associated with a
particular collection of records
• Each record batch must have a single
defined schema
• Record batches are pipelined
between operators and nodes
RecordBatch
VV VV VV VV
RecordBatch
VV VV VV VV
RecordBatch
VV VV VV VV
Strengths of RecordBatch +
ValueVectors
• RecordBatch clearly delineates low overhead/high
performance space
– Record-by-record, avoid method invocation
– Batch-by-batch, trust JVM
• Avoid serialization/deserialization
• Off-heap means large memory footprint without GC woes
• Full specification combined with off-heap and batch-level
execution allows C/C++ operators as necessary
• Random access: sort without copy or restructuring
Late Schema Binding
• Schema can change over course of query
• Operators are able to reconfigure themselves
on schema change events
Integration and Extensibility points
• Support UDFs
– UDFs/UDAFs using high performance Java API
• Not Hadoop centric
– Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc.
– Build one distributed query engine together than per technology
• Built in classpath scanning and plugin concept to add additional
storage engines, function and operators with zero configuration
• Support direct execution of strongly specified JSON based logical
and physical plans
– Simplifies testing
– Enables integration of alternative query languages
Comparison with MapReduce
• Barriers
– Map completion required before shuffle/reduce
commencement
– All maps must complete before reduce can start
– In chained jobs, one job must finish entirely before the
next one can start
• Persistence and Recoverability
– Data is persisted to disk between each barrier
– Serialization and deserialization are required between
execution phase
STATUS
Status
• Heavy active development
• Significant community momentum
– ~15+ contributors
– 400+ people in Drill mailing lists
– 400+ members in Bay area Drill user group
• Current state : Alpha
• Timeline
1.0 Beta (End of Q2, 2014) 1.0 GA (Q3, 2014)
Interested in Apache Drill?
• Join the community
– Join the Drill mailing lists
• drill-user@incubator.apache.org
• drill-dev@incubator.apache.org
– Contribute
• Use cases/Sample queries, JIRAs, code, unit tests, documentation, ...
– Fork us on GitHub: http://github.com/apache/incubator-drill/
– Create a JIRA: https://issues.apache.org/jira/browse/DRILL
• Resources
– Try out Drill in 10mins
– http://incubator.apache.org/drill/
– https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki
DEMO

Contenu connexe

Tendances

Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoopCraig Jordan
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsStéphane Fréchette
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Conhecendo o Apache HBase
Conhecendo o Apache HBaseConhecendo o Apache HBase
Conhecendo o Apache HBaseFelipe Ferreira
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft PlatformAndrew Brust
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQLMichael Rys
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseAsis Mohanty
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentationmlang222
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 
SQL Server 2012 Beyond Relational Performance and Scale
SQL Server 2012 Beyond Relational Performance and ScaleSQL Server 2012 Beyond Relational Performance and Scale
SQL Server 2012 Beyond Relational Performance and ScaleMichael Rys
 
NoSQL: An Analysis
NoSQL: An AnalysisNoSQL: An Analysis
NoSQL: An AnalysisAndrew Brust
 
NoSQL and The Big Data Hullabaloo
NoSQL and The Big Data HullabalooNoSQL and The Big Data Hullabaloo
NoSQL and The Big Data HullabalooAndrew Brust
 

Tendances (19)

Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoop
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Conhecendo o Apache HBase
Conhecendo o Apache HBaseConhecendo o Apache HBase
Conhecendo o Apache HBase
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft Platform
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQL
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentation
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
SQL Server 2012 Beyond Relational Performance and Scale
SQL Server 2012 Beyond Relational Performance and ScaleSQL Server 2012 Beyond Relational Performance and Scale
SQL Server 2012 Beyond Relational Performance and Scale
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
NoSQL: An Analysis
NoSQL: An AnalysisNoSQL: An Analysis
NoSQL: An Analysis
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
NoSQL and The Big Data Hullabaloo
NoSQL and The Big Data HullabalooNoSQL and The Big Data Hullabaloo
NoSQL and The Big Data Hullabaloo
 

Similaire à Apache Drill at ApacheCon2014

No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightNilesh Gule
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
American family hadoop journey, uw ebc sig meeting, april 2015
American family hadoop journey, uw ebc sig meeting, april 2015American family hadoop journey, uw ebc sig meeting, april 2015
American family hadoop journey, uw ebc sig meeting, april 2015Craig Jordan
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars GeorgeJAX London
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxvishwasgarade1
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
hive architecture and hive components in detail
hive architecture and hive components in detailhive architecture and hive components in detail
hive architecture and hive components in detailHariKumar544765
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL David Smelker
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoopGeoff Hendrey
 

Similaire à Apache Drill at ApacheCon2014 (20)

Apache Drill
Apache DrillApache Drill
Apache Drill
 
Apache drill
Apache drillApache drill
Apache drill
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsight
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
American family hadoop journey, uw ebc sig meeting, april 2015
American family hadoop journey, uw ebc sig meeting, april 2015American family hadoop journey, uw ebc sig meeting, april 2015
American family hadoop journey, uw ebc sig meeting, april 2015
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
hive architecture and hive components in detail
hive architecture and hive components in detailhive architecture and hive components in detail
hive architecture and hive components in detail
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
 

Dernier

Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 

Dernier (20)

Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 

Apache Drill at ApacheCon2014

  • 1. Building Highly Flexible, High Performance query engines - Highlights from Apache Drill project Neeraja Rentachintala Director, Product Management MapR Technologies
  • 2. Agenda • Apache Drill overview • Using Drill • Under the Hood • Status and progress • Demo
  • 4. Hadoop workloads and APIs Use case ETL and aggregation (batch) Predictive modeling and analytics (batch) Interactive SQL – Data exploration, Adhoc queries & reporting Search Operational (user facing applications, point queries) API MapReduce Hive Pig Cascading Mahout MLLib Spark Drill Shark Impala Hive on Tez Presto Solr Elasticsear ch HBase API Phoenix
  • 5. Interactive SQL and Hadoop • Opens up Hadoop data to broader audience – Existing SQL skill sets – Broad eco system of tools • New and improved BI/Analytics use cases – Analysis on more raw data, new types of data and real time data • Cost savings Enterprise users
  • 6. Data landscape is changing New types of applications • Social, mobile, Web, “Internet of Things”, Cloud… • Iterative/Agile in nature • More users, more data New data models & data types • Flexible (schema-less) data • Rapidly changing • Semi-structured/Nested data { "data": [ "id": "X999_Y999", "from": { "name": "Tom Brady", "id": "X12" }, "message": "Looking forward to 2014!", "actions": [ { "name": "Comment", "link": "http://www.facebook.com/X99/posts Y999" }, { "name": "Like", "link": "http://www.facebook.com/X99/posts Y999" } ], "type": "status", "created_time": "2013-08-02T21:27:44+0000", "updated_time": "2013-08-02T21:27:44+0000" } } JSON
  • 7. Traditional datasets • Comes from transactional applications • Stored for historical purposes and/or for large scale ETL/Analytics • Well defined schemas • Managed centrally by DBAs • No frequent changes to schema • Flat datasets New datasets • Comes from new applications (Ex: Social feeds, clickstream, logs, sensor data) • Enable new use cases such as Customer Satisfaction, Product/Service optimization • Flexible data models/managed within applications • Schemas evolving rapidly • Semi-structured/Nested data Hadoop evolving as central hub for analysis ProvidesCosteffective,flexiblewaytostoreandandprocessdataat scale
  • 8. Existing SQL approaches will not always work for big data needs • New data models/types don’t map well to the relational models – Many data sources do not have rigid schemas (HBase, Mongo etc) • Each record has a separate schema • Sparse and wide rows – Flattening nested data is error-prone and often impossible • Think about repeated and optional fields at every level… • A single HBase value could be a JSON document (compound nested type) • Centralized schemas are hard to manage for big data • Rapidly evolving data source schemas • Lots of new data sources • Third party data • Unknown questions Model data Move data into traditional systems New questions /requirements Schema changes or new data sources DBA/DWH teams Analyze Big data Enterprise Users
  • 9. Apache Drill Open Source SQL on Hadoop for Agility with Big Data exploration FLEXIBLE SCHEMA MANAGEMENT ANALYTICS ON NOSQL DATA PLUG AND PLAY WITH EXISTING TOOLS Analyze data with or without centralized schemas Analyze data using familiar BI/Analytics and SQL based tools Analyze semi structured & nested data with no modeling/ETL
  • 10. Flexible schema management { “ID”: 1, “NAME”: “Fairmont San Francisco”, “DESCRIPTION”: “Historic grandeur…”, “AVG_REVIEWER_SCORE”: “4.3”, “AMENITY”: {“TYPE”: “gym”, DESCRIPTION: “fitness center” }, {“TYPE”: “wifi”, “DESCRIPTION”: “free wifi”}, “RATE_TYPE”: “nightly”, “PRICE”: “$199”, “REVIEWS”: [“review_1”, “review_2”], “ATTRACTIONS”: “Chinatown”, } JSON Existing SQL solutions X HotelID AmenityID 1 1 1 2 ID Type Descript ion 1 Gym Fitness center 2 Wifi Free wifi
  • 11. Drill { “ID”: 1, “NAME”: “Fairmont San Francisco”, “DESCRIPTION”: “Historic grandeur…”, “AVG_REVIEWER_SCORE”: “4.3”, “AMENITY”: {“TYPE”: “gym”, DESCRIPTION: “fitness center” }, {“TYPE”: “wifi”, “DESCRIPTION”: “free wifi”}, “RATE_TYPE”: “nightly”, “PRICE”: “$199”, “REVIEWS”: [“review_1”, “review_2”], “ATTRACTIONS”: “Chinatown”, } JSON Drill Flexible schema management HotelID AmenityID 1 1 1 2 ID Type Descript ion 1 Gym Fitness center 2 Wifi Free wifi Drill doesn’t require any schema definitions to query data making it faster to get insights from data for users. Drill leverages schema definitions if exists.
  • 12. Key features • Dynamic/schema-less queries • Nested data • Apache Hive integration • ANSI SQL/BI tool integration
  • 13. Querying files • Direct queries on a local or a distributed file system (HDFS, S3 etc) • Configure one or more directories in file system as “Workspaces” – Think of this as similar to schemas in databases – Default workspace points to “root” location • Specify a single file or a directory as ‘Table’ within query • Specify schema in query or let Drill discover it • Example: • SELECT * FROM dfs.users.`/home/mapr/sample-data/profiles.json` dfs File system as data source users Workspace (corresponds to a directory) /home/mapr/sample- data/profiles.json Table
  • 14. More examples • Query on single file SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan/part0001.txt` • Query on directory SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan` where errorLevel=1; • Joins on files SELECT c.c_custkey,sum(o.o_totalprice) FROM dfs.`/home/mapr/tpch/customer.parquet` c JOIN dfs.`/home/mapr/tpch/orders.parquet` o ON c.c_custkey = o.o_custkey GROUP BY c.c_custkey LIMIT 10
  • 15. Querying HBase • Direct queries on HBase tables – SELECT row_key, cf1.month, cf1.year FROM hbase.table1; – SELECT CONVERT_FROM(row_key, UTF-8) as HotelName from FROM HotelData • No need to define a parallel/overlay schema in Hive • Encode and Decode data from HBase using Convert functions – Convert_To and Convert_From
  • 16. Nested data • Nested data as first class entity: Extensions to SQL for nested data types, similar to BigQuery • No upfront flattening/modeling required • Generic architecture for a broad variety of nested data types (eg:JSON, BSON, XML, AVRO, Protocol Buffers) • Performance with ground up design for nested data • Example: SELECT c.name, c.address, REPEATED_COUNT(c.children) FROM( SELECT CONVERT_FROM(cf1.user-json-blob, JSON) AS c FROM hbase.table1 )
  • 17. Apache Hive integration • Plug and Play integration in existing Hive deployments • Use Drill to query data in Hive tables/views • Support to work with more than one Hive metastore • Support for all Hive file formats • Ability to use Hive UDFs as part of Drill queries Hive meta store Files HBase Hive SQL layer Drill SQL layer + execution engine MapReduce execution framework
  • 18. Cross data source queries • Combine data from Files, HBase, Hive in one query • No central metadata definitions necessary • Example: – USE HiveTest.CustomersDB – SELECT Customers.customer_name, SocialData.Tweets.Count FROM Customers JOIN HBaseCatalog.SocialData SocialData ON Customers.Customer_id = Convert_From(SocialData.rowkey, UTF-8)
  • 19. BI tool integration • Standard JDBC/ODBC drivers • Integration Tableau, Excel, Microstrategy, Toad, SQuirreL...
  • 20. SQL support • ANSI SQL compatibility – “SQL Like” not enough • SQL data types – SMALLINT, BIGINT, TINYINT, INT, FLOAT, DOUBLE,DATE, TIMESTAMP, DECIMAL, VARCHAR, VARBINARY …. • All common SQL constructs • SELECT, GROUP BY, ORDER BY, LIMIT, JOIN, HAVING, UNION, UNION ALL, IN/NOT IN, EXISTS/NOT EXISTS,DISTINCT, BETWEEN, CREATE TABLE/VIEW AS …. • Scalar and correlated sub queries • Metadata discovery using INFORMATION_SCHEMA • Support for datasets that do not fit in memory
  • 21. Packaging/install • Works on all Hadoop distributions • Easy ramp up with embedded/standalone mode – Try out Drill easily on your machine – No Hadoop requirement
  • 22. © MapR Technologies, confidential Under the Hood
  • 23. High Level Architecture • Drillbits run on each node, designed to maximize data locality • Drill includes a distributed execution environment built specifically for distributed query processing • Any Drillbit can act as endpoint for particular query. • Zookeeper maintains ephemeral cluster membership information only • Small distributed cache utilizing embedded Hazelcast maintains information about individual queue depth, cached query plans, metadata, locality information, etc. Zookeeper Storage Process Storage Process Storage Process Drillbit Distributed Cache Drillbit Distributed Cache Drillbit Distributed Cache
  • 24. Basic query flow Zookeeper DFS/HBase DFS/HBase DFS/HBase Drillbit Distributed Cache Drillbit Distributed Cache Drillbit Distributed Cache Query 1. Query comes to any Drillbit (JDBC, ODBC, CLI) 2. Drillbit generates execution plan based on query optimization & locality 3. Fragments are farmed to individual nodes 4. Data is returned to driving node
  • 25. Core Modules within a Drillbit SQL Parser Optimizer PhysicalPlan DFS HBase RPC Endpoint Distributed Cache StorageEngineInterface LogicalPlan Execution Hive
  • 26. Query Execution • Source query—what we want to do (analyst friendly) • Logical Plan— what we want to do (language agnostic, computer friendly) • Physical Plan—how we want to do it (the best way we can tell) • Execution Plan—where we want to do it
  • 27. A Query engine that is… • Optimistic/pipelined • Columnar/Vectorized • Runtime compiled • Late binding • Extensible
  • 28. Optimistic Execution • With a short time horizon, failures infrequent – Don’t spend energy and time creating boundaries and checkpoints to minimize recovery time – Rerun entire query in face of failure • No barriers • No persistence unless memory overflow
  • 29. Runtime Compilation • Give JIT help • Avoid virtual method invocation • Avoid heap allocation and object overhead • Minimize memory overhead
  • 31. Data Format Example Donut Price Icing Bacon Maple Bar 2.19 [Maple Frosting, Bacon] Portland Cream 1.79 [Chocolate] The Loop 2.29 [Vanilla, Fruitloops] Triple Chocolate Penetration 2.79 [Chocolate, Cocoa Puffs] Record Encoding Bacon Maple Bar, 2.19, Maple Frosting, Bacon, Portland Cream, 1.79, Chocolate The Loop, 2.29, Vanilla, Fruitloops, Triple Chocolate Penetration, 2.79, Chocolate, Cocoa Puffs Columnar Encoding Bacon Maple Bar, Portland Cream, The Loop, Triple Chocolate Penetration 2.19, 1.79, 2.29, 2.79 Maple Frosting, Bacon, Chocolate, Vanilla, Fruitloops, Chocolate, Cocoa Puffs
  • 32. Example: RLE and Sum • Dataset – 2, 4 – 8, 10 • Goal – Sum all the records • Normal Work – Decompress & store: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8 – Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 • Optimized Work – 2 * 4 + 8 * 10 – Less Memory, less operations
  • 33. Record Batch • Drill optimizes for BOTH columnar STORAGE and Execution • Record Batch is unit of work for the query system – Operators always work on a batch of records • All values associated with a particular collection of records • Each record batch must have a single defined schema • Record batches are pipelined between operators and nodes RecordBatch VV VV VV VV RecordBatch VV VV VV VV RecordBatch VV VV VV VV
  • 34. Strengths of RecordBatch + ValueVectors • RecordBatch clearly delineates low overhead/high performance space – Record-by-record, avoid method invocation – Batch-by-batch, trust JVM • Avoid serialization/deserialization • Off-heap means large memory footprint without GC woes • Full specification combined with off-heap and batch-level execution allows C/C++ operators as necessary • Random access: sort without copy or restructuring
  • 35. Late Schema Binding • Schema can change over course of query • Operators are able to reconfigure themselves on schema change events
  • 36. Integration and Extensibility points • Support UDFs – UDFs/UDAFs using high performance Java API • Not Hadoop centric – Work with other NoSQL solutions including MongoDB, Cassandra, Riak, etc. – Build one distributed query engine together than per technology • Built in classpath scanning and plugin concept to add additional storage engines, function and operators with zero configuration • Support direct execution of strongly specified JSON based logical and physical plans – Simplifies testing – Enables integration of alternative query languages
  • 37. Comparison with MapReduce • Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start • Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phase
  • 39. Status • Heavy active development • Significant community momentum – ~15+ contributors – 400+ people in Drill mailing lists – 400+ members in Bay area Drill user group • Current state : Alpha • Timeline 1.0 Beta (End of Q2, 2014) 1.0 GA (Q3, 2014)
  • 40. Interested in Apache Drill? • Join the community – Join the Drill mailing lists • drill-user@incubator.apache.org • drill-dev@incubator.apache.org – Contribute • Use cases/Sample queries, JIRAs, code, unit tests, documentation, ... – Fork us on GitHub: http://github.com/apache/incubator-drill/ – Create a JIRA: https://issues.apache.org/jira/browse/DRILL • Resources – Try out Drill in 10mins – http://incubator.apache.org/drill/ – https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki
  • 41. DEMO

Notes de l'éditeur

  1. Lot of workloads exist for Big data, batch, machine learning, search, interactive SQL, Operational/user facing applicationsApache Drill fits into the interactive SQL category
  2. Analytics on Semi-Structured/Nested dataUse standard SQL to query Nested data without upfront flattening/modelingExtensions to ANSI SQL to operate on nested dataGeneric architecture for a broad variety of nested data types (eg:JSON, BSON, XML, AVRO, Protocol Buffers)Performance with ground up design for Nested dataIn-memory columnar/hierarchical data model enabling SQL processing directly on Nested dataPush-down SQL functionality to HBase for query speed ups