Apache Drill
Interactive Analysis of Large-Scale Datasets
Ted Dunning
Chief Application Architect, MapR
My Background
• Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– Big data since before big
• Open source
– since the dark ages before the internet
– Mahout, Zookeeper, Drill
– bought the beer at first HUG
• MapR
• Founding member of Apache Drill
MapR Technologies
• The open enterprise-grade distribution for Hadoop
– Easy, dependable and fast
– Open source with standards-based extensions
• MapR is deployed at 1000’s of companies
– From small Internet startups to the world’s largest enterprises
• MapR customers analyze massive amounts of data:
– Hundreds of billions of events daily
– 90% of the world’s Internet population monthly
– $1 trillion in retail purchases annually
• MapR has partnered with Google to provide Hadoop on Google Compute Engine
Latency Matters
• Ad-hoc analysis with interactive tools
• Real-time dashboards
• Event/trend detection and analysis
– Network intrusions
– Fraud
– Failures
Big Data Processing
• Batch processing
– Query runtime: minutes to hours
– Data volume: TBs to PBs
– Programming model: MapReduce
– Users: developers
– Google project: MapReduce
– Open source project: Hadoop MapReduce
• Interactive analysis
– Query runtime: milliseconds to minutes
– Data volume: GBs to PBs
– Programming model: queries
– Users: analysts and developers
– Google project: Dremel
• Stream processing
– Query runtime: never-ending
– Data volume: continuous stream
– Programming model: DAG
– Users: developers
– Open source project: Storm and S4
Introducing Apache Drill…
GOOGLE DREMEL
Google Dremel
• Interactive analysis of large-scale datasets
– Trillion records at interactive speeds
– Complementary to MapReduce
– Used by thousands of Google employees
– Paper published at VLDB 2010
• Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
• Model
– Nested data model with schema
• Most data at Google is stored/transferred in Protocol Buffers
• Normalization (to relational) is prohibitive
– SQL-like query language with nested data support
• Implementation
– Column-based storage and processing
– In-situ data access (GFS and Bigtable)
– Tree architecture as in Web search (and databases)
Google BigQuery
• Hosted Dremel (Dremel as a Service)
• CLI (bq) and Web UI
• Import data from Google Cloud Storage or local files
– Files must be in CSV format
• Nested data not supported [yet] except built-in datasets
– Schema definition required
APACHE DRILL
Architecture
• Only the execution engine knows the physical attributes of the cluster
– # nodes, hardware, file locations, …
• Public interfaces enable extensibility
– Developers can build parsers for new query languages
– Developers can provide an execution plan directly
• Each level of the plan has a human-readable representation
– Facilitates debugging and unit testing
Architecture (2)
Execution Engine Layers
• Drill execution engine has two layers
– Operator layer is serialization-aware
• Processes individual records
– Execution layer is not serialization-aware
• Processes batches of records (blobs)
• Responsible for communication, dependencies and fault tolerance
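The split between the two layers can be illustrated with a minimal Python sketch. It assumes JSON as the serialization format, and the helper names (`encode_batch`, `operator_stage`, `execution_layer`) are hypothetical; none of this is taken from Drill itself. The operator stage decodes a batch and works record by record, while the execution layer only moves opaque blobs.

```python
import json
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def encode_batch(records: List[Record]) -> bytes:
    """Serialize a batch of records into one opaque blob (JSON is an assumption)."""
    return json.dumps(records).encode("utf-8")


def decode_batch(blob: bytes) -> List[Record]:
    """Inverse of encode_batch."""
    return json.loads(blob.decode("utf-8"))


def operator_stage(blob: bytes, fn: Callable[[Record], Record]) -> bytes:
    """Operator layer: serialization-aware, processes individual records."""
    return encode_batch([fn(record) for record in decode_batch(blob)])


def execution_layer(blobs: List[bytes],
                    stage: Callable[[bytes], bytes]) -> List[bytes]:
    """Execution layer: never looks inside a blob, it only routes batches."""
    return [stage(blob) for blob in blobs]


def double_followers(record: Record) -> Record:
    """A hypothetical per-record operator."""
    return {**record, "followers": record["followers"] * 2}


blobs = [
    encode_batch([{"followers": 100}]),
    encode_batch([{"followers": 200}, {"followers": 1}]),
]
out = execution_layer(blobs, lambda blob: operator_stage(blob, double_followers))
decoded = [decode_batch(blob) for blob in out]
```

In a real engine the execution layer would also handle communication, dependencies and fault tolerance; here it is reduced to a loop so that the boundary between the layers stays visible.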
Data Flow
Nested Query Languages
• DrQL
– SQL-like query language for nested data
– Compatible with Google BigQuery/Dremel
• BigQuery applications should work with Drill
– Designed to support efficient column-based processing
• No record assembly during query processing
• Mongo Query Language
– {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
• Other languages/programming models can plug in
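As a rough illustration of what the Mongo-style query above asks for, the following Python sketch applies an equality filter and an ascending sort to in-memory records. The function `run_mongo_style_query` and the sample records are hypothetical, and only equality matching and single-direction ordering are covered; this is neither Drill's nor MongoDB's implementation.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def run_mongo_style_query(records: List[Record],
                          spec: Dict[str, Any]) -> List[Record]:
    """Evaluate a tiny subset of a Mongo-style spec: equality $query, $orderby."""
    query = spec.get("$query", {})
    orderby = spec.get("$orderby", {})

    # Keep records whose fields equal every value in $query.
    # A record that lacks a queried field never matches.
    missing = object()
    matched = [r for r in records
               if all(r.get(k, missing) == v for k, v in query.items())]

    # Sort by the last key first; Python's sort is stable, so the first key
    # ends up dominating. Assumes every matched record has the sort fields.
    for field, direction in reversed(list(orderby.items())):
        matched.sort(key=lambda r: r[field], reverse=(direction == -1))
    return matched


records = [
    {"x": 3, "y": "abc", "z": 2},
    {"x": 3, "y": "abc", "z": 1},
    {"x": 4, "y": "abc"},
    {"y": "abc"},
]
spec = {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
result = run_mongo_style_query(records, spec)
```

Because both matching records have the same `x`, the stable sort leaves them in their input order.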
Nested Data Model
• The data model in Dremel is Protocol Buffers
– Nested
– Schema
• Apache Drill is designed to support multiple data models
– Schema: Protocol Buffers, Apache Avro, …
– Schema-less: JSON, BSON, …
• Flat records are supported as a special case of nested data
– CSV, TSV, …
JSON:
{
  "name": "Srivas",
  "gender": "Male",
  "followers": 100
}
{
  "name": "Raina",
  "gender": "Female",
  "followers": 200,
  "zip": "94305"
}
Avro IDL:
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}
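The following Python sketch shows one way to read the two points above: schema-less records may carry different fields, and a flat CSV row is simply a nested record of depth one. The helper names and the CSV sample are hypothetical and only the standard library is used; it makes no claim about how Drill represents records.

```python
import csv
import io
import json
from typing import Any, Dict, List

Record = Dict[str, Any]

json_text = """\
{"name": "Srivas", "gender": "Male", "followers": 100}
{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}
"""
csv_text = "name,gender,followers\nSrivas,Male,100\n"


def read_json_records(text: str) -> List[Record]:
    """Schema-less input: each line is one self-describing record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_csv_records(text: str) -> List[Record]:
    """Flat input: each row becomes a record with no nesting.

    Without a schema, every CSV value stays a string.
    """
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def field_names(records: List[Record]) -> List[str]:
    """Discover the fields from the data itself, in first-seen order."""
    seen: List[str] = []
    for record in records:
        for key in record:
            if key not in seen:
                seen.append(key)
    return seen


json_records = read_json_records(json_text)
csv_records = read_csv_records(csv_text)
discovered = field_names(json_records)
```

The `zip` field appears only in the second JSON record, so a schema-less reader has to discover it while scanning, whereas a schema such as the Avro IDL definition would fix the field list up front.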
DrQL Example
SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http')
  AND DocId < 20;
* Example from the Dremel paper
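To make the `WITHIN` aggregation more concrete, here is a minimal Python sketch that evaluates the same query over a single in-memory nested record. The sample record is illustrative (modeled on the example document in the Dremel paper), the function `evaluate` is hypothetical, and the sketch assembles whole records, which is exactly what a column-based engine tries to avoid. It only shows the intended semantics.

```python
import re
from typing import Any, Dict, Optional

# Illustrative nested record with a repeated Name group and a repeated
# Language group inside it.
r1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}


def evaluate(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Apply the SELECT/WHERE of the DrQL example to one record."""
    # WHERE DocId < 20
    if not record["DocId"] < 20:
        return None
    names_out = []
    for name in record.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): prune Name groups with no matching Url.
        if url is None or not re.search(r"^http", url):
            continue
        codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
        names_out.append({
            # COUNT(Name.Language.Code) WITHIN Name: one count per Name group.
            "Cnt": len(codes),
            # Name.Url + ',' + Name.Language.Code: one string per Language.
            "Language": [{"Str": url + "," + code} for code in codes],
        })
    return {"Id": record["DocId"], "Name": names_out}


result = evaluate(r1)
```

The third `Name` group has no `Url`, so the filter prunes it; the second has a `Url` but no languages, so its count is zero.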
Query Components
• Query components:
– SELECT
– FROM
– WHERE
– GROUP BY
– HAVING
– (JOIN)
• Key logical operators:
– Scan
– Filter
– Aggregate
– (Join)
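The mapping from query clauses to logical operators can be sketched in Python as a chain of Scan, Filter and Aggregate steps. The operator functions and the sample data are hypothetical, and the SQL in the comment is only an illustration of the kind of query such a plan would serve.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]


def scan(source: Iterable[Record]) -> Iterator[Record]:
    """Scan (FROM): produce records one at a time from a data source."""
    for record in source:
        yield record


def filter_op(records: Iterable[Record],
              predicate: Callable[[Record], bool]) -> Iterator[Record]:
    """Filter (WHERE): pass through only the records that satisfy the predicate."""
    return (record for record in records if predicate(record))


def aggregate(records: Iterable[Record], key: str, value: str) -> Dict[Any, int]:
    """Aggregate (GROUP BY with SUM): total `value` per distinct `key`."""
    totals: Dict[Any, int] = defaultdict(int)
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)


users = [
    {"name": "Srivas", "gender": "Male", "followers": 100},
    {"name": "Raina", "gender": "Female", "followers": 200},
    {"name": "Asha", "gender": "Female", "followers": 50},
    {"name": "Ben", "gender": "Male", "followers": 300},
]

# Roughly: SELECT gender, SUM(followers) FROM users
#          WHERE followers >= 100 GROUP BY gender
totals = aggregate(
    filter_op(scan(users), lambda r: r["followers"] >= 100),
    key="gender",
    value="followers",
)
```

Because the first two operators are generators, records stream through the plan and only the aggregate holds state.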
Extensibility
• Nested query languages
– Pluggable model
– DrQL
– Mongo Query Language
– Cascading
• Distributed execution engine
– Extensible model (e.g., Dryad)
– Low-latency
– Fault tolerant
• Nested data formats
– Pluggable model
– Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
– Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
• Scalable data sources
– Pluggable model
– Hadoop
– HBase
Scan Operators
• Scan with schema
– Operator output: Protocol Buffers
– Supported data formats: ColumnIO (column-based protobuf/Dremel), RecordIO (row-based protobuf), CSV
– SELECT … FROM …: ColumnIO(proto URI, data URI), RecordIO(proto URI, data URI)
• Scan without schema
– Operator output: JSON-like (MessagePack)
– Supported data formats: JSON, HBase
– SELECT … FROM …: Json(data URI), HBase(table name)
• Drill supports multiple data formats by having per-format scan operators
• Queries involving multiple data formats/sources are supported
• Fields and predicates can be pushed down into the scan operator
• Scan operators may have adaptive side-effects (database cracking)
• Produce ColumnIO from RecordIO
• Google PowerDrill stores materialized expressions with the data
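Pushing fields and predicates into a scan operator can be sketched as follows. This is a minimal Python illustration for a schema-less JSON scan; the function `json_scan` and its parameters are hypothetical and are not Drill's scan interface.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Sequence

Record = Dict[str, Any]


def json_scan(lines: Iterable[str],
              fields: Optional[Sequence[str]] = None,
              predicate: Optional[Callable[[Record], bool]] = None
              ) -> Iterator[Record]:
    """Per-format scan operator for line-delimited JSON.

    `predicate` and `fields` are pushed down: rows are dropped and columns
    are trimmed inside the scan, so later operators never see them.
    """
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if predicate is not None and not predicate(record):
            continue
        if fields is not None:
            # Schema-less data: a requested field may be absent in some records.
            record = {f: record[f] for f in fields if f in record}
        yield record


lines = [
    '{"name": "Srivas", "gender": "Male", "followers": 100}',
    '{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}',
]

popular = list(json_scan(lines,
                         fields=["name", "zip"],
                         predicate=lambda r: r["followers"] > 100))
projected = list(json_scan(lines, fields=["name", "zip"]))
```

A scan over another format would keep the same signature and differ only in how it decodes records, which is the point of having one scan operator per format.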
Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)
Hadoop Integration
• Hadoop data sources
– Hadoop FileSystem API (HDFS/MapR-FS)
– HBase
• Hadoop data formats
– Apache Avro
– RCFile
• MapReduce-based tools to create column-based formats
• Table registry in HCatalog
• Run long-running services in YARN
Get Involved!
• Download (almost) these slides
– http://www.mapr.com/company/events/bay-area-hug/9-19-2012
• Join the project
– drill-dev-subscribe@incubator.apache.org #apachedrill
• Contact me:
– tdunning@maprtech.com
– tdunning@apache.org
– ted.dunning@maprtech.com
– @ted_dunning
• Join MapR
– jobs@mapr.com

Drill at the Chicago Hug


Editor's notes

  1. Drill: remove the schema requirement. In-situ for real, since we’ll support multiple formats. Note: MR is needed for big joins, so to speak.
  2. Drill will support nested data. No schema required.
  3. Loading data into Drill is optional; could just use it as is in “row” format. Multiple query languages. Pluggability is very important.
  4. Likely to support these. Could add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon the query. Pig as well. Pluggability: data format, query language. Something of alpha quality in 6-9 months. Community driven; I can’t speak for the project. MapRFS gives better chunk size control. NFS support may make small test drivers easier. Unified namespace will allow multi-cluster access. Might even have a Drill component that autoformats data. Read-only model.
  5. Protocol Buffers are the conceptual data model. Will support multiple data models. Will have to define a way to explain the data format (filtering, fields, etc.). Schema-less will have a performance penalty. HBase will be one format.
  6. Example query that Drill should support. Need to talk more here about what Dremel does.
  7. Note: we have an already partially built execution engine.
  8. Be prepared for Apache questions: committer vs. committee vs. contributor. If you can’t answer a question, ask them to answer it and contribute. Lisa: need a landing page. References to the paper and such at the end.