1. Introduction to Apache Drill
Michael Hausenblas, Chief Data Engineer EMEA, MapR
6th Swiss Big Data User Group Meeting, Zurich, 2013-03-25
  
2. Kudos to http://cmx.io/
  
3. Workloads
• Batch processing (MapReduce)
• Light-weight OLTP (HBase, Cassandra, etc.)
• Stream processing (Storm, S4)
• Search (Solr, Elasticsearch)
• Interactive, ad-hoc query and analysis (?)
  
4. Interactive Query at Scale
[Figure labels: Impala, low-latency]
  
5. Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources
  
6. Use Case II
• Logistics – supplier status
• Queries
  – How many shipments from supplier X?
  – How many shipments in region Y?

SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia

{
  "shipment": 100123,
  "supplier": "ACM",
  "timestamp": "2013-02-01",
  "description": "first delivery today"
},
{
  "shipment": 100124,
  "supplier": "BAP",
  "timestamp": "2013-02-02",
  "description": "hope you enjoy it"
}
…
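
The two questions above amount to a join between the supplier table and the shipment records, followed by a count. Here is a minimal Python sketch using only the sample rows shown on the slide. The function names `shipments_from_supplier` and `shipments_in_region` are illustrative and not part of Drill.

```python
from collections import Counter

# Supplier table from the slide, keyed by SUPPLIER_ID.
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# The two shipment records shown on the slide (the slide elides the rest).
SHIPMENTS = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_from_supplier(shipments, supplier_id):
    """Count shipments whose 'supplier' field equals supplier_id."""
    return sum(1 for s in shipments if s["supplier"] == supplier_id)


def shipments_in_region(shipments, suppliers, region):
    """Join shipments to suppliers on the supplier ID, then count by region."""
    per_region = Counter(
        suppliers[s["supplier"]]["region"]
        for s in shipments
        if s["supplier"] in suppliers  # skip shipments with unknown suppliers
    )
    return per_region[region]


print(shipments_from_supplier(SHIPMENTS, "ACM"))
print(shipments_in_region(SHIPMENTS, SUPPLIERS, "Europe"))
```

In the scenario on the slide the two inputs live in different systems, which is why the join is the costly part in practice.
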

7. Today’s Solutions
• RDBMS-focused
  – ETL data from MongoDB and Hadoop
  – Query data using SQL
• MapReduce-focused
  – ETL from RDBMS and MongoDB
  – Use Hive, etc.
  
8. Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
  
9. Google’s Dremel
http://research.google.com/pubs/pub36632.html
  
10. Apache Drill Overview
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Other query languages possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, 100s involved
  
11. Apache Drill Overview
  
12. High-level Architecture
  
13. High-level Architecture
• Each node: Drillbit - maximize data locality
• Co-ordination, query planning, execution, etc., are distributed
• By default Drillbits hold all roles
• Any node can act as endpoint for a query
[Diagram: four nodes, each with a Drillbit and a storage process]
  
14. High-level Architecture
• Zookeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Zookeeper above four nodes, each with a Drillbit, a storage process, and a distributed cache]
  
15. High-level Architecture
• Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
[Diagram: same cluster layout as on slide 14]
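
To illustrate the foreman's role, the sketch below assigns scan work so that each block is read on a node that already stores it. This is a toy model and not Drill's scheduler: the block-to-node map, the node names, and `assign_blocks` are all hypothetical.

```python
def assign_blocks(block_locations):
    """Assign each block to the least-loaded node that holds a replica.

    block_locations maps a block ID to the list of nodes storing it.
    Returns a dict mapping node -> list of blocks it should scan.
    """
    assignment = {}
    for block in sorted(block_locations):
        replicas = block_locations[block]
        # Prefer the replica holder with the fewest blocks so far;
        # ties are broken by node name to keep the result deterministic.
        node = min(replicas, key=lambda n: (len(assignment.get(n, [])), n))
        assignment.setdefault(node, []).append(block)
    return assignment


# Hypothetical cluster: three blocks replicated across three nodes.
blocks = {
    "b1": ["node-a", "node-b"],
    "b2": ["node-a", "node-c"],
    "b3": ["node-a", "node-b"],
}
plan = assign_blocks(blocks)
for node in sorted(plan):
    print(node, plan[node])
```

Every block in this example is scanned locally, and the load is spread evenly even though `node-a` holds a replica of all three blocks.
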
  
16. Principled Query Execution
Pipeline: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
Source query languages: SQL 2003, DrQL, MongoQL, DSL
Diagram annotations: parser API, scanner API, topology
Logical plan excerpt:
query: [
  {
    @id: "log",
    op: "sequence",
    do: [
      {
        op: "scan",
        source: "logs"
      },
      {
        op: "filter",
        condition: "x > 3"
      },
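
The logical plan excerpt above chains a scan and a filter inside a sequence. The sketch below shows how such a plan could be evaluated over in-memory records. It is an illustration only, not Drill's reference interpreter: the `SOURCES` data, the `run_plan` function, and the restriction of conditions to the form `field op number` are assumptions.

```python
import operator

# Hypothetical in-memory data standing in for the "logs" source.
SOURCES = {"logs": [{"x": 1}, {"x": 4}, {"x": 7}, {"x": 3}]}

# Comparison operators accepted in a condition such as "x > 3".
COMPARATORS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def parse_condition(condition):
    """Turn 'field op number' into a predicate over a record (a dict)."""
    field, op, literal = condition.split()
    compare = COMPARATORS[op]
    value = float(literal)
    return lambda record: field in record and compare(record[field], value)


def run_plan(plan, sources):
    """Evaluate a 'sequence' of 'scan' and 'filter' operators in order."""
    if plan["op"] != "sequence":
        raise ValueError("expected a top-level sequence operator")
    records = []
    for step in plan["do"]:
        if step["op"] == "scan":
            records = list(sources[step["source"]])
        elif step["op"] == "filter":
            predicate = parse_condition(step["condition"])
            records = [r for r in records if predicate(r)]
        else:
            raise ValueError(f"unsupported operator: {step['op']}")
    return records


plan = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}
print(run_plan(plan, SOURCES))
```

Because the plan is plain data, any front end that can produce this structure could feed the same evaluator, which is the point of separating the parser from the logical plan.
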
  
17. Drillbit Modules
[Diagram labels: RPC Endpoint, SQL, HiveQL, Pig, Parser, Logical Plan, Optimizer, Physical Plan, Foreman, Scheduler, Operators, Distributed Cache, Storage Engine Interface, DFS Engine, HBase Engine, Mongo]
  
18. Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points
  
19. Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
  – Datameer, Tableau, Excel, SAP Crystal Reports
  – Use standard ODBC/JDBC driver
  
20. Nested Data
• Nested data becoming prevalent
  – JSON/BSON, XML, ProtoBuf, Avro
  – Some data sources support it natively (MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
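
One reason flattening is error-prone: a nested array turns one record into several rows, so parent-level values are repeated and naive aggregates over them are inflated. Here is a short Python sketch using the donut record shape from the demo data in this deck. The `flatten` helper is illustrative and not a Drill function.

```python
donut = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {
        "batter": [
            {"id": "1001", "type": "Regular"},
            {"id": "1002", "type": "Chocolate"},
        ]
    },
}


def flatten(record):
    """Emit one flat row per nested batter, repeating the parent fields."""
    rows = []
    for batter in record["batters"]["batter"]:
        rows.append({
            "id": record["id"],
            "type": record["type"],
            "ppu": record["ppu"],
            "batter_id": batter["id"],
            "batter_type": batter["type"],
        })
    return rows


rows = flatten(donut)
print(len(rows))                    # one donut became two rows
print(sum(r["ppu"] for r in rows))  # ppu is now counted once per batter
```

A record with an empty `batter` list would produce no rows at all under this helper, which is a second way flattening can silently change results.
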
  
21. Optional Schema
• Many data sources don’t have rigid schemas
  – Schema changes rapidly
  – Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or obtained via discovery
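
Schema discovery can be pictured as a scan that records which fields occur and with which types. The sketch below is a generic illustration in Python, not Drill's mechanism. `discover_schema` and the sample records are made up for the example.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical records whose fields differ from one record to the next.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
    {"shipment": "100125", "notes": "ID arrived as a string"},
]

schema = discover_schema(records)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

A field that maps to more than one type, such as `shipment` here, is the kind of per-record variation that a fixed up-front schema cannot express.
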
  
22. Extensibility Points
• Source query – parser API
• Custom operators, UDF – logical plan
• Optimizer
• Data sources and formats – scanner API
Pipeline: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
  
23. … and Hadoop?
• HDFS can be a data source
• Complementary use cases …
• … use Apache Drill
  – Find record with specified condition
  – Aggregation under dynamic conditions
• … use MapReduce
  – Data mining with multiple iterations
  – ETL
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  
24. Example
https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

Data source: donuts.json
{
  "id": "0001",
  "type": "donut",
  "ppu": 0.55,
  "batters":
  {
    "batter":
    [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      …

Logical plan: simple_plan.json
query:[ {
  op:"sequence",
  do:[
    {
      op: "scan",
      ref: "donuts",
      source: "local-logs",
      selection: {data: "activity"}
    },
    {
      op: "filter",
      expr: "donuts.ppu < 2.00"
    },
    …

Result: out.json
{
  "sales" : 700.0,
  "typeCount" : 1,
  "quantity" : 700,
  "ppu" : 1.0
}
{
  "sales" : 109.71,
  "typeCount" : 2,
  "quantity" : 159,
  "ppu" : 0.69
}
{
  "sales" : 184.25,
  "typeCount" : 2,
  "quantity" : 335,
  "ppu" : 0.55
}

25. Status
• Heavy development by multiple organizations
• Available
  – Logical plan (ADSP)
  – Reference interpreter
  – Basic SQL parser
  – Basic demo
  – Basic HBase back-end
  
26. Status
March/April
• Larger SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution focused on large cluster high performance sort, aggregation and join
  
27. Contributing
• Dremel-inspired columnar format: Twitter’s Parquet and Hive’s ORC file
• Integration with Hive metastore (?)
• DRILL-13 Storage Engine: Define Java Interface
• DRILL-15 Build HBase storage engine implementation
  
28. Contributing
• DRILL-48 RPC interface for query submission and physical plan execution
• DRILL-53 Setup cluster configuration and membership mgmt system
  – ZK for coordination
  – Helix for partition and resource assignment (?)
• Further schedule
  – Alpha Q2
  – Beta Q3
  
29. Kudos to …
• Julian Hyde, Pentaho
• Timothy Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR
  
30. Engage!
• Follow @ApacheDrill on Twitter
• Sign up at mailing lists (user | dev)
  http://incubator.apache.org/drill/mailing-lists.html
• Learn where and how to contribute
  https://cwiki.apache.org/confluence/display/DRILL/Contributing
• Keep an eye on http://drill-user.org/
  
