SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
Building Highly Flexible, High Performance
query engines -
Highlights from Apache Drill project
	
  	
  	
  	
  	
  	
  	
  Neeraja	
  Rentachintala	
  
	
  	
  	
  	
  	
  	
  	
  Director,	
  Product	
  Management	
  
	
  MapR	
  Technologies	
  
Agenda	
  
•  Apache	
  Drill	
  overview	
  
•  Using	
  Drill	
  
•  Under	
  the	
  Hood	
  
•  Status	
  and	
  progress	
  
•  Demo	
  
APACHE	
  DRILL	
  OVERVIEW	
  
Hadoop	
  workloads	
  and	
  APIs	
  
Use	
  case	
   ETL	
  and	
  
aggregaDon	
  
(batch)	
  
PredicDve	
  
modeling	
  and	
  
analyDcs	
  
(batch)	
  
InteracDve	
  SQL	
  –	
  
Data	
  exploraDon,	
  
Adhoc	
  queries	
  &	
  
reporDng	
  
Search	
   OperaDonal	
  
(user	
  facing	
  
applicaDons,	
  
point	
  queries)	
  
API	
   MapReduce	
  
Hive	
  
Pig	
  
Cascading	
  
Mahout	
  
MLLib	
  
Spark	
  
Drill	
  	
  
Shark	
  
Impala	
  
Hive	
  on	
  Tez	
  
Presto	
  
Solr	
  
ElasDcsear
ch	
  
HBase	
  API	
  
Phoenix	
  
InteracDve	
  SQL	
  and	
  Hadoop	
  
•  Opens up Hadoop data to broader
audience
–  Existing SQL skill sets
–  Broad eco system of tools
•  New and improved BI/Analytics
use cases
–  Analysis on more raw data, new
types of data and real time data
•  Cost savings
	
  
Enterprise users
Data landscape is changing
New	
  types	
  of	
  applica<ons	
  
•  Social,	
  mobile,	
  Web,	
  “Internet	
  of	
  
Things”,	
  Cloud…	
  
•  IteraDve/Agile	
  in	
  nature	
  
•  More	
  users,	
  more	
  data	
  
New	
  data	
  models	
  &	
  data	
  types	
  
•  Flexible	
  (schema-­‐less)	
  data	
  
•  Rapidly	
  changing	
  
•  Semi-­‐structured/Nested	
  data	
  
{	
  
	
  	
  	
  "data":	
  [	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  "id":	
  "X999_Y999",	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  "from":	
  {	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  "name":	
  "Tom	
  Brady",	
  "id":	
  "X12"	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  },	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  "message":	
  "Looking	
  forward	
  to	
  2014!",	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  "acDons":	
  [	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  {	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  "name":	
  "Comment",	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  "link":	
  "hhp://www.facebook.com/X99/posts	
  Y999"	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  },	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  {	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  "name":	
  "Like",	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  "link":	
  "hhp://www.facebook.com/X99/posts	
  Y999"	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  }	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  ],	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  "type":	
  "status",	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  "created_Dme":	
  "2013-­‐08-­‐02T21:27:44+0000",	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  "updated_Dme":	
  "2013-­‐08-­‐02T21:27:44+0000"	
  
	
  	
  	
  	
  	
  	
  }	
  
	
  	
  	
  	
  	
  	
  	
  }	
  
JSON	
  
Tradi<onal	
  datasets	
  
	
  
•  Comes	
  from	
  transacDonal	
  applicaDons	
  
•  Stored	
  for	
  historical	
  purposes	
  and/or	
  
for	
  large	
  scale	
  ETL/AnalyDcs	
  
•  Well	
  defined	
  schemas	
  
•  Managed	
  centrally	
  by	
  DBAs	
  
•  No	
  frequent	
  changes	
  to	
  schema	
  
•  Flat	
  datasets	
  
	
  
	
  
New	
  datasets	
  
•  Comes	
  from	
  new	
  applicaDons	
  (Ex:	
  Social	
  
feeds,	
  clickstream,	
  logs,	
  sensor	
  data)	
  
•  Enable	
  new	
  use	
  cases	
  such	
  as	
  Customer	
  
SaDsfacDon,	
  Product/Service	
  
opDmizaDon	
  
•  Flexible	
  data	
  models/managed	
  within	
  
applicaDons	
  
•  Schemas	
  evolving	
  rapidly	
  
•  Semi-­‐structured/Nested	
  data	
  
	
  
Hadoop evolving as central hub for analysis
Provides	
  Cost	
  effecDve,	
  flexible	
  way	
  to	
  store	
  and	
  and	
  process	
  data	
  at	
  	
  scale	
  
ExisDng	
  SQL	
  approaches	
  will	
  not	
  always	
  work	
  for	
  
big	
  data	
  needs	
  
•  New	
  data	
  models/types	
  don’t	
  map	
  well	
  to	
  the	
  relaDonal	
  models	
  
–  Many	
  data	
  sources	
  do	
  not	
  have	
  rigid	
  schemas	
  (HBase,	
  Mongo	
  etc)	
  
•  Each	
  record	
  has	
  a	
  separate	
  schema	
  
•  Sparse	
  and	
  wide	
  rows	
  
–  Flahening	
  nested	
  data	
  is	
  error-­‐prone	
  and	
  oten	
  impossible	
  
•  Think	
  about	
  repeated	
  and	
  opDonal	
  fields	
  at	
  every	
  level…	
  
•  A	
  single	
  HBase	
  value	
  could	
  be	
  a	
  JSON	
  document	
  (compound	
  nested	
  type)	
  
•  Centralized	
  schemas	
  are	
  hard	
  to	
  manage	
  for	
  big	
  data	
  
•  Rapidly	
  evolving	
  data	
  source	
  schemas	
  
•  Lots	
  of	
  new	
  data	
  sources	
  
•  Third	
  party	
  data	
  
•  Unknown	
  quesDons	
  
	
  	
  	
  	
  	
  Model	
  data	
  
	
  Move	
  data	
  into	
  
tradi<onal	
  systems	
  
New questions
/requirements
Schema changes or
new data sources
DBA/DWH teams
Analyze Big data
Enterprise Users
Apache	
  Drill	
  
Open	
  Source	
  SQL	
  on	
  Hadoop	
  for	
  Agility	
  with	
  Big	
  Data	
  explora<on	
  
FLEXIBLE	
  SCHEMA	
  
MANAGEMENT	
  
ANALYTICS	
  ON	
  
NOSQL	
  DATA	
  
PLUG	
  AND	
  PLAY	
  
WITH	
  EXISTING	
  
TOOLS	
  
Analyze	
  data	
  with	
  
or	
  without	
  
centralized	
  
schemas	
  
	
  
	
  
	
   Analyze	
  data	
  using	
  
familiar	
  BI/AnalyDcs	
  
and	
  SQL	
  based	
  tools	
  
Analyze	
  semi	
  
structured	
  &	
  
nested	
  data	
  with	
  
no	
  modeling/ETL	
  
Flexible	
  schema	
  management	
  
	
  {	
  
	
  	
  	
  	
  “ID”:	
  1,	
  
	
  	
  	
  	
  “NAME”:	
  “Fairmont	
  San	
  Francisco”,	
  
	
  	
  	
  	
  “DESCRIPTION”:	
  “Historic	
  grandeur…”,	
  
	
  	
  	
  	
  “AVG_REVIEWER_SCORE”:	
  “4.3”,	
  
	
  	
  	
  	
  “AMENITY”:	
  {“TYPE”:	
  “gym”,	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  DESCRIPTION:	
  “fitness	
  center”	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  },	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  {“TYPE”:	
  “wifi”,	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  “DESCRIPTION”:	
  “free	
  wifi”},	
  
	
  	
  	
  	
  “RATE_TYPE”:	
  “nightly”,	
  
	
  	
  	
  	
  	
  “PRICE”:	
  “$199”,	
  
	
  	
  	
  	
  	
  “REVIEWS”:	
  [“review_1”,	
  “review_2”],	
  
	
  	
  	
  	
  	
  “ATTRACTIONS”:	
  “Chinatown”,	
  
	
  	
  }	
  
JSON	
  
ExisDng	
  SQL	
  
soluDons	
   X
HotelID	
   AmenityID	
  
1	
   1	
  
1	
   2	
  
ID	
   Type	
   Descrip
Don	
  
1	
   Gym	
   Fitness	
  
center	
  
2	
   Wifi	
   Free	
  wifi	
  
Drill	
  
	
  {	
  
	
  	
  	
  	
  “ID”:	
  1,	
  
	
  	
  	
  	
  “NAME”:	
  “Fairmont	
  San	
  Francisco”,	
  
	
  	
  	
  	
  “DESCRIPTION”:	
  “Historic	
  grandeur…”,	
  
	
  	
  	
  	
  “AVG_REVIEWER_SCORE”:	
  “4.3”,	
  
	
  	
  	
  	
  “AMENITY”:	
  {“TYPE”:	
  “gym”,	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  DESCRIPTION:	
  “fitness	
  
center”	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  },	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  {“TYPE”:	
  “wifi”,	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  “DESCRIPTION”:	
  “free	
  wifi”},	
  
	
  	
  	
  	
  “RATE_TYPE”:	
  “nightly”,	
  
	
  	
  	
  	
  	
  “PRICE”:	
  “$199”,	
  
	
  	
  	
  	
  	
  “REVIEWS”:	
  [“review_1”,	
  “review_2”],	
  
	
  	
  	
  	
  	
  “ATTRACTIONS”:	
  “Chinatown”,	
  
	
  	
  }	
  
JSON	
  
Drill	
  
Flexible	
  schema	
  management	
  
HotelID	
   AmenityID	
  
1	
   1	
  
1	
   2	
  
ID	
   Type	
   Descrip
Don	
  
1	
   Gym	
   Fitness	
  
center	
  
2	
   Wifi	
   Free	
  wifi	
  
Drill	
  doesn’t	
  require	
  any	
  schema	
  defini<ons	
  to	
  query	
  data	
  making	
  it	
  faster	
  to	
  get	
  
insights	
  from	
  data	
  for	
  users.	
  Drill	
  leverages	
  schema	
  defini<ons	
  if	
  exists.	
  
Key	
  features	
  
•  Dynamic/schema-­‐less	
  queries	
  
•  Nested	
  data	
  
•  Apache	
  Hive	
  integraDon	
  
•  ANSI	
  SQL/BI	
  tool	
  integraDon	
  
	
  
	
  
Querying	
  files	
  
•  Direct	
  queries	
  on	
  a	
  local	
  or	
  a	
  distributed	
  file	
  system	
  (HDFS,	
  S3	
  
etc)	
  
•  Configure	
  one	
  or	
  more	
  directories	
  in	
  file	
  system	
  as	
  “Workspaces”	
  	
  
–  Think	
  of	
  this	
  as	
  similar	
  to	
  schemas	
  in	
  databases	
  
–  Default	
  workspace	
  points	
  to	
  “root”	
  locaDon	
  
•  Specify	
  a	
  single	
  file	
  or	
  a	
  directory	
  as	
  ‘Table’	
  within	
  query	
  
•  Specify	
  schema	
  in	
  query	
  or	
  let	
  Drill	
  discover	
  it	
  
•  Example:	
  
•  SELECT * FROM dfs.users.`/home/mapr/sample-data/profiles.json`!
!!
dfs	
   File	
  system	
  as	
  data	
  source	
  
users	
   Workspace	
  (corresponds	
  to	
  a	
  directory)	
  
/home/mapr/sample-data/
profiles.json!
Table	
  
More	
  examples	
  
•  Query	
  on	
  single	
  file	
  
SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan/part0001.txt`!
•  Query	
  on	
  directory	
  
SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan` where
errorLevel=1;!
•  Joins	
  on	
  files	
  
SELECT  c.c_custkey,sum(o.o_totalprice) !
FROM!
!dfs.`/home/mapr/tpch/customer.parquet` c !
!JOIN!
!dfs.`/home/mapr/tpch/orders.parquet` o!
!ON c.c_custkey = o.o_custkey!
GROUP BY c.c_custkey !
LIMIT 10!
Querying	
  HBase	
  
•  Direct	
  queries	
  on	
  HBase	
  tables	
  
–  SELECT row_key, cf1.month, cf1.year FROM
hbase.table1;!
–  SELECT CONVERT_FROM(row_key, UTF-8) as HotelName from
FROM HotelData	
  
•  No	
  need	
  to	
  define	
  a	
  parallel/overlay	
  schema	
  in	
  Hive	
  
•  Encode	
  and	
  Decode	
  data	
  from	
  HBase	
  using	
  Convert	
  funcDons	
  
–  Convert_To	
  and	
  Convert_From	
  
!
Nested	
  data	
  
•  Nested	
  data	
  as	
  first	
  class	
  enDty:	
  Extensions	
  to	
  SQL	
  for	
  nested	
  
data	
  types,	
  similar	
  to	
  BigQuery	
  	
  	
  
•  No	
  upfront	
  flahening/modeling	
  required	
  
•  Generic	
  architecture	
  for	
  a	
  broad	
  variety	
  of	
  nested	
  data	
  types	
  
(eg:JSON,	
  BSON,	
  XML,	
  AVRO,	
  Protocol	
  Buffers)	
  
•  Performance	
  with	
  ground	
  up	
  design	
  for	
  nested	
  data	
  
•  Example:	
  
SELECT!
!c.name, c.address, REPEATED_COUNT(c.children) !
FROM(!
SELECT!
! !CONVERT_FROM(cf1.user-json-blob, JSON) AS c !
FROM!
!hbase.table1!
)	
  
 
Apache	
  Hive	
  integraDon	
  
	
  
•  Plug	
  and	
  Play	
  integraDon	
  in	
  exisDng	
  
Hive	
  deployments	
  
•  Use	
  Drill	
  to	
  query	
  data	
  in	
  Hive	
  
tables/views	
  	
  
•  Support	
  to	
  work	
  with	
  more	
  than	
  
one	
  Hive	
  metastore	
  
•  Support	
  for	
  all	
  Hive	
  file	
  formats	
  
•  Ability	
  to	
  use	
  Hive	
  UDFs	
  as	
  part	
  of	
  
Drill	
  queries	
  
Hive	
  
meta	
  
store	
  
Files	
   HBase	
  
Hive	
  
SQL	
  layer	
  
	
   Drill	
  
SQL	
  layer	
  +	
  
execuDon	
  
engine	
  MapReduce	
  
execuDon	
  
framework	
  	
  
Cross	
  data	
  source	
  queries	
  
•  Combine	
  data	
  from	
  Files,	
  HBase,	
  Hive	
  in	
  one	
  query	
  
•  No	
  central	
  metadata	
  definiDons	
  necessary	
  
•  Example:	
  
–  USE HiveTest.CustomersDB!
–  SELECT Customers.customer_name, SocialData.Tweets.Count!
FROM Customers!
JOIN HBaseCatalog.SocialData SocialData !
ON Customers.Customer_id = Convert_From(SocialData.rowkey, UTF-8) !
BI	
  tool	
  integraDon	
  
	
  
•  Standard	
  JDBC/ODBC	
  drivers	
  
•  IntegraDon	
  Tableau,	
  Excel,	
  Microstrategy,	
  Toad,	
  
SQuirreL...	
  
SQL	
  support	
  
•  ANSI	
  SQL	
  compaDbility	
  
–  “SQL	
  Like”	
  not	
  enough	
  
•  SQL	
  data	
  types	
  	
  
–  SMALLINT, BIGINT, TINYINT, INT, FLOAT, DOUBLE,DATE, TIMESTAMP, DECIMAL, VARCHAR,
VARBINARY ….!
•  All	
  common	
  SQL	
  constructs	
  
•  SELECT, GROUP BY, ORDER BY, LIMIT, JOIN, HAVING, UNION, UNION ALL, IN/NOT
IN, EXISTS/NOT EXISTS,DISTINCT, BETWEEN, CREATE TABLE/VIEW AS ….!
•  Scalar and correlated sub queries!
•  Metadata	
  discovery	
  using	
  INFORMATION_SCHEMA	
  
•  Support	
  for	
  datasets	
  that	
  do	
  not	
  fit	
  in	
  memory	
  
Packaging/install	
  
•  Works	
  on	
  all	
  Hadoop	
  distribuDons	
  
•  Easy	
  ramp	
  up	
  with	
  embedded/standalone	
  
mode	
  
– Try	
  out	
  Drill	
  easily	
  on	
  your	
  machine	
  
– No	
  Hadoop	
  requirement	
  
© MapR Technologies, confidential ®
Under	
  the	
  Hood	
  
High	
  Level	
  Architecture	
  
•  Drillbits	
  run	
  on	
  each	
  node,	
  designed	
  to	
  maximize	
  data	
  locality	
  
•  Drill	
  includes	
  a	
  distributed	
  execuDon	
  environment	
  built	
  specifically	
  for	
  
distributed	
  query	
  processing	
  
•  Any	
  Drillbit	
  can	
  act	
  as	
  endpoint	
  for	
  parDcular	
  query.	
  
•  Zookeeper	
  maintains	
  ephemeral	
  cluster	
  membership	
  informaDon	
  only	
  
•  Small	
  distributed	
  cache	
  uDlizing	
  embedded	
  Hazelcast	
  maintains	
  informaDon	
  
about	
  individual	
  queue	
  depth,	
  cached	
  query	
  plans,	
  metadata,	
  locality	
  
informaDon,	
  etc.	
  
Zookeeper	
  
Storage	
  
Process	
  
Storage	
  
Process	
  
Storage	
  
Process	
  
Drillbit	
  
Distributed	
  Cache	
  
Drillbit	
  
Distributed	
  Cache	
  
Drillbit	
  
Distributed	
  Cache	
  
Basic	
  query	
  flow	
  
Zookeeper	
  
DFS/HBase	
   DFS/HBase	
   DFS/HBase	
  
Drillbit	
  
Distributed	
  Cache	
  
Drillbit	
  
Distributed	
  Cache	
  
Drillbit	
  
Distributed	
  Cache	
  
Query	
  
1.	
  Query	
  comes	
  to	
  any	
  Drillbit	
  (JDBC,	
  ODBC,	
  CLI)	
  
2.	
  Drillbit	
  generates	
  execuDon	
  plan	
  based	
  on	
  query	
  opDmizaDon	
  &	
  locality	
  
3.	
  Fragments	
  are	
  farmed	
  to	
  individual	
  nodes	
  
4.	
  Data	
  is	
  returned	
  to	
  driving	
  node	
  
Core	
  Modules	
  within	
  a	
  Drillbit	
  
SQL	
  Parser	
  
	
  
OpDmizer	
  
Physical	
  Plan	
  
DFS	
  
HBase	
  
RPC	
  Endpoint	
  
Distributed	
  Cache	
  
Storage	
  Engine	
  Interface	
  
Logical	
  Plan	
  
ExecuDon	
  
Hive	
  
Query	
  ExecuDon	
  
•  Source	
  query—what	
  we	
  want	
  to	
  do	
  (analyst	
  
friendly)	
  
•  Logical	
  Plan—	
  what	
  we	
  want	
  to	
  do	
  (language	
  
agnosDc,	
  computer	
  friendly)	
  
•  Physical	
  Plan—how	
  we	
  want	
  to	
  do	
  it	
  (the	
  best	
  
way	
  we	
  can	
  tell)	
  
•  Execu<on	
  Plan—where	
  we	
  want	
  to	
  do	
  it	
  
A	
  Query	
  engine	
  that	
  is…	
  
•  OpDmisDc/pipelined	
  
•  Columnar/Vectorized	
  
•  RunDme	
  compiled	
  
•  Late	
  binding	
  	
  
•  Extensible	
  
OpDmisDc	
  ExecuDon	
  
•  With	
  a	
  short	
  Dme	
  horizon,	
  failures	
  infrequent	
  
– Don’t	
  spend	
  energy	
  and	
  Dme	
  creaDng	
  boundaries	
  
and	
  checkpoints	
  to	
  minimize	
  recovery	
  Dme	
  
– Rerun	
  enDre	
  query	
  in	
  face	
  of	
  failure	
  
•  No	
  barriers	
  
•  No	
  persistence	
  unless	
  memory	
  overflow	
  
RunDme	
  CompilaDon	
  
•  Give	
  JIT	
  help	
  
•  Avoid	
  virtual	
  method	
  invocaDon	
  
•  Avoid	
  heap	
  allocaDon	
  and	
  object	
  overhead	
  	
  
•  Minimize	
  memory	
  overhead	
  
Record	
  versus	
  Columnar	
  
RepresentaDon	
  
Record	
   Column	
  
Data	
  Format	
  Example	
  
Donut	
   Price	
   Icing	
  
Bacon	
  Maple	
  Bar	
   2.19	
   [Maple	
  FrosDng,	
  Bacon]	
  
Portland	
  Cream	
   1.79	
   [Chocolate]	
  
The	
  Loop	
   2.29	
   [Vanilla,	
  Fruitloops]	
  
Triple	
  Chocolate	
  
PenetraDon	
  
2.79	
   [Chocolate,	
  Cocoa	
  Puffs]	
  
Record	
  Encoding	
  
Bacon	
  Maple	
  Bar,	
  2.19,	
  Maple	
  FrosDng,	
  Bacon,	
  Portland	
  Cream,	
  1.79,	
  Chocolate	
  
The	
  Loop,	
  2.29,	
  Vanilla,	
  Fruitloops,	
  Triple	
  Chocolate	
  PenetraDon,	
  2.79,	
  Chocolate,	
  
Cocoa	
  Puffs	
  
Columnar	
  Encoding	
  
Bacon	
  Maple	
  Bar,	
  Portland	
  Cream,	
  The	
  Loop,	
  Triple	
  Chocolate	
  PenetraDon	
  
2.19,	
  1.79,	
  2.29,	
  2.79	
  
Maple	
  FrosDng,	
  Bacon,	
  Chocolate,	
  Vanilla,	
  Fruitloops,	
  Chocolate,	
  Cocoa	
  Puffs	
  
	
  
Example:	
  RLE	
  and	
  Sum	
  
•  Dataset	
  	
  
–  2,	
  4	
  
–  8,	
  10	
  
•  Goal	
  
–  Sum	
  all	
  the	
  records	
  
•  Normal	
  Work	
  
–  Decompress	
  &	
  store:	
  2,	
  2,	
  2,	
  2,	
  8,	
  8,	
  8,	
  8,	
  8,	
  8,	
  8,	
  8,	
  8,	
  8	
  
–  Add:	
  2	
  +	
  2	
  +	
  2	
  +	
  2	
  +	
  8	
  +	
  8	
  +	
  8	
  +	
  8	
  +	
  8	
  +	
  8	
  +	
  8	
  +	
  8	
  +	
  8	
  +	
  8	
  
•  OpDmized	
  Work	
  
–  2	
  *	
  4	
  +	
  8	
  *	
  10	
  
–  Less	
  Memory,	
  less	
  operaDons	
  
Record	
  Batch	
  
•  Drill	
  opDmizes	
  for	
  BOTH	
  columnar	
  
STORAGE	
  and	
  ExecuDon	
  
•  Record	
  Batch	
  is	
  unit	
  of	
  work	
  for	
  the	
  
query	
  system	
  
–  Operators	
  always	
  work	
  on	
  a	
  batch	
  of	
  
records	
  
•  All	
  values	
  associated	
  with	
  a	
  
parDcular	
  collecDon	
  of	
  records	
  
•  Each	
  record	
  batch	
  must	
  have	
  a	
  
single	
  defined	
  schema	
  
•  Record	
  batches	
  are	
  pipelined	
  
between	
  operators	
  and	
  nodes	
  
RecordBatch	
  
VV	
   VV	
   VV	
   VV	
  
RecordBatch	
  
VV	
   VV	
   VV	
   VV	
  
RecordBatch	
  
VV	
   VV	
   VV	
   VV	
  
Strengths	
  of	
  RecordBatch	
  +	
  
ValueVectors	
  
•  RecordBatch	
  clearly	
  delineates	
  low	
  overhead/high	
  
performance	
  space	
  
–  Record-­‐by-­‐record,	
  avoid	
  method	
  invocaDon	
  
–  Batch-­‐by-­‐batch,	
  trust	
  JVM	
  
•  Avoid	
  serializaDon/deserializaDon	
  
•  Off-­‐heap	
  means	
  large	
  memory	
  footprint	
  without	
  GC	
  woes	
  
•  Full	
  specificaDon	
  combined	
  with	
  off-­‐heap	
  and	
  batch-­‐level	
  
execuDon	
  allows	
  C/C++	
  operators	
  as	
  necessary	
  
•  Random	
  access:	
  sort	
  without	
  copy	
  or	
  restructuring	
  
Late	
  Schema	
  Binding	
  
•  Schema	
  can	
  change	
  over	
  course	
  of	
  query	
  
•  Operators	
  are	
  able	
  to	
  reconfigure	
  themselves	
  
on	
  schema	
  change	
  events	
  
IntegraDon	
  and	
  Extensibility	
  points	
  
•  Support	
  UDFs	
  
–  UDFs/UDAFs	
  using	
  high	
  performance	
  Java	
  API	
  
•  Not	
  Hadoop	
  centric	
  
–  Work	
  with	
  other	
  NoSQL	
  soluDons	
  including	
  MongoDB,	
  Cassandra,	
  Riak,	
  etc.	
  
–  Build	
  one	
  distributed	
  query	
  engine	
  together	
  than	
  per	
  technology	
  
•  Built	
  in	
  classpath	
  scanning	
  and	
  plugin	
  concept	
  to	
  add	
  addiDonal	
  
storage	
  engines,	
  funcDon	
  and	
  operators	
  with	
  zero	
  configuraDon	
  
•  Support	
  direct	
  execuDon	
  of	
  strongly	
  specified	
  JSON	
  based	
  logical	
  
and	
  physical	
  plans	
  
–  Simplifies	
  tesDng	
  
–  Enables	
  integraDon	
  of	
  alternaDve	
  query	
  languages	
  	
  
Comparison	
  with	
  MapReduce	
  
•  Barriers	
  
–  Map	
  compleDon	
  required	
  before	
  shuffle/reduce	
  
commencement	
  
–  All	
  maps	
  must	
  complete	
  before	
  reduce	
  can	
  start	
  
–  In	
  chained	
  jobs,	
  one	
  job	
  must	
  finish	
  enDrely	
  before	
  the	
  
next	
  one	
  can	
  start	
  
•  Persistence	
  and	
  Recoverability	
  
–  Data	
  is	
  persisted	
  to	
  disk	
  between	
  each	
  barrier	
  
–  SerializaDon	
  and	
  deserializaDon	
  are	
  required	
  between	
  
execuDon	
  phase	
  
STATUS	
  
Status	
  
•  Heavy	
  acDve	
  development	
  
•  Significant	
  community	
  momentum	
  	
  
–  ~15+	
  contributors	
  
–  400+	
  people	
  in	
  Drill	
  mailing	
  lists	
  
–  400+	
  members	
  in	
  Bay	
  area	
  Drill	
  user	
  group	
  
•  Current	
  state	
  :	
  Alpha	
  
•  Timeline	
  
1.0	
  Beta	
  (End	
  of	
  Q2,	
  2014)	
   1.0	
  GA	
  (Q3,	
  2014)	
  
Roadmap	
  
• Low-latency SQL
• Schema-less execution
• Files & HBase/M7 support
• Hive integration
• ANSI SQL + Extensions for
nested data
• BI and SQL tool support
via ODBC/JDBC
Data exploration/ad-hoc
queries
1.0
• HBase query speedup
• Rich nested data API
• Analytical functions
• YARN integration
• Security
Advanced analytics and
operational data
1.1
• Ultra low latency queries
• Single row insert/update/
delete
• Workload management
Operational SQL
2.0
Interested	
  in	
  Apache	
  Drill?	
  
•  Join	
  the	
  community	
  
–  Join	
  the	
  Drill	
  mailing	
  lists	
  
•  drill-­‐user@incubator.apache.org	
  
•  drill-­‐dev@incubator.apache.org	
  	
  
–  Contribute	
  
•  Use	
  cases/Sample	
  queries,	
  JIRAs,	
  code,	
  unit	
  tests,	
  documentaDon,	
  ...	
  	
  
–  Fork	
  us	
  on	
  GitHub:	
  hhp://github.com/apache/incubator-­‐drill/	
  
–  Create	
  a	
  JIRA:	
  hhps://issues.apache.org/jira/browse/DRILL	
  
•  Resources	
  
–  Try	
  out	
  Drill	
  in	
  10mins	
  
–  hhp://incubator.apache.org/drill/	
  
–  hhps://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki	
  	
  
DEMO	
  

Contenu connexe

Tendances

Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
 
Cool NoSQL on Azure with DocumentDB
Cool NoSQL on Azure with DocumentDBCool NoSQL on Azure with DocumentDB
Cool NoSQL on Azure with DocumentDBJan Hentschel
 
Modeling Data in MongoDB
Modeling Data in MongoDBModeling Data in MongoDB
Modeling Data in MongoDBlehresman
 
Extensible RESTful Applications with Apache TinkerPop
Extensible RESTful Applications with Apache TinkerPopExtensible RESTful Applications with Apache TinkerPop
Extensible RESTful Applications with Apache TinkerPopVarun Ganesh
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...OpenSource Connections
 
Introducing Azure DocumentDB - NoSQL, No Problem
Introducing Azure DocumentDB - NoSQL, No ProblemIntroducing Azure DocumentDB - NoSQL, No Problem
Introducing Azure DocumentDB - NoSQL, No ProblemAndrew Liu
 
Semi Formal Model for Document Oriented Databases
Semi Formal Model for Document Oriented DatabasesSemi Formal Model for Document Oriented Databases
Semi Formal Model for Document Oriented DatabasesDaniel Coupal
 
Share Point2007 Best Practices Final
Share Point2007 Best Practices FinalShare Point2007 Best Practices Final
Share Point2007 Best Practices FinalMarianne Sweeny
 
SoftLeader Jackson Training
SoftLeader Jackson TrainingSoftLeader Jackson Training
SoftLeader Jackson TrainingJini Lee
 
Mongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorialMongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorialMohan Rathour
 

Tendances (15)

Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
Cool NoSQL on Azure with DocumentDB
Cool NoSQL on Azure with DocumentDBCool NoSQL on Azure with DocumentDB
Cool NoSQL on Azure with DocumentDB
 
Modeling Data in MongoDB
Modeling Data in MongoDBModeling Data in MongoDB
Modeling Data in MongoDB
 
Extensible RESTful Applications with Apache TinkerPop
Extensible RESTful Applications with Apache TinkerPopExtensible RESTful Applications with Apache TinkerPop
Extensible RESTful Applications with Apache TinkerPop
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Introducing Azure DocumentDB - NoSQL, No Problem
Introducing Azure DocumentDB - NoSQL, No ProblemIntroducing Azure DocumentDB - NoSQL, No Problem
Introducing Azure DocumentDB - NoSQL, No Problem
 
Solr5
Solr5Solr5
Solr5
 
Semi Formal Model for Document Oriented Databases
Semi Formal Model for Document Oriented DatabasesSemi Formal Model for Document Oriented Databases
Semi Formal Model for Document Oriented Databases
 
Share Point2007 Best Practices Final
Share Point2007 Best Practices FinalShare Point2007 Best Practices Final
Share Point2007 Best Practices Final
 
elasticsearch
elasticsearchelasticsearch
elasticsearch
 
SoftLeader Jackson Training
SoftLeader Jackson TrainingSoftLeader Jackson Training
SoftLeader Jackson Training
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
How Solr Search Works
How Solr Search WorksHow Solr Search Works
How Solr Search Works
 
Mongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorialMongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorial
 

En vedette

The Briefcase Cluster – Enabling Big Data Everywhere
The Briefcase Cluster – Enabling Big Data Everywhere The Briefcase Cluster – Enabling Big Data Everywhere
The Briefcase Cluster – Enabling Big Data Everywhere MapR Technologies
 
Kenshoo - Use Hadoop, One Week, No Coding
Kenshoo - Use Hadoop, One Week, No CodingKenshoo - Use Hadoop, One Week, No Coding
Kenshoo - Use Hadoop, One Week, No CodingMapR Technologies
 
Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centrejatin batra
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in sparkPeng Cheng
 
MyGym Project with mobile app demo
MyGym Project with mobile app demoMyGym Project with mobile app demo
MyGym Project with mobile app demoSusan Oldham
 
Project presentation
Project presentationProject presentation
Project presentationDaily Ki Jobs
 
Gym Management System User Manual
Gym Management System User ManualGym Management System User Manual
Gym Management System User ManualDavid O' Connor
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
project report on FITNESS HUB
project report on FITNESS HUBproject report on FITNESS HUB
project report on FITNESS HUBShwetanshu Gupta
 
Smart Gym System documentation
Smart Gym System documentationSmart Gym System documentation
Smart Gym System documentationTuvshinbayar Davaa
 
library management system in SQL
library management system in SQLlibrary management system in SQL
library management system in SQLfarouq umar
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 

En vedette (20)

The Briefcase Cluster – Enabling Big Data Everywhere
The Briefcase Cluster – Enabling Big Data Everywhere The Briefcase Cluster – Enabling Big Data Everywhere
The Briefcase Cluster – Enabling Big Data Everywhere
 
Kenshoo - Use Hadoop, One Week, No Coding
Kenshoo - Use Hadoop, One Week, No CodingKenshoo - Use Hadoop, One Week, No Coding
Kenshoo - Use Hadoop, One Week, No Coding
 
search engines
search enginessearch engines
search engines
 
Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centre
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Gymanisum project
Gymanisum projectGymanisum project
Gymanisum project
 
MyGym Project with mobile app demo
MyGym Project with mobile app demoMyGym Project with mobile app demo
MyGym Project with mobile app demo
 
Fitness center
Fitness centerFitness center
Fitness center
 
Project presentation
Project presentationProject presentation
Project presentation
 
Gym Management System User Manual
Gym Management System User ManualGym Management System User Manual
Gym Management System User Manual
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
project report on FITNESS HUB
project report on FITNESS HUBproject report on FITNESS HUB
project report on FITNESS HUB
 
Gym proposal
Gym proposalGym proposal
Gym proposal
 
Smart Gym System documentation
Smart Gym System documentationSmart Gym System documentation
Smart Gym System documentation
 
Gym Final Report.ppt
Gym Final Report.pptGym Final Report.ppt
Gym Final Report.ppt
 
library management system in SQL
library management system in SQLlibrary management system in SQL
library management system in SQL
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 

Similaire à Building Highly Flexible, High Performance Query Engines

Gab document db scaling database
Gab   document db scaling databaseGab   document db scaling database
Gab document db scaling databaseMUG Perú
 
Characteristics of no sql databases
Characteristics of no sql databasesCharacteristics of no sql databases
Characteristics of no sql databasesDipti Borkar
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherMongoDB
 
Big Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision MakerBig Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision MakerMongoDB
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB
 
The Fine Art of Schema Design in MongoDB: Dos and Don'ts
The Fine Art of Schema Design in MongoDB: Dos and Don'tsThe Fine Art of Schema Design in MongoDB: Dos and Don'ts
The Fine Art of Schema Design in MongoDB: Dos and Don'tsMatias Cascallares
 
Socialite, the Open Source Status Feed
Socialite, the Open Source Status FeedSocialite, the Open Source Status Feed
Socialite, the Open Source Status FeedMongoDB
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudRobert Dempsey
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 
Json to hive_schema_generator
Json to hive_schema_generatorJson to hive_schema_generator
Json to hive_schema_generatorPayal Jain
 
Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Anuj Sahni
 
Mongodb intro
Mongodb introMongodb intro
Mongodb introchristkv
 
Combine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quicklCombine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quicklNeo4j
 
曾勇 Elastic search-intro
曾勇 Elastic search-intro曾勇 Elastic search-intro
曾勇 Elastic search-introShaoning Pan
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopAvkash Chauhan
 
Elastic search intro-@lamper
Elastic search intro-@lamperElastic search intro-@lamper
Elastic search intro-@lampermedcl
 

Similaire à Building Highly Flexible, High Performance Query Engines (20)

Gab document db scaling database
Gab   document db scaling databaseGab   document db scaling database
Gab document db scaling database
 
Characteristics of no sql databases
Characteristics of no sql databasesCharacteristics of no sql databases
Characteristics of no sql databases
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop Together
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Big Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision MakerBig Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Guidelines and Examples for the Enterprise Decision Maker
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDB
 
The Fine Art of Schema Design in MongoDB: Dos and Don'ts
The Fine Art of Schema Design in MongoDB: Dos and Don'tsThe Fine Art of Schema Design in MongoDB: Dos and Don'ts
The Fine Art of Schema Design in MongoDB: Dos and Don'ts
 
Socialite, the Open Source Status Feed
Socialite, the Open Source Status FeedSocialite, the Open Source Status Feed
Socialite, the Open Source Status Feed
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
REST easy with API Platform
REST easy with API PlatformREST easy with API Platform
REST easy with API Platform
 
Json to hive_schema_generator
Json to hive_schema_generatorJson to hive_schema_generator
Json to hive_schema_generator
 
Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0
 
Mongodb intro
Mongodb introMongodb intro
Mongodb intro
 
Combine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quicklCombine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quickl
 
曾勇 Elastic search-intro
曾勇 Elastic search-intro曾勇 Elastic search-intro
曾勇 Elastic search-intro
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
Elastic search intro-@lamper
Elastic search intro-@lamperElastic search intro-@lamper
Elastic search intro-@lamper
 

Plus de MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

Plus de MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Dernier

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Dernier (20)

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Building Highly Flexible, High Performance Query Engines

  • 1. Building Highly Flexible, High Performance query engines - Highlights from Apache Drill project              Neeraja  Rentachintala                Director,  Product  Management    MapR  Technologies  
  • 2. Agenda   •  Apache  Drill  overview   •  Using  Drill   •  Under  the  Hood   •  Status  and  progress   •  Demo  
  • 4. Hadoop  workloads  and  APIs   Use  case   ETL  and   aggregaDon   (batch)   PredicDve   modeling  and   analyDcs   (batch)   InteracDve  SQL  –   Data  exploraDon,   Adhoc  queries  &   reporDng   Search   OperaDonal   (user  facing   applicaDons,   point  queries)   API   MapReduce   Hive   Pig   Cascading   Mahout   MLLib   Spark   Drill     Shark   Impala   Hive  on  Tez   Presto   Solr   ElasDcsear ch   HBase  API   Phoenix  
  • 5. InteracDve  SQL  and  Hadoop   •  Opens up Hadoop data to broader audience –  Existing SQL skill sets –  Broad eco system of tools •  New and improved BI/Analytics use cases –  Analysis on more raw data, new types of data and real time data •  Cost savings   Enterprise users
  • 6. Data landscape is changing New  types  of  applica<ons   •  Social,  mobile,  Web,  “Internet  of   Things”,  Cloud…   •  IteraDve/Agile  in  nature   •  More  users,  more  data   New  data  models  &  data  types   •  Flexible  (schema-­‐less)  data   •  Rapidly  changing   •  Semi-­‐structured/Nested  data   {        "data":  [                    "id":  "X999_Y999",                    "from":  {                          "name":  "Tom  Brady",  "id":  "X12"                    },                    "message":  "Looking  forward  to  2014!",                    "acDons":  [                          {                                "name":  "Comment",                                "link":  "hhp://www.facebook.com/X99/posts  Y999"                          },                          {                                "name":  "Like",                                "link":  "hhp://www.facebook.com/X99/posts  Y999"                          }                    ],                    "type":  "status",                    "created_Dme":  "2013-­‐08-­‐02T21:27:44+0000",                    "updated_Dme":  "2013-­‐08-­‐02T21:27:44+0000"              }                }   JSON  
  • 7. Tradi<onal  datasets     •  Comes  from  transacDonal  applicaDons   •  Stored  for  historical  purposes  and/or   for  large  scale  ETL/AnalyDcs   •  Well  defined  schemas   •  Managed  centrally  by  DBAs   •  No  frequent  changes  to  schema   •  Flat  datasets       New  datasets   •  Comes  from  new  applicaDons  (Ex:  Social   feeds,  clickstream,  logs,  sensor  data)   •  Enable  new  use  cases  such  as  Customer   SaDsfacDon,  Product/Service   opDmizaDon   •  Flexible  data  models/managed  within   applicaDons   •  Schemas  evolving  rapidly   •  Semi-­‐structured/Nested  data     Hadoop evolving as central hub for analysis Provides  Cost  effecDve,  flexible  way  to  store  and  and  process  data  at    scale  
  • 8. ExisDng  SQL  approaches  will  not  always  work  for   big  data  needs   •  New  data  models/types  don’t  map  well  to  the  relaDonal  models   –  Many  data  sources  do  not  have  rigid  schemas  (HBase,  Mongo  etc)   •  Each  record  has  a  separate  schema   •  Sparse  and  wide  rows   –  Flahening  nested  data  is  error-­‐prone  and  oten  impossible   •  Think  about  repeated  and  opDonal  fields  at  every  level…   •  A  single  HBase  value  could  be  a  JSON  document  (compound  nested  type)   •  Centralized  schemas  are  hard  to  manage  for  big  data   •  Rapidly  evolving  data  source  schemas   •  Lots  of  new  data  sources   •  Third  party  data   •  Unknown  quesDons            Model  data    Move  data  into   tradi<onal  systems   New questions /requirements Schema changes or new data sources DBA/DWH teams Analyze Big data Enterprise Users
  • 9. Apache  Drill   Open  Source  SQL  on  Hadoop  for  Agility  with  Big  Data  explora<on   FLEXIBLE  SCHEMA   MANAGEMENT   ANALYTICS  ON   NOSQL  DATA   PLUG  AND  PLAY   WITH  EXISTING   TOOLS   Analyze  data  with   or  without   centralized   schemas         Analyze  data  using   familiar  BI/AnalyDcs   and  SQL  based  tools   Analyze  semi   structured  &   nested  data  with   no  modeling/ETL  
  • 10. Flexible  schema  management    {          “ID”:  1,          “NAME”:  “Fairmont  San  Francisco”,          “DESCRIPTION”:  “Historic  grandeur…”,          “AVG_REVIEWER_SCORE”:  “4.3”,          “AMENITY”:  {“TYPE”:  “gym”,                                                                DESCRIPTION:  “fitness  center”                                                          },                                                          {“TYPE”:  “wifi”,                                                            “DESCRIPTION”:  “free  wifi”},          “RATE_TYPE”:  “nightly”,            “PRICE”:  “$199”,            “REVIEWS”:  [“review_1”,  “review_2”],            “ATTRACTIONS”:  “Chinatown”,      }   JSON   ExisDng  SQL   soluDons   X HotelID   AmenityID   1   1   1   2   ID   Type   Descrip Don   1   Gym   Fitness   center   2   Wifi   Free  wifi  
  • 11. Drill    {          “ID”:  1,          “NAME”:  “Fairmont  San  Francisco”,          “DESCRIPTION”:  “Historic  grandeur…”,          “AVG_REVIEWER_SCORE”:  “4.3”,          “AMENITY”:  {“TYPE”:  “gym”,                                                                DESCRIPTION:  “fitness   center”                                                          },                                                          {“TYPE”:  “wifi”,                                                            “DESCRIPTION”:  “free  wifi”},          “RATE_TYPE”:  “nightly”,            “PRICE”:  “$199”,            “REVIEWS”:  [“review_1”,  “review_2”],            “ATTRACTIONS”:  “Chinatown”,      }   JSON   Drill   Flexible  schema  management   HotelID   AmenityID   1   1   1   2   ID   Type   Descrip Don   1   Gym   Fitness   center   2   Wifi   Free  wifi   Drill  doesn’t  require  any  schema  defini<ons  to  query  data  making  it  faster  to  get   insights  from  data  for  users.  Drill  leverages  schema  defini<ons  if  exists.  
  • 12. Key  features   •  Dynamic/schema-­‐less  queries   •  Nested  data   •  Apache  Hive  integraDon   •  ANSI  SQL/BI  tool  integraDon      
  • 13. Querying  files   •  Direct  queries  on  a  local  or  a  distributed  file  system  (HDFS,  S3   etc)   •  Configure  one  or  more  directories  in  file  system  as  “Workspaces”     –  Think  of  this  as  similar  to  schemas  in  databases   –  Default  workspace  points  to  “root”  locaDon   •  Specify  a  single  file  or  a  directory  as  ‘Table’  within  query   •  Specify  schema  in  query  or  let  Drill  discover  it   •  Example:   •  SELECT * FROM dfs.users.`/home/mapr/sample-data/profiles.json`! !! dfs   File  system  as  data  source   users   Workspace  (corresponds  to  a  directory)   /home/mapr/sample-data/ profiles.json! Table  
  • 14. More  examples   •  Query  on  single  file   SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan/part0001.txt`! •  Query  on  directory   SELECT * FROM dfs.logs.`AppServerLogs/2014/Jan` where errorLevel=1;! •  Joins  on  files   SELECT  c.c_custkey,sum(o.o_totalprice) ! FROM! !dfs.`/home/mapr/tpch/customer.parquet` c ! !JOIN! !dfs.`/home/mapr/tpch/orders.parquet` o! !ON c.c_custkey = o.o_custkey! GROUP BY c.c_custkey ! LIMIT 10!
  • 15. Querying  HBase   •  Direct  queries  on  HBase  tables   –  SELECT row_key, cf1.month, cf1.year FROM hbase.table1;! –  SELECT CONVERT_FROM(row_key, UTF-8) as HotelName from FROM HotelData   •  No  need  to  define  a  parallel/overlay  schema  in  Hive   •  Encode  and  Decode  data  from  HBase  using  Convert  funcDons   –  Convert_To  and  Convert_From   !
  • 16. Nested  data   •  Nested  data  as  first  class  enDty:  Extensions  to  SQL  for  nested   data  types,  similar  to  BigQuery       •  No  upfront  flahening/modeling  required   •  Generic  architecture  for  a  broad  variety  of  nested  data  types   (eg:JSON,  BSON,  XML,  AVRO,  Protocol  Buffers)   •  Performance  with  ground  up  design  for  nested  data   •  Example:   SELECT! !c.name, c.address, REPEATED_COUNT(c.children) ! FROM(! SELECT! ! !CONVERT_FROM(cf1.user-json-blob, JSON) AS c ! FROM! !hbase.table1! )  
  • 17.   Apache  Hive  integraDon     •  Plug  and  Play  integraDon  in  exisDng   Hive  deployments   •  Use  Drill  to  query  data  in  Hive   tables/views     •  Support  to  work  with  more  than   one  Hive  metastore   •  Support  for  all  Hive  file  formats   •  Ability  to  use  Hive  UDFs  as  part  of   Drill  queries   Hive   meta   store   Files   HBase   Hive   SQL  layer     Drill   SQL  layer  +   execuDon   engine  MapReduce   execuDon   framework    
  • 18. Cross  data  source  queries   •  Combine  data  from  Files,  HBase,  Hive  in  one  query   •  No  central  metadata  definiDons  necessary   •  Example:   –  USE HiveTest.CustomersDB! –  SELECT Customers.customer_name, SocialData.Tweets.Count! FROM Customers! JOIN HBaseCatalog.SocialData SocialData ! ON Customers.Customer_id = Convert_From(SocialData.rowkey, UTF-8) !
  • 19. BI  tool  integraDon     •  Standard  JDBC/ODBC  drivers   •  IntegraDon  Tableau,  Excel,  Microstrategy,  Toad,   SQuirreL...  
  • 20. SQL  support   •  ANSI  SQL  compaDbility   –  “SQL  Like”  not  enough   •  SQL  data  types     –  SMALLINT, BIGINT, TINYINT, INT, FLOAT, DOUBLE,DATE, TIMESTAMP, DECIMAL, VARCHAR, VARBINARY ….! •  All  common  SQL  constructs   •  SELECT, GROUP BY, ORDER BY, LIMIT, JOIN, HAVING, UNION, UNION ALL, IN/NOT IN, EXISTS/NOT EXISTS,DISTINCT, BETWEEN, CREATE TABLE/VIEW AS ….! •  Scalar and correlated sub queries! •  Metadata  discovery  using  INFORMATION_SCHEMA   •  Support  for  datasets  that  do  not  fit  in  memory  
  • 21. Packaging/install   •  Works  on  all  Hadoop  distribuDons   •  Easy  ramp  up  with  embedded/standalone   mode   – Try  out  Drill  easily  on  your  machine   – No  Hadoop  requirement  
  • 22. © MapR Technologies, confidential ® Under  the  Hood  
  • 23. High  Level  Architecture   •  Drillbits  run  on  each  node,  designed  to  maximize  data  locality   •  Drill  includes  a  distributed  execuDon  environment  built  specifically  for   distributed  query  processing   •  Any  Drillbit  can  act  as  endpoint  for  parDcular  query.   •  Zookeeper  maintains  ephemeral  cluster  membership  informaDon  only   •  Small  distributed  cache  uDlizing  embedded  Hazelcast  maintains  informaDon   about  individual  queue  depth,  cached  query  plans,  metadata,  locality   informaDon,  etc.   Zookeeper   Storage   Process   Storage   Process   Storage   Process   Drillbit   Distributed  Cache   Drillbit   Distributed  Cache   Drillbit   Distributed  Cache  
  • 24. Basic  query  flow   Zookeeper   DFS/HBase   DFS/HBase   DFS/HBase   Drillbit   Distributed  Cache   Drillbit   Distributed  Cache   Drillbit   Distributed  Cache   Query   1.  Query  comes  to  any  Drillbit  (JDBC,  ODBC,  CLI)   2.  Drillbit  generates  execuDon  plan  based  on  query  opDmizaDon  &  locality   3.  Fragments  are  farmed  to  individual  nodes   4.  Data  is  returned  to  driving  node  
  • 25. Core  Modules  within  a  Drillbit   SQL  Parser     OpDmizer   Physical  Plan   DFS   HBase   RPC  Endpoint   Distributed  Cache   Storage  Engine  Interface   Logical  Plan   ExecuDon   Hive  
  • 26. Query  ExecuDon   •  Source  query—what  we  want  to  do  (analyst   friendly)   •  Logical  Plan—  what  we  want  to  do  (language   agnosDc,  computer  friendly)   •  Physical  Plan—how  we  want  to  do  it  (the  best   way  we  can  tell)   •  Execu<on  Plan—where  we  want  to  do  it  
  • 27. A  Query  engine  that  is…   •  OpDmisDc/pipelined   •  Columnar/Vectorized   •  RunDme  compiled   •  Late  binding     •  Extensible  
  • 28. OpDmisDc  ExecuDon   •  With  a  short  Dme  horizon,  failures  infrequent   – Don’t  spend  energy  and  Dme  creaDng  boundaries   and  checkpoints  to  minimize  recovery  Dme   – Rerun  enDre  query  in  face  of  failure   •  No  barriers   •  No  persistence  unless  memory  overflow  
  • 29. RunDme  CompilaDon   •  Give  JIT  help   •  Avoid  virtual  method  invocaDon   •  Avoid  heap  allocaDon  and  object  overhead     •  Minimize  memory  overhead  
  • 30. Record  versus  Columnar   RepresentaDon   Record   Column  
  • 31. Data  Format  Example   Donut   Price   Icing   Bacon  Maple  Bar   2.19   [Maple  FrosDng,  Bacon]   Portland  Cream   1.79   [Chocolate]   The  Loop   2.29   [Vanilla,  Fruitloops]   Triple  Chocolate   PenetraDon   2.79   [Chocolate,  Cocoa  Puffs]   Record  Encoding   Bacon  Maple  Bar,  2.19,  Maple  FrosDng,  Bacon,  Portland  Cream,  1.79,  Chocolate   The  Loop,  2.29,  Vanilla,  Fruitloops,  Triple  Chocolate  PenetraDon,  2.79,  Chocolate,   Cocoa  Puffs   Columnar  Encoding   Bacon  Maple  Bar,  Portland  Cream,  The  Loop,  Triple  Chocolate  PenetraDon   2.19,  1.79,  2.29,  2.79   Maple  FrosDng,  Bacon,  Chocolate,  Vanilla,  Fruitloops,  Chocolate,  Cocoa  Puffs    
  • 32. Example:  RLE  and  Sum   •  Dataset     –  2,  4   –  8,  10   •  Goal   –  Sum  all  the  records   •  Normal  Work   –  Decompress  &  store:  2,  2,  2,  2,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8   –  Add:  2  +  2  +  2  +  2  +  8  +  8  +  8  +  8  +  8  +  8  +  8  +  8  +  8  +  8   •  OpDmized  Work   –  2  *  4  +  8  *  10   –  Less  Memory,  less  operaDons  
  • 33. Record  Batch   •  Drill  opDmizes  for  BOTH  columnar   STORAGE  and  ExecuDon   •  Record  Batch  is  unit  of  work  for  the   query  system   –  Operators  always  work  on  a  batch  of   records   •  All  values  associated  with  a   parDcular  collecDon  of  records   •  Each  record  batch  must  have  a   single  defined  schema   •  Record  batches  are  pipelined   between  operators  and  nodes   RecordBatch   VV   VV   VV   VV   RecordBatch   VV   VV   VV   VV   RecordBatch   VV   VV   VV   VV  
  • 34. Strengths  of  RecordBatch  +   ValueVectors   •  RecordBatch  clearly  delineates  low  overhead/high   performance  space   –  Record-­‐by-­‐record,  avoid  method  invocaDon   –  Batch-­‐by-­‐batch,  trust  JVM   •  Avoid  serializaDon/deserializaDon   •  Off-­‐heap  means  large  memory  footprint  without  GC  woes   •  Full  specificaDon  combined  with  off-­‐heap  and  batch-­‐level   execuDon  allows  C/C++  operators  as  necessary   •  Random  access:  sort  without  copy  or  restructuring  
  • 35. Late  Schema  Binding   •  Schema  can  change  over  course  of  query   •  Operators  are  able  to  reconfigure  themselves   on  schema  change  events  
  • 36. IntegraDon  and  Extensibility  points   •  Support  UDFs   –  UDFs/UDAFs  using  high  performance  Java  API   •  Not  Hadoop  centric   –  Work  with  other  NoSQL  soluDons  including  MongoDB,  Cassandra,  Riak,  etc.   –  Build  one  distributed  query  engine  together  than  per  technology   •  Built  in  classpath  scanning  and  plugin  concept  to  add  addiDonal   storage  engines,  funcDon  and  operators  with  zero  configuraDon   •  Support  direct  execuDon  of  strongly  specified  JSON  based  logical   and  physical  plans   –  Simplifies  tesDng   –  Enables  integraDon  of  alternaDve  query  languages    
  • 37. Comparison  with  MapReduce   •  Barriers   –  Map  compleDon  required  before  shuffle/reduce   commencement   –  All  maps  must  complete  before  reduce  can  start   –  In  chained  jobs,  one  job  must  finish  enDrely  before  the   next  one  can  start   •  Persistence  and  Recoverability   –  Data  is  persisted  to  disk  between  each  barrier   –  SerializaDon  and  deserializaDon  are  required  between   execuDon  phase  
  • 39. Status   •  Heavy  acDve  development   •  Significant  community  momentum     –  ~15+  contributors   –  400+  people  in  Drill  mailing  lists   –  400+  members  in  Bay  area  Drill  user  group   •  Current  state  :  Alpha   •  Timeline   1.0  Beta  (End  of  Q2,  2014)   1.0  GA  (Q3,  2014)  
  • 40. Roadmap   • Low-latency SQL • Schema-less execution • Files & HBase/M7 support • Hive integration • ANSI SQL + Extensions for nested data • BI and SQL tool support via ODBC/JDBC Data exploration/ad-hoc queries 1.0 • HBase query speedup • Rich nested data API • Analytical functions • YARN integration • Security Advanced analytics and operational data 1.1 • Ultra low latency queries • Single row insert/update/ delete • Workload management Operational SQL 2.0
  • 41. Interested  in  Apache  Drill?   •  Join  the  community   –  Join  the  Drill  mailing  lists   •  drill-­‐user@incubator.apache.org   •  drill-­‐dev@incubator.apache.org     –  Contribute   •  Use  cases/Sample  queries,  JIRAs,  code,  unit  tests,  documentaDon,  ...     –  Fork  us  on  GitHub:  hhp://github.com/apache/incubator-­‐drill/   –  Create  a  JIRA:  hhps://issues.apache.org/jira/browse/DRILL   •  Resources   –  Try  out  Drill  in  10mins   –  hhp://incubator.apache.org/drill/   –  hhps://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki