Cloudera Impala
Charm City Linux, March 2014

Alex Moundalexis
@technmsg
  
Thirty Seconds About Alex
•  Solutions Architect
•  aka consultant
•  government
•  infrastructure
•  former coder of Perl
•  former administrator
•  likes shiny objects
  
What Does Cloudera Do?
•  product
•  distribution of Hadoop components, Apache licensed
•  enterprise tooling
•  support
•  training
•  services (aka consulting)
•  community
  
Disclaimer
•  Cloudera builds software
•  most donated to Apache
•  some closed-source
•  Cloudera “products” I reference are open source
•  Apache Licensed
•  source code is on GitHub
•  https://github.com/cloudera
  
What This Talk Isn’t About
•  deploying
•  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  sizing & tuning
•  depends heavily on data and workload
•  coding
•  unless you count XML or CSV or SQL
•  algorithms
  
The Apache Hadoop Ecosystem
Quick and dirty, for context.
  
Why “Ecosystem?”
•  In the beginning, just Hadoop
•  HDFS
•  MapReduce
•  Today, dozens of interrelated components
•  I/O
•  Processing
•  Specialty Applications
•  Configuration
•  Workflow
  
HDFS
•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
•  http://research.google.com/archive/gfs.html
  
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [OSCON ’07]

MapReduce (MR)
•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
•  http://research.google.com/archive/mapreduce.html
  
Under the Covers

You specify map() and reduce() functions. The framework does the rest.
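To make the division of labor concrete, here is a minimal single-process sketch of a word count. It is not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` is only a toy stand-in for what the real framework does across many machines.

```python
from collections import defaultdict


def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Combine all the counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:                       # map phase
        for key, value in mapper(line):
            groups[key].append(value)        # "shuffle": group values by key
    # reduce phase: one reducer call per distinct key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


word_counts = run_job(["big data", "big cluster"], map_fn, reduce_fn)
print(word_counts)
```

Only `map_fn` and `reduce_fn` carry the problem-specific logic; everything in `run_job` is the part a real framework would handle, along with distribution and fault tolerance that this sketch leaves out.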

Apache Hive
•  Abstraction of Hadoop’s Java API
•  HiveQL “compiles” down to MR
•  a “SQL-like” language
•  Eases analysis using MapReduce
  
Apache Hive Metastore
•  Maps HDFS files to DB-like resources
•  Databases
•  Tables
•  Column/field names, data types
•  Roles/users
•  InputFormat/OutputFormat
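As a rough illustration of the kind of mapping listed above, the sketch below models one table's record as a plain Python dataclass. This is not the metastore's real schema or API: `TableEntry`, its fields, the path, and the format labels are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    """Hypothetical, simplified stand-in for one table's metastore record."""
    database: str
    table: str
    columns: dict        # column name -> data type
    location: str        # HDFS directory that holds the table's files
    input_format: str    # how files are read (placeholder label)
    output_format: str   # how files are written (placeholder label)


people = TableEntry(
    database="default",
    table="people",
    columns={"name": "string", "state": "string",
             "employer": "string", "year": "int"},
    location="hdfs:///example/warehouse/people",   # illustrative path only
    input_format="ExampleTextInput",               # not a real class name
    output_format="ExampleTextOutput",             # not a real class name
)


def describe(entry):
    """Return one 'column type' line per column, in declaration order."""
    return [f"{name} {dtype}" for name, dtype in entry.columns.items()]


print("\n".join(describe(people)))
```

The point of the sketch is only that files on HDFS carry no schema of their own; something has to record which directory is which table and how its columns are typed.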
  
But wait…
WHY DO WE NEED THIS?
  
Super Shady SQL Supplement
I am not a SQL wizard by any means…
  
A Simple Relational Database
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
  
Interacting with Relational Data
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

SELECT * FROM people;
  
Requesting Specific Fields
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

SELECT name, state FROM people;
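The two queries so far can be tried locally. This is a minimal sketch that loads the slide's `people` table into an in-memory SQLite database through Python's standard `sqlite3` module; SQLite is an assumption made for convenience here, not the engine the talk is about.

```python
import sqlite3

# In-memory database holding the example table from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Only the two requested fields of every row.
name_state = conn.execute("SELECT name, state FROM people").fetchall()

print(all_rows)
print(name_state)
```

`SELECT *` returns four-column tuples, while naming the fields narrows each row to just `name` and `state`; the number of rows is the same in both cases.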
  
Requesting Specific Rows
name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

SELECT name, state FROM people WHERE year 2012;
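The comparison operator in the query above did not survive in this copy of the slides. The sketch below assumes it was `>` (an assumption, not something the source confirms) and runs the filter against the same table in an in-memory SQLite database, again used only as a convenient local stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# ASSUMPTION: the lost operator is ">"; ORDER BY is added only to make
# the output order deterministic for this example.
recent = conn.execute(
    "SELECT name, state FROM people WHERE year > 2012 ORDER BY name"
).fetchall()

print(recent)
```

With `>` the filter keeps the two 2013 rows; a different operator would of course select a different subset.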
  
Two Simple Tables
owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
  
Joining Two Tables
owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011
  
Joining Two Tables
SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland
  
Varying Implementation of JOIN
SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
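One way to see what ends up in the "?" cells is to run the join locally. The sketch below uses Python's `sqlite3` with an in-memory database, which is an assumption for illustration; other engines may display missing values differently. The missing pet names are stored as NULL, and an `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Alex", "Cactus", "Marvin"),
        ("Joey", "Cat", "Brain"),
        ("Sean", "None", None),       # no pet name on the slide: stored as NULL
        ("Paris", "Unknown", None),   # no pet name on the slide: stored as NULL
    ],
)

joined = conn.execute(
    """
    SELECT people.name AS owner, people.state AS state, pets.name AS pet
    FROM people LEFT JOIN pets ON people.name = pets.owner
    ORDER BY owner
    """
).fetchall()

for row in joined:
    print(row)
```

In this setup the pet column comes back as Python `None` (SQL NULL) for Sean and Paris. A LEFT JOIN would give the same NULL if a person had no row in `pets` at all, since every row from `people` is kept either way.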
  
Cloudera Impala
Familiar interface, but more powerful.
  
Cloudera Impala
•  Interactive query on Hadoop
•  think seconds, not minutes
•  Nearly ANSI-92 standard SQL
•  compatible with HiveQL
•  Native MPP query engine
•  built for low-latency queries
  
Cloudera Impala – Design Choices
•  Native daemons, written in C/C++
•  No JVM, no MapReduce
•  Saturate disks on reads
•  Uses in-memory HDFS caching
•  Re-uses Hive metastore
•  Not as fault-tolerant as MapReduce
  
Cloudera Impala – Architecture
•  Impala Daemon
•  runs on every node
•  handles client requests
•  handles query planning and execution
•  State Store Daemon
•  provides name service
•  metadata distribution
•  used for finding data
  
Impala Query Execution
[Diagram: a SQL App connects over ODBC and sends a SQL request. Shared services: Hive Metastore, HDFS NN, Statestore. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Query results flow back to the SQL App.]
1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
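The five steps can be mimicked in a few lines of single-process Python. This is a toy model only and does not reflect Impala's actual code or interfaces: `NODE_DATA`, `plan`, `execute_fragment`, and `coordinate` are hypothetical names, the data placement is invented, and the "streaming" between daemons is reduced to merging partial counts in a loop.

```python
from collections import Counter

# Hypothetical data placement: each node holds some (name, state) rows.
NODE_DATA = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def plan(request):
    """Step 2: turn the request into one plan fragment per node with data."""
    return [{"node": node, "op": request} for node in NODE_DATA]


def execute_fragment(fragment):
    """Step 3: run a fragment where its data lives; return a partial count."""
    rows = NODE_DATA[fragment["node"]]
    return Counter(state for _name, state in rows)


def coordinate(request):
    """Steps 1 to 5 in miniature: accept, plan, scatter, merge, return."""
    fragments = plan(request)                  # step 2
    total = Counter()
    for fragment in fragments:                 # step 3
        total += execute_fragment(fragment)    # step 4: merge partial results
    return dict(total)                         # step 5: back to the client


result = coordinate("count_by_state")          # step 1: request arrives
print(result)
```

The design point the sketch tries to capture is that work is pushed to wherever the data already sits, and only small intermediate results travel between nodes before the coordinator answers the client.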
  
Cloudera Impala – Results
•  Allows for fast iteration/discovery
•  How much faster?
•  3-4x faster on I/O bound workloads
•  up to 45x faster on multi-MR queries
•  up to 90x faster on in-memory cache
  
Demo
Hold onto something, folks.
  
What’s Next?
•  Download Hadoop!
•  CDH available at www.cloudera.com
•  Already done that? Contribute…
•  Cloudera provides pre-loaded VMs
•  http://tiny.cloudera.com/quickstartvm
•  Clone our repos!
•  https://github.com/cloudera
  
Special thanks:
PARIS
  
Questions?
Preferably related to the talk… or not.
  
Thank You!

Alex Moundalexis
@technmsg

We’re hiring, kids! Well, not kids.
  


Introduction to Cloudera Impala

  • 1. 1 Cloudera  Impala   Charm  City  Linux,  March  2014     Alex  Moundalexis       @technmsg  
  • 2. Thirty  Seconds  About  Alex   •  Solu@ons  Architect   •  aka  consultant   •  government   •  infrastructure   •  former  coder  of  Perl   •  former  administrator   •  likes  shiny  objects   2  
  • 3. What  Does  Cloudera  Do?   •  product   •  distribu@on  of  Hadoop  components,  Apache  licensed   •  enterprise  tooling   •  support   •  training   •  services  (aka  consul@ng)   •  community   3
  • 4. Disclaimer   •  Cloudera  builds  things  soMware   •  most  donated  to  Apache   •  some  closed-­‐source   •  Cloudera  “products”  I  reference  are  open  source   •  Apache  Licensed   •  source  code  is  on  GitHub   •  hSps://github.com/cloudera   4
  • 5. What  This  Talk  Isn’t  About   •  deploying   •  Puppet,  Chef,  Ansible,  homegrown  scripts,  intern  labor   •  sizing  &  tuning   •  depends  heavily  on  data  and  workload   •  coding   •  unless  you  count  XML  or  CSV  or  SQL   •  algorithms   5
  • 6. The Apache Hadoop Ecosystem
      Quick and dirty, for context.
  • 7. Why “Ecosystem?”
      • In the beginning, just Hadoop
      • HDFS
      • MapReduce
      • Today, dozens of interrelated components
      • I/O
      • Processing
      • Specialty Applications
      • Configuration
      • Workflow
  • 8. HDFS
      • Distributed, highly fault-tolerant filesystem
      • Optimized for large streaming access to data
      • Based on Google File System
      • http://research.google.com/archive/gfs.html
  • 9. Lots of Commodity Machines
      Image: Yahoo! Hadoop cluster [OSCON ’07]
  • 10. MapReduce (MR)
      • Programming paradigm
      • Batch oriented, not realtime
      • Works well with distributed computing
      • Lots of Java, but other languages supported
      • Based on Google’s paper
      • http://research.google.com/archive/mapreduce.html
  • 12. You specify map() and reduce() functions. The framework does the rest.
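A minimal sketch of that division of labor, assuming a word-count job. It is plain Python, not the Hadoop API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` is a single-process stand-in for the framework's shuffle and scheduling.

```python
from collections import defaultdict


def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: combine every value emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: run maps, group by key, run reduces."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["impala reads hdfs", "hive reads hdfs"])
print(result)
```

Only `map_fn` and `reduce_fn` are specific to the job; everything in `run_job` is what a real cluster would do across many machines.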
  • 13. Apache Hive
      • Abstraction of Hadoop’s Java API
      • HiveQL “compiles” down to MR
      • a “SQL-like” language
      • Eases analysis using MapReduce
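To make "compiles down to MR" concrete, here is a rough sketch of how a GROUP BY query lines up with a map step and a reduce step. This is not Hive's planner. SQLite stands in for the SQL engine, and the `logs` table, its columns, and its rows are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (method, status). Not from the talk.
rows = [("GET", 200), ("GET", 404), ("POST", 200), ("GET", 200)]

# Declarative version: what an analyst would type.
QUERY = "SELECT status, COUNT(*) FROM logs GROUP BY status ORDER BY status"
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (method TEXT, status INTEGER)")
db.executemany("INSERT INTO logs VALUES (?, ?)", rows)
sql_result = db.execute(QUERY).fetchall()


def map_fn(row):
    """The GROUP BY column becomes the map output key."""
    _method, status = row
    yield status, 1


def reduce_fn(status, ones):
    """COUNT(*) becomes a sum over the values collected for each key."""
    return status, sum(ones)


# Shuffle: gather map output by key, then reduce each group.
shuffled = defaultdict(list)
for row in rows:
    for key, value in map_fn(row):
        shuffled[key].append(value)
mr_result = [reduce_fn(key, values) for key, values in sorted(shuffled.items())]

print(sql_result)
print(mr_result)
```

Both paths produce the same answer; the SQL version is one line, which is the ease-of-analysis point the slide makes.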
  • 14. Apache Hive Metastore
      • Maps HDFS files to DB-like resources
      • Databases
      • Tables
      • Column/field names, data types
      • Roles/users
      • InputFormat/OutputFormat
  • 15. But wait… WHY DO WE NEED THIS?
  • 16.
  • 17. Super Shady SQL Supplement
      I am not a SQL wizard by any means…
  • 18. A Simple Relational Database
      name    state      employer   year
      Alex    Maryland   Cloudera   2013
      Joey    Maryland   Cloudera   2011
      Sean    Texas      Cloudera   2013
      Paris   Maryland   AOL        2011
  • 19-20. Interacting with Relational Data
      name    state      employer   year
      Alex    Maryland   Cloudera   2013
      Joey    Maryland   Cloudera   2011
      Sean    Texas      Cloudera   2013
      Paris   Maryland   AOL        2011

      SELECT * FROM people;
  • 21-22. Requesting Specific Fields
      name    state      employer   year
      Alex    Maryland   Cloudera   2013
      Joey    Maryland   Cloudera   2011
      Sean    Texas      Cloudera   2013
      Paris   Maryland   AOL        2011

      SELECT name, state FROM people;
  • 23-24. Requesting Specific Rows
      name    state      employer   year
      Alex    Maryland   Cloudera   2013
      Joey    Maryland   Cloudera   2011
      Sean    Texas      Cloudera   2013
      Paris   Maryland   AOL        2011

      SELECT name, state FROM people WHERE year  2012;
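The three queries above can be tried without a cluster. The sketch below loads the slide's table into an in-memory SQLite database, which is an assumption of convenience; the talk targets Hive and Impala. The comparison operator in the WHERE clause did not survive in the slide text, so `>` is used here purely as a guess.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)")
db.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column, every row.
all_rows = db.execute("SELECT * FROM people").fetchall()

# Only the requested fields.
name_state = db.execute("SELECT name, state FROM people").fetchall()

# Only the requested rows. ASSUMPTION: '>' replaces the operator that is
# missing from the slide text; '<' or '=' would be just as plausible.
recent = db.execute("SELECT name, state FROM people WHERE year > 2012").fetchall()

print(len(all_rows), "rows,", len(all_rows[0]), "columns")
print(sorted(name_state))
print(sorted(recent))
```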
  • 25. Two Simple Tables
      owner   species   name
      Alex    Cactus    Marvin
      Joey    Cat       Brain
      Sean    None
      Paris   Unknown

      name    state      employer   year
      Alex    Maryland   Cloudera   2013
      Joey    Maryland   Cloudera   2011
      Sean    Texas      Cloudera   2013
      Paris   Maryland   AOL        2011
  • 26-28. Joining Two Tables
      owner   species   name
      Alex    Cactus    Marvin
      Joey    Cat       Brain
      Sean    None
      Paris   Unknown

      name    state      employer   year
      Alex    Maryland   Cloudera   2013
      Joey    Maryland   Cloudera   2011
      Sean    Texas      Cloudera   2013
      Paris   Maryland   AOL        2011

      SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner
  • 29. Joining Two Tables
      SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner

      owner   state      pet
      Alex    Maryland   Marvin
      Joey    Maryland   Brain
      Sean    Texas
      Paris   Maryland
  • 30. Varying Implementation of JOIN
      SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner

      owner   state      pet
      Alex    Maryland   Marvin
      Joey    Maryland   Brain
      Sean    Texas      ?
      Paris   Maryland   ?
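The join can be reproduced the same way. This sketch again assumes SQLite as a stand-in engine and stores the two blank pet names as NULL, which is one reading of the slide; an empty string would fit the slide equally well. How the missing value comes back (the "?" above) depends on the engine and client, and Python's `sqlite3` reports it as `None`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)")
db.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
db.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
db.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Alex", "Cactus", "Marvin"),
        ("Joey", "Cat", "Brain"),
        ("Sean", "None", None),      # ASSUMPTION: blank pet name stored as NULL
        ("Paris", "Unknown", None),  # ASSUMPTION: blank pet name stored as NULL
    ],
)

# The query from the slides, unchanged.
joined = db.execute(
    "SELECT people.name AS owner, people.state AS state, pets.name AS pet "
    "FROM people LEFT JOIN pets ON people.name = pets.owner"
).fetchall()

for owner, state, pet in sorted(joined, key=lambda row: row[0]):
    print(owner, state, pet)
```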
  • 31. Cloudera Impala
      Familiar interface, but more powerful.
  • 32. Cloudera Impala
      • Interactive query on Hadoop
      • think seconds, not minutes
      • Nearly ANSI-92 standard SQL
      • compatible with HiveQL
      • Native MPP query engine
      • built for low-latency queries
  • 33. Cloudera Impala – Design Choices
      • Native daemons, written in C/C++
      • No JVM, no MapReduce
      • Saturate disks on reads
      • Uses in-memory HDFS caching
      • Re-uses Hive metastore
      • Not as fault-tolerant as MapReduce
  • 34. Cloudera Impala – Architecture
      • Impala Daemon
      • runs on every node
      • handles client requests
      • handles query planning & execution
      • State Store Daemon
      • provides name service
      • metadata distribution
      • used for finding data
  • 35. Impala Query Execution
      [Diagram components: SQL App / ODBC; Hive Metastore; HDFS NN; Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase. Arrow label: SQL request]
      1) Request arrives via ODBC/JDBC/HUE/Shell
  • 36. Impala Query Execution
      [Same diagram]
      2) Planner turns request into collections of plan fragments
      3) Coordinator initiates execution on impalad(s) local to data
  • 37. Impala Query Execution
      [Same diagram. Arrow label: Query results]
      4) Intermediate results are streamed between impalad(s)
      5) Query results are streamed back to client
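The five steps can be imitated with a toy scatter-gather loop. This is a sketch of the general pattern, not Impala's implementation: the node names, the `BLOCKS` layout, and all three functions are hypothetical, and the loop runs sequentially where real daemons work in parallel and stream results.

```python
from collections import Counter

# Hypothetical cluster: node name -> rows of a table stored on that node.
BLOCKS = {
    "node1": [{"name": "Alex", "state": "Maryland"}, {"name": "Joey", "state": "Maryland"}],
    "node2": [{"name": "Sean", "state": "Texas"}],
    "node3": [{"name": "Paris", "state": "Maryland"}],
}


def plan(group_by):
    """Step 2: turn the request into one plan fragment per node holding data."""
    return [{"node": node, "group_by": group_by} for node in BLOCKS]


def execute_fragment(fragment):
    """Step 3: a node scans only its local rows and produces a partial count."""
    local_rows = BLOCKS[fragment["node"]]
    return Counter(row[fragment["group_by"]] for row in local_rows)


def coordinate(group_by):
    """Steps 1, 4, 5: accept the request, merge partial results, return them."""
    total = Counter()
    for fragment in plan(group_by):
        total += execute_fragment(fragment)
    return dict(total)


# Roughly: SELECT state, COUNT(*) FROM people GROUP BY state
result = coordinate("state")
print(result)
```

The point of the shape is that each fragment touches only data local to its node, so only small partial results cross the network.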
  • 38. Cloudera Impala – Results
      • Allows for fast iteration/discovery
      • How much faster?
      • 3-4x faster on I/O bound workloads
      • up to 45x faster on multi-MR queries
      • up to 90x faster on in-memory cache
  • 39. Demo
      Hold onto something, folks.
  • 40. What’s Next?
      • Download Hadoop!
      • CDH available at www.cloudera.com
      • Already done that? Contribute…
      • Cloudera provides pre-loaded VMs
      • http://tiny.cloudera.com/quickstartvm
      • Clone our repos!
      • https://github.com/cloudera
  • 42. Questions?
      Preferably related to the talk… or not.
  • 43. Thank You!
      Alex Moundalexis, @technmsg
      We’re hiring, kids! Well, not kids.