Cloudera Impala
Portland Big Data User Group, July 2014

Alex Moundalexis
@technmsg
  
Thirty Seconds About Alex
• Solutions Architect
• aka consultant
• government
• infrastructure
• former coder of Perl
• former administrator
• fan of Portland
  	
  
What Does Cloudera Do?
• product
  • distribution of Hadoop components, Apache licensed
  • enterprise tooling
• support
• training
• services (aka consulting)
• community
  
Disclaimer
• Cloudera builds software
  • most donated to Apache
  • some closed-source
• Cloudera “products” I reference are open source
  • Apache Licensed
  • source code is on GitHub
    • https://github.com/cloudera
  
What This Talk Isn’t About
• deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
  • depends heavily on data and workload
• coding
  • unless you count XML or CSV or SQL
• algorithms
  
[Image: Public Domain, IFCAR]
[Image: CC BY-SA, Lilian De Cassai]
  
cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/

noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
  
The Apache Hadoop Ecosystem
Quick and dirty, for context.
  
Why “Ecosystem?”
• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow
  
HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html
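
To make "distributed, highly fault-tolerant" a bit more concrete, here is a minimal sketch of how a large file could be cut into blocks with several replicas each. The block size, the replication factor, the node names, and the round-robin placement are all assumptions for illustration; real HDFS placement is rack-aware and more involved than this.

```python
from itertools import cycle, islice

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MiB)
REPLICATION = 3                 # assumed replication factor


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [nodes holding a replica])."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = []
    for block in range(num_blocks):
        # Toy round-robin placement: start at a different node for each block.
        start = block % len(nodes)
        replicas = list(islice(cycle(nodes), start, start + replication))
        placement.append((block, replicas))
    return placement


def blocks_lost(placement, failed_node):
    """Blocks that would have no surviving replica if one node failed."""
    return [b for b, replicas in placement if set(replicas) <= {failed_node}]
```

With four hypothetical nodes and a 300 MiB file, `place_blocks` yields three blocks, and `blocks_lost` is empty for any single failed node because every block lives on three different machines.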
  
Lots of Commodity Machines
[Image: Yahoo! Hadoop cluster, OSCON ’07]

MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html
  
Under the Covers

You specify map() and reduce() functions. The framework does the rest.
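
The division of labor can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` stands in for everything the framework does (running mappers, grouping by key, running reducers).

```python
from collections import defaultdict


def map_fn(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """reduce(): combine all values emitted for one key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """The 'framework': run mappers, group by key (shuffle), run reducers."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))
```

Calling `run_job(["big data", "Big Hadoop data"], map_fn, reduce_fn)` returns `{"big": 2, "data": 2, "hadoop": 1}`. On a real cluster the same two functions would run on many machines, with the grouping step moving data between them.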

Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
  • a “SQL-like” language
• Eases analysis using MapReduce
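
As a rough illustration of what "compiles down to MR" means, the sketch below hand-translates one SQL-like aggregate into a map step and a reduce step. It is a conceptual toy under stated assumptions, not the plan Hive actually generates; the function names are hypothetical, and the rows are the example `people` table used later in the talk.

```python
from collections import defaultdict

# Rows of the talk's example table: (name, state, employer, year).
PEOPLE = [
    ("Alex", "Maryland", "Cloudera", 2013),
    ("Joey", "Maryland", "Cloudera", 2011),
    ("Sean", "Texas", "Cloudera", 2013),
    ("Paris", "Maryland", "AOL", 2011),
]

# Query being emulated (shown only as a comment):
#   SELECT state, COUNT(*) FROM people GROUP BY state


def map_row(row):
    """Map phase: the GROUP BY column becomes the key."""
    _name, state, _employer, _year = row
    return state, 1


def reduce_group(state, ones):
    """Reduce phase: COUNT(*) is the sum of the emitted ones."""
    return state, sum(ones)


def group_by_count(rows):
    """Run map, group values by key, then reduce each group."""
    shuffled = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        shuffled[key].append(value)
    return sorted(reduce_group(k, v) for k, v in shuffled.items())
```

`group_by_count(PEOPLE)` returns `[("Maryland", 3), ("Texas", 1)]`. The point is only that a declarative query can be mechanically rewritten as batch map and reduce work, which is also why it inherits MapReduce's latency.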
  
Apache Hive Metastore
• Maps HDFS files to DB-like resources
  • Databases
  • Tables
  • Column/field names, data types
  • Roles/users
  • InputFormat/OutputFormat
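
One way to picture that mapping: the files stay as raw delimited text, and a small catalog entry says how to read them as a table. The sketch below is an assumption-laden stand-in, not the metastore's real schema or API; `TableMeta`, `Column`, the path, and the tab delimiter are all hypothetical, and an in-memory string plays the role of an HDFS file.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    py_type: type  # Python type standing in for a SQL data type


@dataclass
class TableMeta:
    database: str
    table: str
    location: str   # where the files live (an HDFS path in practice)
    delimiter: str  # crude stand-in for an InputFormat
    columns: list


def read_table(meta, raw_file):
    """Apply the table metadata to raw delimited text, yielding typed rows."""
    for fields in csv.reader(raw_file, delimiter=meta.delimiter):
        yield {col.name: col.py_type(value) for col, value in zip(meta.columns, fields)}


# Hypothetical catalog entry; the path and names are made up for illustration.
PEOPLE_META = TableMeta(
    database="default",
    table="people",
    location="/user/demo/people/",
    delimiter="\t",
    columns=[Column("name", str), Column("state", str),
             Column("employer", str), Column("year", int)],
)


def load_sample():
    """Read two sample lines as if they were a file under PEOPLE_META.location."""
    raw = io.StringIO("Alex\tMaryland\tCloudera\t2013\nJoey\tMaryland\tCloudera\t2011\n")
    return list(read_table(PEOPLE_META, raw))
```

`load_sample()` returns dictionaries with named, typed fields (for example `year` comes back as the integer `2013`), even though the underlying data is just tab-separated text.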
  
But wait…
WHY DO WE NEED THIS?
  
Super Shady SQL Supplement
I am not a SQL wizard by any means…
  
A Simple Relational Database

| name  | state    | employer | year |
|-------|----------|----------|------|
| Alex  | Maryland | Cloudera | 2013 |
| Joey  | Maryland | Cloudera | 2011 |
| Sean  | Texas    | Cloudera | 2013 |
| Paris | Maryland | AOL      | 2011 |
  
Interacting with Relational Data

SELECT * FROM people;
  
Requesting Specific Fields

SELECT name, state FROM people;
  
Requesting Specific Rows

SELECT name, state FROM people WHERE year  2012;
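
The three queries so far can be tried without a cluster. The sketch below loads the example table into an in-memory SQLite database using Python's standard `sqlite3` module; SQLite is only a convenient stand-in here, not Impala. The comparison operator in the row filter did not survive in the slide text, so the `>` used below is an assumption.

```python
import sqlite3


def make_people_db():
    """Create an in-memory SQLite database holding the talk's people table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            ("Alex", "Maryland", "Cloudera", 2013),
            ("Joey", "Maryland", "Cloudera", 2011),
            ("Sean", "Texas", "Cloudera", 2013),
            ("Paris", "Maryland", "AOL", 2011),
        ],
    )
    return conn


conn = make_people_db()

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Specific fields only.
names_states = conn.execute("SELECT name, state FROM people").fetchall()

# Specific rows; the ">" operator is an assumption, not taken from the slide.
recent = conn.execute(
    "SELECT name, state FROM people WHERE year > 2012"
).fetchall()
```

Under that assumption, `recent` holds the rows for Alex and Sean. Row order is not guaranteed without an `ORDER BY`, so sort the results before comparing them.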
  
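Two Simple Tables

pets:

| owner | species | name   |
|-------|---------|--------|
| Alex  | Cactus  | Marvin |
| Joey  | Cat     | Brain  |
| Sean  | None    |        |
| Paris | Unknown |        |

people:

| name  | state    | employer | year |
|-------|----------|----------|------|
| Alex  | Maryland | Cloudera | 2013 |
| Joey  | Maryland | Cloudera | 2011 |
| Sean  | Texas    | Cloudera | 2013 |
| Paris | Maryland | AOL      | 2011 |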
  
Joining Two Tables

SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner

| owner | state    | pet    |
|-------|----------|--------|
| Alex  | Maryland | Marvin |
| Joey  | Maryland | Brain  |
| Sean  | Texas    |        |
| Paris | Maryland |        |
  
Varying Implementation of JOIN

| owner | state    | pet    |
|-------|----------|--------|
| Alex  | Maryland | Marvin |
| Joey  | Maryland | Brain  |
| Sean  | Texas    | ?      |
| Paris | Maryland | ?      |
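
Here is the join run against SQLite via Python's standard `sqlite3` module, as a sketch rather than a statement about Impala's behavior. Two assumptions: the empty pet names are stored as NULL, and an `ORDER BY` is added so the output order is deterministic. One reading of the "?" cells is that what you see for a missing value depends on the engine and client; in this setup it comes back as `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER);
    CREATE TABLE pets (owner TEXT, species TEXT, name TEXT);
    INSERT INTO people VALUES
        ('Alex', 'Maryland', 'Cloudera', 2013),
        ('Joey', 'Maryland', 'Cloudera', 2011),
        ('Sean', 'Texas', 'Cloudera', 2013),
        ('Paris', 'Maryland', 'AOL', 2011);
    -- Assumption: the blank pet names in the slide are stored as NULL.
    INSERT INTO pets VALUES
        ('Alex', 'Cactus', 'Marvin'),
        ('Joey', 'Cat', 'Brain'),
        ('Sean', 'None', NULL),
        ('Paris', 'Unknown', NULL);
""")

# The query from the slide, plus ORDER BY for a stable row order.
JOIN_SQL = """
    SELECT people.name AS owner, people.state AS state, pets.name AS pet
    FROM people LEFT JOIN pets ON people.name = pets.owner
    ORDER BY people.name
"""
joined = conn.execute(JOIN_SQL).fetchall()

# Same join, but with an explicit placeholder for missing pet names.
shown = conn.execute("""
    SELECT people.name, COALESCE(pets.name, '?')
    FROM people LEFT JOIN pets ON people.name = pets.owner
    ORDER BY people.name
""").fetchall()
```

`joined` ends with `("Paris", "Maryland", None)` and `("Sean", "Texas", None)`, while `shown` replaces those missing names with `"?"`. Making the placeholder explicit with `COALESCE` is one way to avoid depending on how a particular client renders NULL.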
  
Cloudera Impala
Familiar interface, but more powerful.
  
Cloudera Impala
• Interactive query on Hadoop
  • think seconds, not minutes
• Nearly ANSI-92 standard SQL
  • compatible with HiveQL
• Native MPP query engine
  • built for low-latency queries
  
Cloudera Impala – Design Choices
• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce
  
Cloudera Impala – Architecture
• Impala Daemon
  • runs on every node
  • handles client requests
  • handles query planning and execution
• State Store Daemon
  • provides name service
  • metadata distribution
  • used for finding data
  
Impala Query Execution

[Diagram: SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase. Arrows: SQL request, query results.]

1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
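
The five steps can be mimicked with a single-process toy. Nothing below is Impala code or its API: the node names, the data layout, and the functions `plan`, `execute_fragment`, and `coordinate` are hypothetical, and a plain loop stands in for network streaming. It only shows the shape of the flow, with one fragment per node, each working on its local data, and partial results merged by a coordinator.

```python
from collections import Counter

# Hypothetical cluster: each node holds a local slice of (name, state) rows.
NODE_DATA = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}

COLUMN_INDEX = {"name": 0, "state": 1}


def plan(column):
    """Step 2: turn the request into one count-by-column fragment per node."""
    return [{"node": node, "column": column} for node in NODE_DATA]


def execute_fragment(fragment):
    """Steps 3-4: a node scans only its local rows and emits a partial count."""
    index = COLUMN_INDEX[fragment["column"]]
    return Counter(row[index] for row in NODE_DATA[fragment["node"]])


def coordinate(column):
    """Steps 3-5: dispatch fragments, merge partial results, return to client."""
    total = Counter()
    for fragment in plan(column):
        total += execute_fragment(fragment)  # stand-in for streamed results
    return dict(total)
```

`coordinate("state")` returns `{"Maryland": 3, "Texas": 1}`. Each node only ever touches its own rows, which is the property step 3 ("local to data") is after.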
  
Cloudera Impala – Results
• Allows for fast iteration/discovery
• How much faster?
  • 3-4x faster on I/O bound workloads
  • up to 45x faster on multi-MR queries
  • up to 90x faster on in-memory cache
  
Demo
Hold onto something, folks.
  
What’s Next?
• Download Hadoop!
  • CDH available at www.cloudera.com
• Already done that? Contribute…
• Cloudera provides pre-loaded VMs
  • http://tiny.cloudera.com/quickstartvm
• Clone our repos!
  • https://github.com/cloudera
  
Special thanks:
PORTLAND
  
Questions?
Preferably related to the talk… or not.
  
Thank You!
Alex Moundalexis
@technmsg

We’re hiring, kids! Well, not kids.
  

