Hive Anatomy



Data Infrastructure Team, Facebook
Part of Apache Hadoop Hive Project
Overview

•  Conceptual level architecture
•  (Pseudo-)code level architecture
   •  Parser
   •  Semantic analyzer
   •  Execution
•  Example: adding a new Semijoin Operator
Conceptual Level Architecture

•  Hive Components:
   –  Parser (antlr): HiveQL -> Abstract Syntax Tree (AST)
   –  Semantic Analyzer: AST -> DAG of MapReduce Tasks
      •  Logical Plan Generator: AST -> operator trees
      •  Optimizer (logical rewrite): operator trees -> operator trees
      •  Physical Plan Generator: operator trees -> MapReduce Tasks
   –  Execution Libraries:
      •  Operator implementations, UDF/UDAF/UDTF
      •  SerDe & ObjectInspector, Metastore
      •  FileFormat & RecordReader
Hive User/Application Interfaces

[Architecture diagram: client entry points — CliDriver (invoked by the Hive shell script), HiPal*, and HiveServer (ODBC, JDBCDriver inheriting Driver) — layered on top of Driver]

*HiPal is a Web-based Hive client developed internally at Facebook.
Parser

[Diagram: Driver invokes ParseDriver]

•  ANTLR is a parser generator.
•  $HIVE_SRC/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g
•  Hive.g defines keywords, tokens, and translations from HiveQL to the AST (ASTNode.java)
•  Every time Hive.g is changed, you need to ‘ant clean’ first and rebuild using ‘ant package’
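To see the parser in isolation, a minimal sketch (ParseDriver and ASTNode are the classes named above; the query string and surrounding scaffolding are illustrative assumptions):

    import org.apache.hadoop.hive.ql.parse.ASTNode;
    import org.apache.hadoop.hive.ql.parse.ParseDriver;
    import org.apache.hadoop.hive.ql.parse.ParseException;

    public class ParseSketch {
      public static void main(String[] args) throws ParseException {
        ParseDriver pd = new ParseDriver();
        // HiveQL text in, AST out -- the same entry point the Driver uses.
        ASTNode ast = pd.parse("SELECT key, COUNT(1) FROM src GROUP BY key");
        // ASTNode extends ANTLR's CommonTree, so the tree can be printed directly.
        System.out.println(ast.toStringTree());
      }
    }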
Semantic Analyzer

[Diagram: Driver invokes BaseSemanticAnalyzer]

•  BaseSemanticAnalyzer is the base class for DDLSemanticAnalyzer and SemanticAnalyzer
   –  SemanticAnalyzer handles queries, DML, and some DDL (create-table)
   –  DDLSemanticAnalyzer handles alter table etc.
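The Driver obtains the right analyzer through a factory keyed on the AST's root token. A minimal sketch (SemanticAnalyzerFactory is the real dispatch point; exact constructor signatures vary across Hive versions, so treat the setup as an assumption):

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.ql.Context;
    import org.apache.hadoop.hive.ql.parse.ASTNode;
    import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
    import org.apache.hadoop.hive.ql.parse.ParseDriver;
    import org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory;

    public class AnalyzeSketch {
      public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        ASTNode ast = new ParseDriver().parse("SELECT * FROM src");
        // Queries/DML get SemanticAnalyzer; ALTER TABLE etc. get DDLSemanticAnalyzer.
        BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, ast);
        sem.analyze(ast, new Context(conf));
      }
    }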
Logical Plan Generation

•  SemanticAnalyzer.analyzeInternal() is the main function
   –  doPhase1(): recursively traverses the AST, checks for semantic errors, and gathers metadata, which is put into QB and QBParseInfo.
   –  getMetaData(): queries the metastore for metadata about the data sources and puts it into QB and QBParseInfo.
   –  genPlan(): takes the QB/QBParseInfo and the AST and generates an operator tree.
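Condensed into pseudocode, the flow of analyzeInternal() looks roughly like this (a paraphrase of the steps above, not the actual Hive source):

    // Sketch of SemanticAnalyzer.analyzeInternal() -- illustrative, not verbatim.
    public void analyzeInternal(ASTNode ast) throws SemanticException {
      QB qb = new QB(null, null, false);
      // Phase 1: walk the AST, check for semantic errors, record clause info.
      doPhase1(ast, qb, initPhase1Ctx());
      // Resolve source tables/partitions against the metastore.
      getMetaData(qb);
      // Build the logical plan: an operator tree rooted at the sinks.
      Operator sinkOp = genPlan(qb);
      // ... optimization and physical plan generation follow.
    }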
Logical Plan Generation (cont.)

•  genPlan() is called recursively for each subquery (QB), and outputs the root of the operator tree.
•  For each subquery, genPlan() creates operators “bottom-up”*, starting from FROM -> WHERE -> GROUPBY -> ORDERBY -> SELECT.
•  For the FROM clause, it generates a TableScanOperator for each source table, then calls genLateralView() and genJoinPlan().

*Hive code actually names each leaf operator as “root” and its downstream operators as children.
Logical Plan Generation (cont.)

•  genBodyPlan() is then called to handle the WHERE-GROUPBY-ORDERBY-SELECT clauses.
   –  genFilterPlan() for the WHERE clause
   –  genGroupByPlanMapAgg1MR/2MR() for map-side partial aggregation
   –  genGroupByPlan1MR/2MR() for reduce-side aggregation
   –  genSelectPlan() for the SELECT clause
   –  genReduceSink() for marking the boundary between map/reduce phases
   –  genFileSink() to store intermediate results
Optimizer

•  The resulting operator tree, along with other parsing info, is stored in ParseContext and passed to the Optimizer.
•  The Optimizer is a set of Transformation rules applied to the operator tree.
•  Each transformation rule is specified by a regexp pattern on the tree plus a Worker/Dispatcher framework, as sketched below.
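A minimal sketch of that framework using the real classes from org.apache.hadoop.hive.ql.lib (the rule pattern “FIL%”, which matches FilterOperators, and the empty processor body are illustrative assumptions, not an actual rewrite):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Stack;
    import org.apache.hadoop.hive.ql.lib.*;
    import org.apache.hadoop.hive.ql.optimizer.Transform;
    import org.apache.hadoop.hive.ql.parse.ParseContext;
    import org.apache.hadoop.hive.ql.parse.SemanticException;

    // Hypothetical transformation: fires a worker on every FilterOperator.
    public class ExampleRewrite implements Transform {
      public ParseContext transform(ParseContext pctx) throws SemanticException {
        Map<Rule, NodeProcessor> rules = new LinkedHashMap<Rule, NodeProcessor>();
        rules.put(new RuleRegExp("R1", "FIL%"), new NodeProcessor() {
          public Object process(Node nd, Stack<Node> stack,
              NodeProcessorCtx ctx, Object... nodeOutputs) throws SemanticException {
            // A real rule would rewrite the matched operator subtree here.
            return null;
          }
        });
        // The dispatcher picks the worker whose rule matches the tree pattern;
        // the walker traverses the operator DAG starting from the table scans.
        Dispatcher disp = new DefaultRuleDispatcher(null, rules, null);
        GraphWalker walker = new DefaultGraphWalker(disp);
        walker.startWalking(new ArrayList<Node>(pctx.getTopOps().values()), null);
        return pctx;
      }
    }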
Optimizer (cont.)

•  Current rewrite rules include:
   –  ColumnPruner
   –  PredicatePushDown
   –  PartitionPruner
   –  GroupByOptimizer
   –  SamplePruner
   –  MapJoinProcessor
   –  UnionProcessor
   –  JoinReorder
Physical Plan Generation

•  genMapRedWorks() takes the QB/QBParseInfo and the operator tree and generates a DAG of MapReduce tasks.
•  The generation is also based on the Worker/Dispatcher framework while traversing the operator tree.
•  Different task types: MapRedTask, ConditionalTask, FetchTask, MoveTask, DDLTask, CounterTask
•  validate() on the physical plan is called at the end of Driver.compile().
Preparing Execution

•  Driver.execute() takes the output of Driver.compile() and prepares the hadoop command line (in local mode) or calls ExecDriver.execute() (in remote mode). It will:
   –  Start a session
   –  Execute PreExecutionHooks
   –  Create a Runnable for each Task that can be executed in parallel, and launch threads within a certain limit (see the sketch below)
   –  Monitor thread status and update the Session
   –  Execute PostExecutionHooks
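A schematic of the bounded parallel launch (plain-Java sketch; Hive's Driver actually manages its own TaskRunner threads, and Task, maxParallel, and the task bodies here are assumed names for illustration):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelLaunchSketch {
      interface Task { void execute(); }  // stand-in for Hive's Task class

      public static void main(String[] args) {
        int maxParallel = 8;  // the "certain limit" on concurrently running tasks
        List<Task> ready = new ArrayList<Task>();  // tasks whose parents have finished
        ready.add(new Task() { public void execute() { System.out.println("stage-1 MR job"); } });
        ready.add(new Task() { public void execute() { System.out.println("move task"); } });

        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        for (final Task t : ready) {
          pool.submit(new Runnable() { public void run() { t.execute(); } });
        }
        pool.shutdown();  // the Driver would monitor status and run post-hooks here
      }
    }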
Preparing Execution (cont.)

•  Hadoop jobs are started from MapRedTask.execute(), which will:
   –  Get info on all needed JAR files, with ExecDriver as the starting class
   –  Serialize the physical plan (MapRedTask) to an XML file
   –  Gather other info, such as the Hadoop version, and prepare the hadoop command line
   –  Execute the hadoop command line in a separate process
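The XML plan file relies on java.beans bean persistence, which is why plan descriptors need public setters and getters (as noted on the Operator slide later). A standalone sketch of the mechanism, using a hypothetical descriptor bean:

    import java.beans.XMLDecoder;
    import java.beans.XMLEncoder;
    import java.io.*;

    public class PlanXmlSketch {
      // Hypothetical descriptor: XMLEncoder only persists properties exposed
      // via a public no-arg constructor plus public getters and setters.
      public static class FilterDescSketch implements Serializable {
        private String predicate;
        public FilterDescSketch() {}
        public String getPredicate() { return predicate; }
        public void setPredicate(String p) { this.predicate = p; }
      }

      public static void main(String[] args) throws IOException {
        FilterDescSketch desc = new FilterDescSketch();
        desc.setPredicate("key > 100");

        File planFile = File.createTempFile("plan", ".xml");
        XMLEncoder enc = new XMLEncoder(new FileOutputStream(planFile));
        enc.writeObject(desc);  // serialize the "plan" to XML
        enc.close();

        XMLDecoder dec = new XMLDecoder(new FileInputStream(planFile));
        FilterDescSketch back = (FilterDescSketch) dec.readObject();
        dec.close();
        System.out.println(back.getPredicate());  // prints: key > 100
      }
    }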
Starting Hadoop Jobs

•  ExecDriver deserializes the plan from the XML file and calls execute().
•  execute() sets up the number of reducers, the job scratch dir, the starting mapper class (ExecMapper) and starting reducer class (ExecReducer), and other info in the JobConf, and submits the job through hadoop.mapred.JobClient.
•  The query plan is again serialized into a file and put into the DistributedCache, to be sent out to mappers/reducers before the job is started.
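The submission itself uses the classic org.apache.hadoop.mapred API. A heavily simplified sketch (ExecMapper/ExecReducer are the real Hive entry classes; the paths, reducer count, and wrapper method are assumptions):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.ExecDriver;
    import org.apache.hadoop.hive.ql.exec.ExecMapper;
    import org.apache.hadoop.hive.ql.exec.ExecReducer;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitSketch {
      static void submit(String inputDir, String scratchDir, int numReducers)
          throws Exception {
        JobConf job = new JobConf(ExecDriver.class);
        job.setMapperClass(ExecMapper.class);    // runs the map-side operator tree
        job.setReducerClass(ExecReducer.class);  // runs the reduce-side operator tree
        job.setNumReduceTasks(numReducers);      // derived from the plan
        FileInputFormat.setInputPaths(job, new Path(inputDir));
        FileOutputFormat.setOutputPath(job, new Path(scratchDir));
        JobClient.runJob(job);                   // submit and wait for completion
      }
    }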
Operator

•  ExecMapper creates a MapOperator as the parent of all root operators in the query plan and starts executing on the operator tree.
•  Each Operator class comes with a descriptor class, which carries metadata from compilation to execution.
   –  Any metadata variable that needs to be passed should have a public setter & getter in order for the XML serializer/deserializer to work.
•  The operator’s interface (see the skeleton below) contains:
   –  initialize(): called once per operator lifetime
   –  startGroup()*: called once for each group (groupby/join)
   –  process()*: called once for each input row
   –  endGroup()*: called once for each group (groupby/join)
   –  close(): called once per operator lifetime
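A skeleton shaped after that lifecycle (pseudocode: the descriptor type and method signatures are illustrative, and actual Hive operators override hooks with slightly different names, e.g. processOp()):

    // Pseudocode skeleton of a custom operator -- not the exact Hive signatures.
    public class MyFilterOperator extends Operator<MyFilterDesc> {
      public void initialize(Configuration conf) {
        // one-time setup: expression evaluators, ObjectInspectors, counters
      }
      public void startGroup() {
        // reset per-group state (groupby/join)
      }
      public void process(Object row, int tag) {
        // called once per input row; forward qualifying rows to child operators
        forward(row, outputObjInspector);
      }
      public void endGroup() {
        // flush per-group state
      }
      public void close(boolean abort) {
        // one-time teardown
      }
    }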
Example: adding a Semijoin operator

•  Left semijoin is similar to inner join, except that only the left-hand-side table is output, and there are no duplicated join values if there are duplicated keys in the RHS table.
   –  IN/EXISTS semantics
   –  SELECT * FROM S LEFT SEMI JOIN T ON S.KEY = T.KEY AND S.VALUE = T.VALUE
   –  Output all columns in table S if its (key, value) matches at least one (key, value) pair in T
Semijoin Implementation

•  Parser: adding the SEMI keyword
•  SemanticAnalyzer:
   –  doPhase1(): keep a mapping of the RHS table name and its join key columns in QBParseInfo.
   –  genJoinTree(): set the new join type in joinDesc.joinCond
   –  genJoinPlan():
      •  generate a map-side partial groupby operator right after the TableScanOperator for the RHS table. The input & output columns of the groupby operator are the RHS join keys.
Semijoin Implementation (cont.)

•  SemanticAnalyzer
   –  genJoinOperator(): generate a JoinOperator (left semi type) and set the output fields to the LHS table’s fields
•  Execution
   –  In CommonJoinOperator, implement left semi join with an early-exit optimization: as soon as the RHS side of the left semi join is non-null, return the row from the LHS table.
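The early exit, in pseudocode (illustrative only; CommonJoinOperator's real join loop is far more general):

    // Pseudocode of the left-semi-join branch inside the per-group join loop.
    for (Object lhsRow : leftRows) {
      for (Object rhsRow : rightRows) {
        if (rhsRow != null) {   // at least one RHS match in this key group
          forward(lhsRow);      // emit the LHS row exactly once...
          break;                // ...and stop scanning the RHS (early exit)
        }
      }
    }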
Debugging

•  Debugging compile-time code (Driver till ExecDriver) is relatively easy, since it runs in the JVM on the client side.
•  Debugging execution-time code (after ExecDriver calls hadoop) needs some configuration. See the wiki: http://wiki.apache.org/hadoop/Hive/DeveloperGuide#Debugging_Hive_code
Questions?
