Introduction to Apache Hive (and HCatalog)
Mark Grover
github.com/markgrover/nyc-hug-hive
Me!
•  Contributor to Apache Hive
•  Section Author of O’Reilly’s Programming Hive book
•  Software Developer at Cloudera
•  @mark_grover
•  mgrover@cloudera.com
•  https://github.com/markgrover/nyc-hug-hive
  
Agenda
•  What is Hive?
•  Why use Hive?
•  Hive features
•  Hive architecture
•  HCatalog
•  Demo!
  
Preamble
•  This is a remote talk
•  Feel free to ask questions any time!
  
Hive
•  Data warehouse system for Hadoop
•  Enables Extract/Transform/Load (ETL)
•  Associate structure with a variety of data formats
•  Logical Table -> Physical Location
•  Logical Table -> Physical Data Format Handler (SerDe)
•  Integrates with HDFS, HBase, MongoDB, etc.
•  Query execution in MapReduce
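The mapping from a logical table to a physical location and a data format handler can be pictured with a small Python sketch. This is only an illustration of the idea, not Hive's implementation: the `LogicalTable` class, the `delimited_deserializer` helper, the table name, and the path are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def delimited_deserializer(columns: List[str], delimiter: str) -> Callable[[str], Dict[str, str]]:
    """Hypothetical stand-in for a SerDe: turns one raw line into named fields."""
    def deserialize(raw_line: str) -> Dict[str, str]:
        values = raw_line.rstrip("\n").split(delimiter)
        return dict(zip(columns, values))
    return deserialize


@dataclass
class LogicalTable:
    name: str
    location: str                                   # physical location (illustrative path)
    deserialize: Callable[[str], Dict[str, str]]    # physical data format handler


page_views = LogicalTable(
    name="page_views",
    location="/data/page_views",
    deserialize=delimited_deserializer(["user_id", "url"], "\t"),
)

# Structure is applied when the raw lines are read, not when they are stored.
raw_lines = ["42\t/home", "7\t/about"]
rows = [page_views.deserialize(line) for line in raw_lines]
print(rows)
```

The raw files stay as they are; swapping the deserializer is what changes how the same bytes are interpreted.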
  
Why use Hive?
•  MapReduce caters to developers
•  Run SQL-like queries that get compiled and run as MapReduce jobs
•  Data in Hadoop, even though generally unstructured, has some vague structure associated with it
•  Benefits of MapReduce + HDFS (Hadoop)
•  Fault tolerant
•  Robust
•  Scalable
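To give a feel for what "compiled and run as MapReduce jobs" means, here is a rough Python sketch of how a grouped count could be expressed as map, shuffle, and reduce steps. It is a conceptual illustration only and does not reproduce the plan Hive generates; the function names and sample rows are made up.

```python
from collections import defaultdict

# Conceptually similar to: SELECT url, COUNT(*) FROM page_views GROUP BY url


def map_phase(rows):
    """Emit one (key, 1) pair per input row."""
    for row in rows:
        yield row["url"], 1


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


rows = [{"url": "/home"}, {"url": "/about"}, {"url": "/home"}]
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

Writing the query is one line; writing the equivalent job by hand takes the three functions above plus the cluster plumbing.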
  
Hive features
•  Create table, create view, create index - DDL
•  Select, where clause, group by, order by, joins
•  Pluggable User Defined Functions - UDFs (e.g. from_unixtime)
•  Pluggable User Defined Aggregate Functions - UDAFs (e.g. count, avg)
•  Pluggable User Defined Table Generating Functions - UDTFs (e.g. explode)
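The three function kinds differ in how many rows go in and come out. The Python analogues below are a sketch of that difference, not Hive code: the function names are hypothetical, and the timestamp formatter assumes UTC, whereas Hive's `from_unixtime` depends on its own time zone settings.

```python
from datetime import datetime, timezone


def format_unixtime(seconds):
    """UDF analogue: one value in, one value out (UTC assumed for this sketch)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def average(values):
    """UDAF analogue: many rows in, one value out."""
    values = list(values)
    return sum(values) / len(values)


def explode(items):
    """UDTF analogue: one row in, many rows out."""
    for item in items:
        yield item


print(format_unixtime(0))
print(average([1, 2, 3]))
print(list(explode(["a", "b"])))
```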
  
Hive features
•  Pluggable custom Input/Output format
•  Pluggable Serialization Deserialization libraries (SerDes)
•  Pluggable custom map and reduce scripts
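Custom map and reduce scripts are ordinary programs that Hive streams rows through (HiveQL exposes this with a `TRANSFORM ... USING` clause), sending columns as tab-separated text on standard input and reading tab-separated rows back from standard output. A minimal sketch of such a script follows; the two-column layout (`user_id`, `url`) and the lowercasing step are assumptions chosen for illustration.

```python
import io


def transform(stdin, stdout):
    """Read tab-separated rows, lowercase the url column, write rows back out."""
    for line in stdin:
        user_id, url = line.rstrip("\n").split("\t")
        stdout.write("\t".join([user_id, url.lower()]) + "\n")


# In a real script this would be called as transform(sys.stdin, sys.stdout).
# In-memory streams are used here so the sketch can run on its own.
sample_input = io.StringIO("42\t/Home\n7\t/ABOUT\n")
output = io.StringIO()
transform(sample_input, output)
print(output.getvalue(), end="")
```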
  
What Hive does NOT support
•  OLTP workloads - low latency
•  Correlated subqueries
•  Not super performant with small amounts of data
•  How much data do you need to call it “Big Data”?
  
Other Hive features
•  Partitioning
•  Sampling
•  Bucketing
•  Various join optimizations
•  Integration with HBase and other storage handlers
•  Views – Unmaterialized
•  Complex data types – arrays, structs, maps
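Partitioning and bucketing both decide where a row physically lands. As a rough sketch: a partition typically maps to a `key=value` subdirectory under the table location, and a bucket is picked by hashing a column modulo the bucket count. The helpers below are hypothetical, and the bucket function makes the simplifying assumption that a non-negative integer key hashes to itself.

```python
def partition_path(table_location, partition_spec):
    """Build a key=value directory path for a partition (illustrative layout)."""
    parts = [f"{key}={value}" for key, value in partition_spec.items()]
    return "/".join([table_location.rstrip("/")] + parts)


def bucket_for(value, num_buckets):
    """Pick a bucket for a non-negative integer key (simplified: hash is the value)."""
    return value % num_buckets


path = partition_path("/warehouse/page_views", {"dt": "2013-01-01", "country": "US"})
bucket = bucket_for(42, 4)
print(path)
print(bucket)
```

A query that filters on `dt` only needs to read the matching directories, which is where the benefit of partitioning comes from.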
  
Connecting to Hive
•  Hive Shell
•  JDBC driver
•  ODBC driver
•  Thrift client
  
Hive metastore
•  Backed by RDBMS
•  Derby, MySQL, PostgreSQL, etc. supported
•  Default Embedded Derby
•  Not recommended for anything but a quick Proof of Concept
•  3 different modes of operation:
•  Embedded Derby (default)
•  Local
•  Remote
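One way to tell the three modes apart is by looking at two configuration properties: `hive.metastore.uris` (set when clients talk to a remote metastore service) and `javax.jdo.option.ConnectionURL` (the JDBC URL of the backing database). The Python sketch below is a simplified classification under that assumption, not a rule Hive itself applies; the `metastore_mode` helper, the host names, and the assumed default Derby URL are illustrative.

```python
# Assumed default for this sketch: an embedded Derby database in the working directory.
DEFAULT_CONNECTION_URL = "jdbc:derby:;databaseName=metastore_db;create=true"


def metastore_mode(conf):
    """Classify a configuration dict as embedded, local, or remote (simplified)."""
    if conf.get("hive.metastore.uris"):
        # Clients reach a separate metastore service over Thrift.
        return "remote"
    url = conf.get("javax.jdo.option.ConnectionURL", DEFAULT_CONNECTION_URL)
    if url.startswith("jdbc:derby:") and not url.startswith("jdbc:derby://"):
        # Derby runs inside the same process as Hive.
        return "embedded"
    # Metastore code runs in-process but the database is a standalone server.
    return "local"


print(metastore_mode({}))
print(metastore_mode({"javax.jdo.option.ConnectionURL": "jdbc:mysql://dbhost/metastore"}))
print(metastore_mode({"hive.metastore.uris": "thrift://metastore-host:9083"}))
```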
  
Hive architecture
Hive Remote Mode
Hive server
Problems with Hive Server
•  No sessions/concurrency
•  Essentially need 1 server per client
•  Security
•  Auditing/Logging
  
Hive server 2
Hive architecture
•  Compiler
•  Parser
•  Type checking
•  Semantic Analyzer
•  Plan Generation
•  Task Generation
  
Hive architecture
•  Execution Engine
•  Plan
•  Operators
•  SerDes
•  UDFs/UDAFs/UDTFs
•  Metastore
•  Stores schema of data
•  HCatalog
  
Architecture Summary
•  Use remote metastore service for sharing the metastore with HCatalog and other tools
•  Use Hive Server2 for concurrent queries
  
Hive Metastore Remote Mode
HCatalog
•  Sub-component of Hive
•  Table and storage management service
•  Public APIs and webservice wrappers for accessing metadata in the Hive metastore
•  Metastore contains information of interest to other tools (Pig, MapReduce jobs)
•  Expose that information as a REST interface
•  WebHCat: Web Server for engaging with the Hive metastore
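As a sketch of what the REST interface looks like from a client, the snippet below builds the URL a tool might request to describe a table through WebHCat. It only constructs the URL and sends nothing. The host, database, table, and user names are placeholders, and the port 50111 and the `/templeton/v1/ddl/...` path reflect WebHCat's usual defaults, which a given deployment may change.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build a WebHCat URL for describing a table (placeholder host and names)."""
    path = f"/templeton/v1/ddl/database/{quote(database)}/table/{quote(table)}"
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}{path}?{query}"


url = webhcat_table_url("webhcat.example.com", "default", "page_views", "hive_user")
print(url)
```

Because the answer comes back over plain HTTP, tools that are not written in Java can read table metadata without linking against Hive.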
  	
  
WebHCat
REST
Applications of Hive
•  Web Analytics
•  Retail
•  Healthcare
•  Spam detection
•  Data Mining
•  Ad optimization
•  ETL workloads
  
How to install Hive
•  Download Apache Hive tarball
•  Use Apache Bigtop packages
•  Use a Demo VM
  
Want to learn more about Hive?
Contact info
@mark_grover
github.com/markgrover
linkedin.com/in/grovermark
mgrover@cloudera.com

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 

Dernier (20)

Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniques
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptx
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 

Introduction to Hive and HCatalog

  • 1. Introduction to Apache Hive (and HCatalog) Mark Grover github.com/markgrover/nyc-hug-hive
  • 2. Me! • Contributor to Apache Hive • Section Author of O’Reilly’s Programming Hive book • Software Developer at Cloudera • @mark_grover • mgrover@cloudera.com • https://github.com/markgrover/nyc-hug-hive
  • 3. Agenda • What is Hive? • Why use Hive? • Hive features • Hive architecture • HCatalog • Demo!
  • 4. Preamble • This is a remote talk • Feel free to ask questions any time!
  • 5. Agenda • What is Hive? • Why use Hive? • Hive features • Hive architecture • HCatalog • Demo!
  • 6. Hive • Data warehouse system for Hadoop • Enables Extract/Transform/Load (ETL) • Associate structure with a variety of data formats • Logical Table -> Physical Location • Logical Table -> Physical Data Format Handler (SerDe) • Integrates with HDFS, HBase, MongoDB, etc. • Query execution in MapReduce
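
The two mappings on this slide (logical table to physical location, and logical table to a data format handler) can be illustrated with a small Python sketch. It is a conceptual model only, not Hive's SerDe API (real SerDes are Java classes); the `TableDef` class, the `page_views` table, its columns, and the HDFS path are all hypothetical names chosen for illustration. The only Hive-specific detail assumed is that text tables use Ctrl-A (`\x01`) as the default field delimiter.

```python
# Conceptual sketch only: this is NOT Hive's SerDe API.
# The table name, columns, and location below are hypothetical.
from dataclasses import dataclass

FIELD_DELIM = "\x01"  # Ctrl-A, the default field delimiter for Hive text tables


@dataclass
class TableDef:
    name: str
    columns: list   # (column name, Python type) pairs standing in for a schema
    location: str   # logical table -> physical location


def deserialize(line, table):
    """Turn one raw text record into a typed row, as a SerDe would on read."""
    raw = line.rstrip("\n").split(FIELD_DELIM)
    return {col: typ(val) for (col, typ), val in zip(table.columns, raw)}


def serialize(row, table):
    """Turn a typed row back into its on-disk text form, as a SerDe would on write."""
    return FIELD_DELIM.join(str(row[col]) for col, _ in table.columns)


page_views = TableDef(
    name="page_views",
    columns=[("user_id", int), ("url", str), ("ts", int)],
    location="hdfs:///warehouse/page_views",
)

row = deserialize("42\x01/index.html\x011357000000\n", page_views)
print(row)
```

The point of the sketch is that the schema lives apart from the bytes: the same files could be read with a different handler without being rewritten.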
  • 7. Agenda • What is Hive? • Why use Hive? • Hive features • Hive architecture • HCatalog • Demo!
  • 8. Why use Hive? • MapReduce is catered towards developers • Run SQL-like queries that get compiled and run as MapReduce jobs • Data in Hadoop even though generally unstructured has some vague structure associated with it • Benefits of MapReduce + HDFS (Hadoop) • Fault tolerant • Robust • Scalable
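
To make "SQL-like queries that get compiled and run as MapReduce jobs" concrete, here is a rough Python sketch of how a query such as `SELECT url, COUNT(*) FROM page_views GROUP BY url` could be expressed as map, shuffle, and reduce phases. The table, column, and function names are assumptions made for illustration, and the plan Hive actually generates is more involved than this.

```python
# Conceptual sketch: a GROUP BY count written as map / shuffle / reduce.
# This runs in a single process; it only mirrors the shape of a MapReduce job.
from collections import defaultdict


def map_phase(rows):
    """Emit (grouping key, 1) for every input row."""
    for row in rows:
        yield row["url"], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


rows = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/home"},
    {"user_id": 1, "url": "/about"},
]

counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

Writing the query in SQL spares the user from hand-coding these three phases for every aggregation.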
  • 9. Agenda • What is Hive? • Why use Hive? • Hive features • Hive architecture • HCatalog • Demo!
  • 10. Hive features • Create table, create view, create index - DDL • Select, where clause, group by, order by, joins • Pluggable User Defined Functions - UDFs (e.g. from_unixtime) • Pluggable User Defined Aggregate Functions - UDAFs (e.g. count, avg) • Pluggable User Defined Table Generating Functions - UDTFs (e.g. explode)
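
The three function kinds differ in their row cardinality: a UDF maps one row to one value, a UDAF maps many rows to one value, and a UDTF maps one row to many rows. The Python functions below are a sketch of those shapes only. They borrow the names of the slide's examples but are not Hive's implementations, and the UTC time zone in `from_unixtime` is an assumption made to keep the example deterministic.

```python
# Conceptual sketch of UDF / UDAF / UDTF shapes; not Hive's Java interfaces.
from datetime import datetime, timezone


def from_unixtime(ts):
    """UDF shape: one input value -> one output value (UTC assumed here)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def avg(values):
    """UDAF shape: many input rows -> one aggregated value."""
    values = list(values)
    return sum(values) / len(values) if values else None


def explode(array):
    """UDTF shape: one input row holding an array -> one output row per element."""
    for item in array:
        yield (item,)


print(from_unixtime(0))
print(avg([1, 2, 3]))
print(list(explode(["a", "b"])))
```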
  • 11. Hive features • Pluggable custom Input/Output format • Pluggable Serialization Deserialization libraries (SerDes) • Pluggable custom map and reduce scripts
  • 12. What Hive does NOT support • OLTP workloads - low latency • Correlated subqueries • Not super performant with small amounts of data • How much data do you need to call it “Big Data”?
  • 13. Other Hive features • Partitioning • Sampling • Bucketing • Various join optimizations • Integration with HBase and other storage handlers • Views – Unmaterialized • Complex data types – arrays, structs, maps
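
Partitioning and bucketing both decide where a row physically lands. The sketch below assumes the usual Hive layout, in which each partition is a `key=value` subdirectory under the table location and a row's bucket is a hash of the bucketing column modulo the bucket count. The helper names, the table path, and the integer-only hash are simplifications introduced for illustration.

```python
# Conceptual sketch of partition directories and bucket assignment.
# Helper names and paths are hypothetical; only integer bucketing keys are modeled.


def partition_path(table_location, partition_spec):
    """Build the directory for a partition from ordered (column, value) pairs."""
    parts = "/".join(f"{col}={val}" for col, val in partition_spec)
    return f"{table_location}/{parts}"


def bucket_for(int_key, num_buckets):
    """Pick a bucket for an integer key: non-negative hash modulo bucket count."""
    return (int_key & 0x7FFFFFFF) % num_buckets


def prune(paths, column, value):
    """Keep only the partition directories that match a filter on one column."""
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.split("/")]


paths = [
    partition_path("/warehouse/page_views", [("dt", "2013-01-01"), ("country", "US")]),
    partition_path("/warehouse/page_views", [("dt", "2013-01-02"), ("country", "US")]),
]

print(paths[0])
print(prune(paths, "dt", "2013-01-02"))
print(bucket_for(42, 8))
```

A filter on a partition column lets the engine skip whole directories, which is why partitioning by a commonly filtered column such as a date is a frequent choice.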
  • 14. Connecting to Hive • Hive Shell • JDBC driver • ODBC driver • Thrift client
  • 15. Agenda • What is Hive? • Why use Hive? • Hive features • Hive architecture • HCatalog • Demo!
  • 16. Hive metastore • Backed by RDBMS • Derby, MySQL, PostgreSQL, etc. supported • Default Embedded Derby • Not recommended for anything but a quick Proof of Concept • 3 different modes of operation: • Embedded Derby (default) • Local • Remote
  • 20. Problems with Hive Server • No sessions/concurrency • Essentially need 1 server per client • Security • Auditing/Logging
  • 22. Hive architecture • Compiler • Parser • Type checking • Semantic Analyzer • Plan Generation • Task Generation
  • 23. Hive architecture • Execution Engine • Plan • Operators • SerDes • UDFs/UDAFs/UDTFs • Metastore • Stores schema of data • HCatalog
  • 24. Architecture Summary • Use remote metastore service for sharing the metastore with HCatalog and other tools • Use Hive Server2 for concurrent queries
  • 25. Agenda • What is Hive? • Why use Hive? • Hive features • Hive architecture • HCatalog • Demo!
  • 27. HCatalog • Sub-component of Hive • Table and storage management service • Public APIs and webservice wrappers for accessing metadata in Hive metastore • Metastore contains information of interest to other tools (Pig, MapReduce jobs) • Expose that information as REST interface • WebHCat: Web Server for engaging with the Hive metastore
  • 29. Agenda • What is Hive? • Why use Hive? • Hive features • Hive architecture • HCatalog • Demo!
  • 30. Applications of Hive • Web Analytics • Retail • Healthcare • Spam detection • Data Mining • Ad optimization • ETL workloads
  • 31. How to install Hive • Download Apache Hive tarball • Use Apache Bigtop packages • Use a Demo VM
  • 32. Want to learn more about Hive?
  • 33. Contact info @mark_grover github.com/markgrover linkedin.com/in/grovermark mgrover@cloudera.com