An Introduction to Hive:
Components and Query Language


Jeff Hammerbacher
Chief Scientist and VP of Product
October 30, 2008
Hive Components
A Leaky Database
▪ Hadoop
 ▪ HDFS
 ▪ MapReduce (bundles Resource Manager and Job Scheduler)
▪ Hive
 ▪ Logical data partitioning
 ▪ Metastore (command line and web interfaces)
 ▪ Query Language
 ▪ Libraries to handle different serialization formats (SerDes)
 ▪ JDBC interface
Related Work
Glaringly Incomplete
▪ Gamma, Bubba, Volcano, etc.
▪ Google: Sawzall
▪ Yahoo: Pig
▪ IBM Research: JAQL
▪ Microsoft: SCOPE
▪ Greenplum: YAML MapReduce
▪ Aster Data: In-Database MapReduce
▪ Business.com: CloudBase
Hive Resources
▪ Facebook Mirror: http://mirror.facebook.com/facebook/hive
 ▪ Currently the best place to get the Hive distribution
▪ Wiki page: http://wiki.apache.org/hadoop/Hive
 ▪ Getting started: http://wiki.apache.org/hadoop/Hive/GettingStarted
 ▪ Query language reference: http://wiki.apache.org/hadoop/Hive/HiveQL
 ▪ Presentations: http://wiki.apache.org/hadoop/Hive/Presentations
 ▪ Roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap
▪ Mailing list: hive-users@publists.facebook.com
▪ JIRA: https://issues.apache.org/jira/browse/HADOOP/component/12312455
Running Hive
Quickstart
▪ <install Hadoop>
▪ wget http://mirror.facebook.com/facebook/hive/hadoop-0.19/dist.tar.gz
 ▪ (Replace 0.19 with 0.17 if you’re still on 0.17)
▪ tar xvzf dist.tar.gz
▪ cd dist
▪ export HADOOP=<path to bin/hadoop in your Hadoop distribution>
 ▪ Or: edit hadoop.bin.path and hadoop.conf.dir in conf/hive-default.xml
▪ bin/hive
▪ hive>
Running Hive
Configuration Details
▪ conf/hive-default.xml
 ▪ hadoop.bin.path: Points to bin/hadoop in your Hadoop installation
 ▪ hadoop.config.dir: Points to conf/ in your Hadoop installation
 ▪ hive.exec.scratchdir: HDFS directory where execution information is written
 ▪ hive.metastore.warehouse.dir: HDFS directory managed by Hive
 ▪ The rest of the properties relate to the Metastore
▪ conf/hive-log4j.properties
 ▪ Will put data into /tmp/{user.name}/hive.log by default
▪ conf/jpox.properties
 ▪ JPOX is a Java object persistence library used by the Metastore
Populating Hive
MovieLens Data
▪   <cd into your hive directory>
▪   wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
▪   tar xvzf ml-data.tar__0.gz
▪   CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime TIMESTAMP)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    ▪   The first query can take ten seconds or more, as the Metastore needs to be created
▪   To confirm our table has been created:
    ▪   SHOW TABLES;
    ▪   DESCRIBE u_data;
▪   LOAD DATA LOCAL INPATH 'ml-data/u.data'
    OVERWRITE INTO TABLE u_data;
▪   SELECT COUNT(1) FROM u_data;
    ▪   Should fire off 2 MapReduce jobs and ultimately return a count of 100,000
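In MapReduce terms, the count query above is easy to picture. A minimal sketch, not Hive's actual execution plan: mappers emit a 1 per input row, and the reduce phase sums the partial counts. The sample lines below are made up in the u.data format.

```python
# Sketch: what SELECT COUNT(1) boils down to in MapReduce terms.
rows = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]  # sample u.data lines

map_output = [1 for _ in rows]                          # map phase: emit a 1 per row
partials = [sum(map_output[:1]), sum(map_output[1:])]   # per-mapper partial sums
total = sum(partials)                                   # reduce phase: combine partials
# total == 2 for this sample; 100,000 for the full u.data file
```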
Hive Query Language
Utility Statements
▪   SHOW TABLES [table_name | table_name_pattern]

▪   DESCRIBE [EXTENDED] table_name
    [PARTITION (partition_col = partition_col_value, ...)]

▪   EXPLAIN [EXTENDED] query_statement

▪   SET [EXTENDED]

    ▪   “SET property_name=property_value” to modify a value
Hive Query Language
CREATE TABLE Syntax
▪   CREATE [EXTERNAL] TABLE table_name (col_name data_type [col_comment], ...)
    [PARTITIONED BY (col_name data_type [col_comment], ...)]
    [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
    [ROW FORMAT row_format]
    [STORED AS file_format]
    [LOCATION hdfs_path]

▪   PARTITION columns are virtual columns; they are not part of the data itself but are derived on load
▪   CLUSTERED columns are real columns, hash partitioned into num_buckets folders
▪   ROW FORMAT can be used to specify a delimited data set or a custom deserializer
▪   Use EXTERNAL with ROW FORMAT, STORED AS, and LOCATION to analyze HDFS files in place
▪   “DROP TABLE table_name” can reverse this operation
    ▪   NB: Currently, DROP TABLE will delete both data and metadata
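The CLUSTERED BY behavior can be sketched in a few lines of Python. This is an illustration of hash partitioning, not Hive's actual hash function; the bucket count and row data are made up.

```python
# Sketch of CLUSTERED BY ... INTO n BUCKETS: each row's clustering key is
# hashed modulo the bucket count, so rows sharing a key land in one bucket.
NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_for(key, num_buckets=NUM_BUCKETS):
    # Python's built-in hash stands in for Hive's column-value hash.
    return hash(key) % num_buckets

rows = [("alice", 1), ("bob", 2), ("alice", 3)]  # (userid, movieid) samples
buckets = {}
for userid, movieid in rows:
    buckets.setdefault(bucket_for(userid), []).append((userid, movieid))
# All rows with userid "alice" end up in the same bucket.
```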
Hive Query Language
CREATE TABLE Syntax, Part Two
▪   data_type: primitive_type | array_type | map_type
▪   primitive_type:
    ▪   TINYINT | INT | BIGINT | BOOLEAN | FLOAT | DOUBLE | STRING
    ▪   DATE | DATETIME | TIMESTAMP
▪   array_type: ARRAY < primitive_type >
▪   map_type: MAP < primitive_type, primitive_type >
▪   row_format:
    ▪   DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
    ▪   SERIALIZER serde_name [WITH PROPERTIES property_name=property_value,
        property_name=property_value, ...]
▪   file_format: SEQUENCEFILE | TEXTFILE
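What a DELIMITED row format describes is simple to mimic: split each line on the field terminator and cast fields to the declared types. A sketch only, not Hive's SerDe interface; the schema mirrors the u_data example and the helper name is made up.

```python
# Sketch of a DELIMITED deserializer: split on the field terminator,
# then cast each field to its declared column type.
FIELD_TERMINATOR = "\t"
SCHEMA = [("userid", int), ("movieid", int), ("rating", int), ("unixtime", int)]

def deserialize(line):
    fields = line.rstrip("\n").split(FIELD_TERMINATOR)
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

row = deserialize("196\t242\t3\t881250949")
# row == {"userid": 196, "movieid": 242, "rating": 3, "unixtime": 881250949}
```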
Hive Query Language
ALTER TABLE Syntax
▪   ALTER TABLE table_name RENAME TO new_table_name;
▪   ALTER TABLE table_name ADD COLUMNS (col_name data_type [col_comment], ...);
▪   ALTER TABLE table_name DROP partition_spec, partition_spec, ...;


▪   Future work:
     ▪   Support for removing or renaming columns
     ▪   Support for altering serialization format
Hive Query Language
LOAD DATA Syntax
▪   LOAD DATA [LOCAL] INPATH '/path/to/file'
    [OVERWRITE] INTO TABLE table_name
    [PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)]

▪   You can load data from the local filesystem or anywhere in HDFS (cf. CREATE TABLE EXTERNAL)

▪   If you don’t specify OVERWRITE, data will be appended to the existing table
Hive Query Language
SELECT Syntax
▪   [insert_clause]
    SELECT [ALL|DISTINCT] select_list
    FROM [table_source|join_source]
    [WHERE where_condition]
    [GROUP BY col_list]
    [ORDER BY col_list]
    [CLUSTER BY col_list]

▪   insert_clause: INSERT OVERWRITE destination

▪   destination:

    ▪   LOCAL DIRECTORY '/local/path'
    ▪   DIRECTORY '/hdfs/path'
    ▪   TABLE table_name [PARTITION (partition_col = partition_col_value, ...)]
Hive Query Language
SELECT Syntax
▪   join_source: table_source join_clause table_source join_clause table_source ...

▪   join_clause

    ▪   [LEFT OUTER|RIGHT OUTER|FULL OUTER] JOIN ON (equality_expression, equality_expression, ...)



▪   Currently, only outer equi-joins are supported in Hive.

▪   There are two join algorithms

    ▪   Map-side merge join

    ▪   Reduce-side merge join
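The merge-join idea can be sketched in plain Python: both inputs arrive at the reducer sorted by join key, so matching rows are paired in a single pass. This shows an inner equi-join for simplicity (with unique keys on the left side); the table contents are made up.

```python
# Sketch of a sort-merge equi-join: advance whichever side has the
# smaller key; on a match, emit the pairing of rows sharing that key.
def merge_join(left, right):
    """Inner equi-join of two lists of (key, value), each sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all right-side rows sharing this key with the left row.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

users = [(1, "alice"), (2, "bob")]       # sorted by userid
ratings = [(1, 5), (1, 3), (3, 4)]       # sorted by userid
# merge_join(users, ratings) == [(1, "alice", 5), (1, "alice", 3)]
```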
Hive Query Language
Building a Histogram of Review Counts
▪   CREATE TABLE review_counts (userid INT, review_count INT);
▪   INSERT OVERWRITE TABLE review_counts
    SELECT a.userid, COUNT(1) AS review_count
    FROM u_data a
    GROUP BY a.userid;
▪   SELECT b.review_count, COUNT(1)
    FROM review_counts b
    GROUP BY b.review_count;
▪   Notes:
    ▪   No INSERT OVERWRITE for second query means output is dumped to the shell
    ▪   Hive does not currently support CREATE TABLE AS
        ▪   We have to create the table and then INSERT into it
    ▪   Hive does not currently support subqueries
        ▪   We have to write two queries
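The pair of queries above computes a two-level count. A plain-Python sketch of the same logic (with made-up sample rows): first count reviews per user, then count how many users share each review count.

```python
# Sketch of the two-query histogram: GROUP BY userid, then GROUP BY count.
from collections import Counter

ratings = [(196, 242), (196, 302), (186, 377)]  # sample (userid, movieid) rows

review_counts = Counter(userid for userid, _ in ratings)  # first query
histogram = Counter(review_counts.values())               # second query
# histogram == Counter({1: 1, 2: 1}): one user with 1 review, one with 2
```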
Hive Query Language
Running Custom MapReduce
▪   Put the following into weekday_mapper.py:
    ▪   import sys
        import datetime

        # Read tab-delimited rows from stdin, emit comma-delimited rows
        # with unixtime replaced by the day of the week
        for line in sys.stdin:
         line = line.strip()
         userid, movieid, rating, unixtime = line.split('\t')
         weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
         print(','.join([userid, movieid, rating, str(weekday)]))
▪   CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
▪   FROM u_data a
    INSERT OVERWRITE TABLE u_data_new
    SELECT
     TRANSFORM (a.userid, a.movieid, a.rating, a.unixtime)
     AS (userid, movieid, rating, weekday)
     USING 'python /full/path/to/weekday_mapper.py'
Hive Query Language
Programmatic Access
▪ The Hive shell can take a file with queries to be executed
▪ bin/hive -f /path/to/query/file

▪ You can also run a Hive query straight from the command line
▪ bin/hive -e 'quoted query string'

▪ A simple JDBC interface is available for experimentation as well
▪ https://issues.apache.org/jira/browse/HADOOP-4101
Hive Components
Metastore
▪ Currently uses an embedded Derby database for persistence
▪ While Derby is in place, you’ll need to put it into Server Mode to
  have more than one concurrent Hive user
▪ See http://wiki.apache.org/hadoop/HiveDerbyServerMode
▪ Next release will use MySQL as the default persistent data store
▪ The goal is to have the persistent store be pluggable
▪ You can view the Thrift IDL for the metastore online
▪   https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/hive/metastore/if/hive_metastore.thrift
Hive Components
Query Processing
▪ Compiler
 ▪ Parser
 ▪ Type Checking
 ▪ Semantic Analysis
 ▪ Plan Generation
 ▪ Task Generation
▪ Execution Engine
 ▪ Plan
 ▪ Operators
 ▪ UDFs and UDAFs
Future Directions
▪   Query Optimization
    ▪   Support for Statistics
        ▪   These stats are needed to make optimization decisions
    ▪   Join Optimizations
        ▪   Map-side joins, semi join techniques etc to do the join faster
    ▪   Predicate Pushdown Optimizations
        ▪   Pushing predicates just above the table scan for certain situations in joins as well as ensuring that
            only required columns are sent across map/reduce boundaries
    ▪   Group By Optimizations
        ▪   Various optimizations to make group by faster
    ▪   Optimizations to reduce the number of map files created by filter operations
        ▪   Filters run with a large number of mappers produce a lot of files, which slows down subsequent
            operations.
Future Directions
▪   MapReduce Integration
    ▪   Schema-less MapReduce
        ▪   TRANSFORM needs a schema while MapReduce is schema-less.
    ▪   Improvements to TRANSFORM
        ▪   Make this more intuitive to MapReduce developers - evaluate some other keywords, etc.


▪   User Experience
    ▪   Create a web interface
    ▪   Error reporting improvements for parse errors
    ▪   Add “help” command to the CLI
    ▪   JDBC driver to enable traditional database tools to be used with Hive
Future Directions
▪   Integrating Dynamic SerDe with the DDL
    ▪   This allows the users to create typed tables along with list and map types from the DDL


▪   Transformations in LOAD DATA
    ▪   LOAD DATA currently does not transform the input data if it is not in the format expected by the
        destination table.


▪   Explode and Collect Operators
    ▪   Explode and collect operators to convert collections to individual items and vice versa.


▪   Propagating sort properties to destination tables
    ▪   If the query produces sorted output, we want to capture that in the destination table's metadata so that
        downstream optimizations can be enabled.
(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0

More Related Content

What's hot

Hive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 FacebookHive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 FacebookZheng Shao
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveZheng Shao
 
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainYahoo Developer Network
 
report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivesiddharthboora
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Takrim Ul Islam Laskar
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperDataWorks Summit
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 

What's hot (18)

Hive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 FacebookHive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 Facebook
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
 
report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hive
 
Advanced topics in hive
Advanced topics in hiveAdvanced topics in hive
Advanced topics in hive
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 

Similar to 20081030linkedin

Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & ZingLong Dao
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewDan Morrill
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
Build tons of multi-device JavaScript applications - Part 1 : Boilerplate, de...
Build tons of multi-device JavaScript applications - Part 1 : Boilerplate, de...Build tons of multi-device JavaScript applications - Part 1 : Boilerplate, de...
Build tons of multi-device JavaScript applications - Part 1 : Boilerplate, de...Skilld
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Custom post-framworks
Custom post-framworksCustom post-framworks
Custom post-framworkswcto2017
 
Custom post-framworks
Custom post-framworksCustom post-framworks
Custom post-framworksKiera Howe
 
DBIx-DataModel v2.0 in detail
DBIx-DataModel v2.0 in detail DBIx-DataModel v2.0 in detail
DBIx-DataModel v2.0 in detail Laurent Dami
 
Survey of Front End Topics in Rails
Survey of Front End Topics in RailsSurvey of Front End Topics in Rails
Survey of Front End Topics in RailsBenjamin Vandgrift
 
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)moai kids
 
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAGetting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAJISC GECO
 
Web applications with Catalyst
Web applications with CatalystWeb applications with Catalyst
Web applications with Catalystsvilen.ivanov
 

Similar to 20081030linkedin (20)

Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & Zing
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overview
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
HivePart1.pptx
HivePart1.pptxHivePart1.pptx
HivePart1.pptx
 
Build tons of multi-device JavaScript applications - Part 1 : Boilerplate, de...
Build tons of multi-device JavaScript applications - Part 1 : Boilerplate, de...Build tons of multi-device JavaScript applications - Part 1 : Boilerplate, de...
Build tons of multi-device JavaScript applications - Part 1 : Boilerplate, de...
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
 
Custom post-framworks
Custom post-framworksCustom post-framworks
Custom post-framworks
 
Custom post-framworks
Custom post-framworksCustom post-framworks
Custom post-framworks
 
DBIx-DataModel v2.0 in detail
DBIx-DataModel v2.0 in detail DBIx-DataModel v2.0 in detail
DBIx-DataModel v2.0 in detail
 
Survey of Front End Topics in Rails
Survey of Front End Topics in RailsSurvey of Front End Topics in Rails
Survey of Front End Topics in Rails
 
Death of a Themer
Death of a ThemerDeath of a Themer
Death of a Themer
 
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
 
barplotv4.pdf
barplotv4.pdfbarplotv4.pdf
barplotv4.pdf
 
Getting started with PostGIS geographic database
Getting started with PostGIS geographic databaseGetting started with PostGIS geographic database
Getting started with PostGIS geographic database
 
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAGetting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
 
Hive Hadoop
Hive HadoopHive Hadoop
Hive Hadoop
 
Web applications with Catalyst
Web applications with CatalystWeb applications with Catalyst
Web applications with Catalyst
 

More from Jeff Hammerbacher (20)

20120223keystone
20120223keystone20120223keystone
20120223keystone
 
20100714accel
20100714accel20100714accel
20100714accel
 
20100608sigmod
20100608sigmod20100608sigmod
20100608sigmod
 
20100513brown
20100513brown20100513brown
20100513brown
 
20100423sage
20100423sage20100423sage
20100423sage
 
20100418sos
20100418sos20100418sos
20100418sos
 
20100301icde
20100301icde20100301icde
20100301icde
 
20100201hplabs
20100201hplabs20100201hplabs
20100201hplabs
 
20100128ebay
20100128ebay20100128ebay
20100128ebay
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20091110startup2startup
20091110startup2startup20091110startup2startup
20091110startup2startup
 
20091030nasajpl
20091030nasajpl20091030nasajpl
20091030nasajpl
 
20091027genentech
20091027genentech20091027genentech
20091027genentech
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
 
20090622 Velocity
20090622 Velocity20090622 Velocity
20090622 Velocity
 
20090422 Www
20090422 Www20090422 Www
20090422 Www
 
20090309berkeley
20090309berkeley20090309berkeley
20090309berkeley
 
20081022cca
20081022cca20081022cca
20081022cca
 
20081009nychive
20081009nychive20081009nychive
20081009nychive
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

20081030linkedin

  • 1.
  • 2. An Introduction to Hive: Components and Query Language Jeff Hammerbacher Chief Scientist and VP of Product October 30, 2008
  • 3. Hive Components A Leaky Database ▪ Hadoop ▪ HDFS ▪ MapReduce (bundles Resource Manager and Job Scheduler) ▪ Hive ▪ Logical data partitioning ▪ Metastore (command line and web interfaces) ▪ Query Language ▪ Libraries to handle different serialization formats (SerDes) ▪ JDBC interface
  • 4. Related Work Glaringly Incomplete ▪ Gamma, Bubba, Volcano, etc. ▪ Google: Sawzall ▪ Yahoo: Pig ▪ IBM Research: JAQL ▪ Microsoft: SCOPE ▪ Greenplum: YAML MapReduce ▪ Aster Data: In-Database MapReduce ▪ Business.com: CloudBase
  • 5. Hive Resources
    ▪ Facebook Mirror: http://mirror.facebook.com/facebook/hive
      ▪ Currently the best place to get the Hive distribution
    ▪ Wiki page: http://wiki.apache.org/hadoop/Hive
      ▪ Getting started: http://wiki.apache.org/hadoop/Hive/GettingStarted
      ▪ Query language reference: http://wiki.apache.org/hadoop/Hive/HiveQL
      ▪ Presentations: http://wiki.apache.org/hadoop/Hive/Presentations
      ▪ Roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap
    ▪ Mailing list: hive-users@publists.facebook.com
    ▪ JIRA: https://issues.apache.org/jira/browse/HADOOP/component/12312455
  • 6. Running Hive: Quickstart
    ▪ <install Hadoop>
    ▪ wget http://mirror.facebook.com/facebook/hive/hadoop-0.19/dist.tar.gz
      ▪ (Replace 0.19 with 0.17 if you’re still on 0.17)
    ▪ tar xvzf dist.tar.gz
    ▪ cd dist
    ▪ export HADOOP=<path to bin/hadoop in your Hadoop distribution>
      ▪ Or: edit hadoop.bin.path and hadoop.conf.dir in conf/hive-default.xml
    ▪ bin/hive
    ▪ hive>
  • 7. Running Hive: Configuration Details
    ▪ conf/hive-default.xml
      ▪ hadoop.bin.path: Points to bin/hadoop in your Hadoop installation
      ▪ hadoop.config.dir: Points to conf/ in your Hadoop installation
      ▪ hive.exec.scratchdir: HDFS directory where execution information is written
      ▪ hive.metastore.warehouse.dir: HDFS directory managed by Hive
      ▪ The rest of the properties relate to the Metastore
    ▪ conf/hive-log4j.properties
      ▪ Will put data into /tmp/{user.name}/hive.log by default
    ▪ conf/jpox.properties
      ▪ JPOX is a Java object persistence library used by the Metastore
  • 8. Populating Hive: MovieLens Data
    ▪ <cd into your hive directory>
    ▪ wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
    ▪ tar xvzf ml-data.tar__0.gz
    ▪ CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime TIMESTAMP) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
      ▪ The first query can take ten seconds or more, as the Metastore needs to be created
    ▪ To confirm the table has been created:
      ▪ SHOW TABLES;
      ▪ DESCRIBE u_data;
    ▪ LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data;
    ▪ SELECT COUNT(1) FROM u_data;
      ▪ Should fire off 2 MapReduce jobs and ultimately return a count of 100,000
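The u.data file behind this table is tab-delimited, one (userid, movieid, rating, unixtime) record per line. As a quick sanity check outside Hive, a row can be parsed with a few lines of Python (the parser below is illustrative, not part of Hive, and the sample line is made up to match the format):

```python
# Parse one tab-delimited MovieLens row into the (userid, movieid,
# rating, unixtime) fields that the u_data table expects.
def parse_u_data_row(line):
    userid, movieid, rating, unixtime = line.strip().split('\t')
    return int(userid), int(movieid), int(rating), int(unixtime)

row = parse_u_data_row('196\t242\t3\t881250949\n')
print(row)  # (196, 242, 3, 881250949)
```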
  • 9. Hive Query Language: Utility Statements
    ▪ SHOW TABLES [table_name | table_name_pattern]
    ▪ DESCRIBE [EXTENDED] table_name [PARTITION (partition_col = partition_col_value, ...)]
    ▪ EXPLAIN [EXTENDED] query_statement
    ▪ SET [EXTENDED]
      ▪ “SET property_name=property_value” to modify a value
  • 10. Hive Query Language: CREATE TABLE Syntax
    ▪ CREATE [EXTERNAL] TABLE table_name (col_name data_type [col_comment], ...) [PARTITIONED BY (col_name data_type [col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path]
    ▪ PARTITION columns are virtual columns; they are not part of the data itself but are derived on load
    ▪ CLUSTERED columns are real columns, hash partitioned into num_buckets folders
    ▪ ROW FORMAT can be used to specify a delimited data set or a custom deserializer
    ▪ Use EXTERNAL with ROW FORMAT, STORED AS, and LOCATION to analyze HDFS files in place
      ▪ “DROP TABLE table_name” can reverse this operation
    ▪ NB: Currently, DROP TABLE will delete both data and metadata
  • 11. Hive Query Language: CREATE TABLE Syntax, Part Two
    ▪ data_type: primitive_type | array_type | map_type
    ▪ primitive_type:
      ▪ TINYINT | INT | BIGINT | BOOLEAN | FLOAT | DOUBLE | STRING
      ▪ DATE | DATETIME | TIMESTAMP
    ▪ array_type: ARRAY < primitive_type >
    ▪ map_type: MAP < primitive_type, primitive_type >
    ▪ row_format:
      ▪ DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
      ▪ SERIALIZER serde_name [WITH PROPERTIES property_name=property_value, property_name=property_value, ...]
    ▪ file_format: SEQUENCEFILE | TEXTFILE
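To make the DELIMITED collection terminators concrete, here is a rough Python sketch of how one field decodes into an ARRAY or MAP value. This is not Hive's actual SerDe code; the comma and colon delimiters are just example choices standing in for COLLECTION ITEMS TERMINATED BY and MAP KEYS TERMINATED BY:

```python
# Decode a delimited field into an ARRAY (list) value.
def decode_array(field, item_term=','):
    return field.split(item_term)

# Decode a delimited field into a MAP (dict) value: items are split by
# the collection terminator, then each item by the map-key terminator.
def decode_map(field, item_term=',', key_term=':'):
    return dict(item.split(key_term, 1) for item in field.split(item_term))

print(decode_array('a,b,c'))      # ['a', 'b', 'c']
print(decode_map('k1:v1,k2:v2'))  # {'k1': 'v1', 'k2': 'v2'}
```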
  • 12. Hive Query Language: ALTER TABLE Syntax
    ▪ ALTER TABLE table_name RENAME TO new_table_name;
    ▪ ALTER TABLE table_name ADD COLUMNS (col_name data_type [col_comment], ...);
    ▪ ALTER TABLE table_name DROP partition_spec, partition_spec, ...;
    ▪ Future work:
      ▪ Support for removing or renaming columns
      ▪ Support for altering serialization format
  • 13. Hive Query Language: LOAD DATA Syntax
    ▪ LOAD DATA [LOCAL] INPATH '/path/to/file' [OVERWRITE] INTO TABLE table_name [PARTITION (partition_col = partition_col_value, ...)]
    ▪ You can load data from the local filesystem or anywhere in HDFS (cf. CREATE TABLE EXTERNAL)
    ▪ If you don’t specify OVERWRITE, data will be appended to the existing table
  • 14. Hive Query Language: SELECT Syntax
    ▪ [insert_clause] SELECT [ALL|DISTINCT] select_list FROM [table_source|join_source] [WHERE where_condition] [GROUP BY col_list] [ORDER BY col_list] [CLUSTER BY col_list]
    ▪ insert_clause: INSERT OVERWRITE destination
    ▪ destination:
      ▪ LOCAL DIRECTORY '/local/path'
      ▪ DIRECTORY '/hdfs/path'
      ▪ TABLE table_name [PARTITION (partition_col = partition_col_value, ...)]
  • 15. Hive Query Language: SELECT Syntax
    ▪ join_source: table_source join_clause table_source join_clause table_source ...
    ▪ join_clause:
      ▪ [LEFT OUTER|RIGHT OUTER|FULL OUTER] JOIN ON (equality_expression, equality_expression, ...)
    ▪ Currently, only outer equi-joins are supported in Hive
    ▪ There are two join algorithms:
      ▪ Map-side merge join
      ▪ Reduce-side merge join
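Both algorithms come down to the same core move: walk two inputs sorted on the join key and emit matching pairs. A minimal Python sketch of that equi-join merge over (key, value) lists, not Hive's actual implementation:

```python
# Merge-join two lists of (key, value) pairs, each sorted by key.
# Emits (key, left_value, right_value) for every matching pair.
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk = right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of right-side rows sharing this key.
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, lv, right[j2][1]))
                j2 += 1
            i += 1
    return out

print(merge_join([(1, 'a'), (2, 'b')], [(2, 'x'), (2, 'y'), (3, 'z')]))
# [(2, 'b', 'x'), (2, 'b', 'y')]
```

In Hive's map-side variant the sorted/bucketed inputs are merged inside the mappers; in the reduce-side variant the shuffle groups rows by join key before the same merge happens in the reducers.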
  • 16. Hive Query Language: Building a Histogram of Review Counts
    ▪ CREATE TABLE review_counts (userid INT, review_count INT);
    ▪ INSERT OVERWRITE TABLE review_counts SELECT a.userid, COUNT(1) AS review_count FROM u_data a GROUP BY a.userid;
    ▪ SELECT b.review_count, COUNT(1) FROM review_counts b GROUP BY b.review_count;
    ▪ Notes:
      ▪ No INSERT OVERWRITE for the second query means output is dumped to the shell
      ▪ Hive does not currently support CREATE TABLE AS, so we have to create the table and then INSERT into it
      ▪ Hive does not currently support subqueries, so we have to write two queries
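The two GROUP BY stages can be mimicked in a few lines of Python to see what the queries compute; the sample (userid, movieid, rating) triples below are made up:

```python
from collections import Counter

# Stage 1: reviews per user (the INSERT OVERWRITE ... GROUP BY a.userid).
# Stage 2: histogram of those counts (the GROUP BY b.review_count).
ratings = [(1, 10, 4), (1, 11, 5), (2, 10, 3), (3, 12, 2), (3, 13, 5)]

review_counts = Counter(userid for userid, _, _ in ratings)  # userid -> #reviews
histogram = Counter(review_counts.values())                  # #reviews -> #users

print(dict(histogram))  # {2: 2, 1: 1}
```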
  • 17. Hive Query Language: Running Custom MapReduce
    ▪ Put the following into weekday_mapper.py:
        import sys
        import datetime
        for line in sys.stdin:
          line = line.strip()
          userid, movieid, rating, unixtime = line.split('\t')
          weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
          print ','.join([userid, movieid, rating, str(weekday)])
    ▪ CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    ▪ FROM u_data a INSERT OVERWRITE TABLE u_data_new SELECT TRANSFORM (a.userid, a.movieid, a.rating, a.unixtime) AS (userid, movieid, rating, weekday) USING 'python /full/path/to/weekday_mapper.py';
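The mapper above streams stdin to stdout, as Hive's TRANSFORM expects (and uses the Python 2 print statement of the era). Restated as a Python 3 function so the per-row logic can be tested outside Hive; note that fromtimestamp converts in local time, so the weekday can shift with the machine's timezone:

```python
import datetime

# Per-row logic of weekday_mapper.py as a plain function:
# tab-separated input row in, comma-separated row with weekday out.
def map_row(line):
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return ','.join([userid, movieid, rating, str(weekday)])

print(map_row('196\t242\t3\t881250949'))
```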
  • 18. Hive Query Language: Programmatic Access
    ▪ The Hive shell can take a file with queries to be executed
      ▪ bin/hive -f /path/to/query/file
    ▪ You can also run a Hive query straight from the command line
      ▪ bin/hive -e 'quoted query string'
    ▪ A simple JDBC interface is available for experimentation as well
      ▪ https://issues.apache.org/jira/browse/HADOOP-4101
  • 19. Hive Components: Metastore
    ▪ Currently uses an embedded Derby database for persistence
      ▪ While Derby is in place, you’ll need to put it into Server Mode to have more than one concurrent Hive user
      ▪ See http://wiki.apache.org/hadoop/HiveDerbyServerMode
    ▪ Next release will use MySQL as the default persistent data store
      ▪ The goal is to have the persistent store be pluggable
    ▪ You can view the Thrift IDL for the Metastore online
      ▪ https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/hive/metastore/if/hive_metastore.thrift
  • 20. Hive Components: Query Processing
    ▪ Compiler
      ▪ Parser
      ▪ Type Checking
      ▪ Semantic Analysis
      ▪ Plan Generation
      ▪ Task Generation
    ▪ Execution Engine
      ▪ Plan
      ▪ Operators
      ▪ UDFs and UDAFs
  • 21. Future Directions
    ▪ Query Optimization
      ▪ Support for Statistics
        ▪ These stats are needed to make optimization decisions
      ▪ Join Optimizations
        ▪ Map-side joins, semi-join techniques, etc., to do the join faster
      ▪ Predicate Pushdown Optimizations
        ▪ Pushing predicates just above the table scan in certain join situations, and ensuring that only required columns are sent across map/reduce boundaries
      ▪ Group By Optimizations
        ▪ Various optimizations to make GROUP BY faster
      ▪ Optimizations to reduce the number of map files created by filter operations
        ▪ Filters with a large number of mappers produce a lot of files, which slows down the following operations
  • 22. Future Directions
    ▪ MapReduce Integration
      ▪ Schema-less MapReduce
        ▪ TRANSFORM needs a schema, while MapReduce is schema-less
      ▪ Improvements to TRANSFORM
        ▪ Make this more intuitive to MapReduce developers: evaluate some other keywords, etc.
    ▪ User Experience
      ▪ Create a web interface
      ▪ Error reporting improvements for parse errors
      ▪ Add “help” command to the CLI
      ▪ JDBC driver to enable traditional database tools to be used with Hive
  • 23. Future Directions
    ▪ Integrating Dynamic SerDe with the DDL
      ▪ This allows users to create typed tables, along with list and map types, from the DDL
    ▪ Transformations in LOAD DATA
      ▪ LOAD DATA currently does not transform the input data if it is not in the format expected by the destination table
    ▪ Explode and Collect Operators
      ▪ Operators to convert collections to individual items and vice versa
    ▪ Propagating sort properties to destination tables
      ▪ If the query produces sorted output, we want to capture that in the destination table’s metadata so that downstream optimizations can be enabled
  • 24. (c) 2008 Cloudera, Inc. or its licensors. “Cloudera” is a registered trademark of Cloudera, Inc. All rights reserved. 1.0