Hive: Data Warehousing for Hadoop


bigdatasyd

Ben Lever, NICTA Meetup #2, 27 Mar 2012 - http://sydney.bigdataaustralia.com.au/events/53934632/


Hive: Data Warehousing for
Hadoop

Ben Lever
@bmlever

Big Data Analytics Meetup
27 March 2012

Another Data Warehousing System?
• Problem:
– Lots of data
• Partial solution:
– Hadoop
• Another problem:
– MapReduce can be hard
– Schema information is embedded in each program,
yet a lot of data is still structured

Solution: Hive
• A system for querying and managing
structured data within Hadoop
– MapReduce for execution
– HDFS for storage
• Designed for end users who know more SQL
than Java
• Apache License v2
• hive.apache.org

Working example: MovieLens
• Movie ratings
• 3 “tables”:

Users        Movies         Ratings
id           id             user id
age          title          movie id
gender       release date   rating (1 – 5)
occupation   action         timestamp
zip code     adventure
             romance
             ...

www.grouplens.org

So far
• Hive shell
• Creating and loading tables
• Data model:
– INT, BIGINT, TINYINT, STRING, etc
– Also: FLOAT, DOUBLE, ARRAY, MAP, STRUCT
• Simple queries with filtering
• Table data is immutable
• Schema on read vs schema on write
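
A minimal HiveQL sketch of the steps recapped above, using the MovieLens users table. The local path /tmp/ml-100k/u.user and the column name zip_code are assumptions for illustration; the pipe delimiter matches the u.user file in the MovieLens 100k download.

```sql
-- Declare the table; the schema is stored in the metastore.
CREATE TABLE users (
  id         INT,
  age        INT,
  gender     STRING,
  occupation STRING,
  zip_code   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Load the raw file (path is an assumption).
LOAD DATA LOCAL INPATH '/tmp/ml-100k/u.user'
OVERWRITE INTO TABLE users;

-- A simple query with filtering.
SELECT id, age, occupation
FROM users
WHERE gender = 'F' AND age < 30
LIMIT 10;
```

LOAD DATA places the file in the table's directory without parsing it; the declared schema is only applied when a query reads the rows, which is what "schema on read" refers to.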

Hive components
• CLI: submits the Hive query (SQL-like):
SELECT *
FROM customers
WHERE gender = 'M';
• Driver: launches the MapReduce job
• Metastore: holds schema info:
TABLE customer (
customer_id BIGINT,
gender STRING,
...
• HDFS: raw source data (compressed)
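
One way to look at these components from the Hive shell, assuming a customers table like the one in the diagram already exists. This is a sketch, not part of the original demo: DESCRIBE reads the schema held by the metastore, and EXPLAIN prints the plan the driver would execute without launching the job.

```sql
-- Schema info comes from the metastore, not from the data files.
DESCRIBE customers;

-- Show the stages Hive would run for the query, without running them.
EXPLAIN
SELECT *
FROM customers
WHERE gender = 'M';
```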

Metastore

Source: Hadoop – The Definitive Guide

Other SQL-like features

• Aggregation – COUNT, AVG
• JOIN
• GROUP BY
• SORT BY
• Sub queries
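
A sketch that combines the features listed above on the MovieLens data. The table and column names (ratings.movie_id, ratings.rating, movies.id, movies.title) are assumptions based on the working example, not the names used in the original demo.

```sql
-- Movies with at least 100 ratings, with their average rating.
SELECT t.title, t.num_ratings, t.avg_rating
FROM (
  -- Sub query: join ratings to movies and aggregate per title.
  SELECT m.title       AS title,
         COUNT(*)      AS num_ratings,
         AVG(r.rating) AS avg_rating
  FROM ratings r
  JOIN movies m ON (r.movie_id = m.id)
  GROUP BY m.title
) t
WHERE t.num_ratings >= 100
SORT BY avg_rating DESC;
```

SORT BY orders rows within each reducer only; ORDER BY gives a total order at the cost of funnelling all rows through a single reducer.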

Built in functions
• Text mining:
– ngrams()
– context_ngrams()
– sentences()
• Statistics + mathematics:
– stddev()
– histogram_numeric()
– log
– radians
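
A short sketch of how these built-ins might be called, again assuming movies(title) and ratings(rating) tables from the working example.

```sql
-- Text mining: estimated top 10 two-word sequences in movie titles.
SELECT ngrams(sentences(lower(title)), 2, 10)
FROM movies;

-- Statistics: standard deviation and a 5-bin histogram of ratings.
SELECT stddev(rating), histogram_numeric(rating, 5)
FROM ratings;

-- Mathematics: base-2 logarithm of each rating, and 90 degrees in radians.
SELECT rating, log(2, rating), radians(90.0)
FROM ratings
LIMIT 5;
```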

User Defined Functions
• Written in Java
• User Defined Functions (UDFs):
– Single row → Single row
– e.g. mathematical and string functions
• User Defined Aggregate Functions (UDAFs):
– Multiple rows → Single row
– e.g. AVG
• User Defined Table Functions (UDTFs):
– Single row → Multiple rows
– e.g. “explode”
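
The three kinds of function can be illustrated from the query side with built-ins, followed by the statements typically used to register a custom Java UDF. The table names, the jar path /tmp/my-hive-udfs.jar and the class com.example.hive.MyUpper are hypothetical.

```sql
-- UDF: one row in, one row out.
SELECT upper(title) FROM movies;

-- UDAF: many rows in, one row out (per group).
SELECT movie_id, avg(rating) FROM ratings GROUP BY movie_id;

-- UDTF: one row in, many rows out (one row per word of the title).
SELECT explode(split(title, ' ')) AS word FROM movies;

-- Registering a custom Java UDF (jar path and class name are hypothetical).
ADD JAR /tmp/my-hive-udfs.jar;
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.MyUpper';
SELECT my_upper(title) FROM movies;
```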

Hive Clients

Source: Hadoop – The Definitive Guide

Sqoop
Move data between Hadoop
and relational databases

Diagram: RDBMS ↔ Sqoop ↔ Hadoop (Hive); the table schema is recorded in the Metastore

http://incubator.apache.org/projects/sqoop.html

Conclusion
• Scales to handle much more data than traditional
systems:
– Leverages Hadoop HDFS and MapReduce
– Relational/structured data
– Schema on read vs schema on write
• Supports rapid iteration of ad-hoc queries
– SQL-like querying language
– Complex queries (joins, etc) with minimal code
• Is not a database replacement:
– Treats data as immutable
– No indexing

Remaining slides (not transcribed above):

5. Demo
10. Demo
14. Hive Server JDBC ODBC
16. Sqoop adapters

Editor's notes

MovieLens data set: # of users = 943, # of movies = 1682, # of ratings = 100,000
Hive components: Shell, Driver, Compiler, Execution engine, Metastore
