SlideShare une entreprise Scribd logo
1  sur  26
Data Warehouse applications with Hive
Motivation 
 Limitation of MR 
 Have to use M/R model 
 Not Reusable 
 Error prone 
 For complex jobs: 
 Multiple stage of Map/Reduce functions 
 write specify physical execution plan in the database
What is HIVE? 
• A system for querying and managing structured data 
built on top of Map/Reduce and Hadoop 
• Example: 
– Structured logs with rich data types (structs, lists and maps) 
– A user base wanting to access this data in the language of 
their choice 
– A lot of traditional SQL workloads on this data (filters, joins 
and aggregations) 
– Other non SQL workloads
Overview 
 Intuitive 
 Make the unstructured data looks like tables regardless 
how it really lay out 
 SQL based query can be directly against these tables 
 Generate specify execution plan for this query 
 What’s Hive 
 A data warehousing system to store structured data on 
Hadoop file system 
 Provide an easy query these data by execution Hadoop 
MapReduce plans
Dealing with Structured Data 
• Type system 
– Primitive types 
– Recursively build up using Composition/Maps/Lists 
• Generic (De)Serialization Interface (SerDe) 
– To recursively list schema 
– To recursively access fields within a row object 
• Serialization families implement interface 
– Thrift DDL based SerDe 
– Delimited text based SerDe 
– You can write your own SerDe 
• Schema Evolution
MetaStore 
• Stores Table/Partition properties: 
– Table schema and SerDe library 
– Table Location on HDFS 
– Logical Partitioning keys and types 
– Other information 
• Thrift API 
– Current clients in Php (Web Interface), Python (old CLI), Java 
(Query Engine and CLI), Perl (Tests) 
• Metadata can be stored as text files or even in a SQL 
backend
Data Model 
 Tables 
 Basic type columns (int, float, boolean) 
 Complex type: List / Map ( associate array) 
 Partitions 
 Buckets 
 CREATE TABLE sales( id INT, items 
ARRAY<STRUCT<id:INT,name:STRING> 
) PARITIONED BY (ds STRING) 
CLUSTERED BY (id) INTO 32 BUCKETS; 
 SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
Metadata 
 Database namespace 
 Table definitions 
 schema info, physical location In HDFS 
 Partition data 
 ORM Framework 
 All the metadata can be stored in Derby by default 
 Any database with JDBC can be configed
Hive CLI 
• DDL: 
– create table/drop table/rename table 
– alter table add column 
• Browsing: 
– show tables 
– describe table 
– cat table 
• Loading Data 
• Queries
Architecture 
HDFS 
Hive CLI 
Browsing Queries DDL 
Map Reduce 
Thrift API 
MetaStore 
Execution 
SerDe 
Thrift Jute JSON.. 
Hive QL 
Parser 
Planner 
Mgmt. Web UI
Data Model 
Hash 
Partitioning 
Logical Partitioning 
Schema 
Library 
clicks 
/hive/clicks 
/hive/clicks/ds=2008-03-25 
/hive/clicks/ds=2008-03-25/0 
… 
Tables 
HDFS MetaStore 
#Buckets=32 
Bucketing Info 
Partitioning Cols
Hive Query Language 
• Philosophy 
– SQL like constructs + Hadoop Streaming 
• Query Operators in initial version 
– Projections 
– Equijoins and Cogroups 
– Group by 
– Sampling 
• Output of these operators can be: 
– passed to Streaming mappers/reducers 
– can be stored in another Hive Table 
– can be output to HDFS files 
– can be output to local files
Hive Query Language 
• Package these capabilities into a more formal SQL like 
query language in next version 
• Introduce other important constructs: 
– Ability to stream data thru custom mappers/reducers 
– Multi table inserts 
– Multiple group bys 
– SQL like column expressions and some XPath like expressions 
– Etc..
Joins 
• Joins 
FROM page_view pv JOIN user u ON (pv.userid = u.id) 
INSERT INTO TABLE pv_users 
SELECT pv.*, u.gender, u.age 
WHERE pv.date = 2008-03-03; 
• Outer Joins 
FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = 
u.id) INSERT INTO TABLE pv_users 
SELECT pv.*, u.gender, u.age 
WHERE pv.date = 2008-03-03;
Hive QL – Join 
pag 
eid 
use 
rid 
1 111 9:08:01 
2 111 9:08:13 
1 222 9:08:14 
• QL: 
time 
use 
rid 
age gende 
r 
111 25 femal 
e 
222 32 male 
pag 
eid 
INSERT INTO TABLE pv_users 
SELECT pv.pageid, u.age 
FROM page_view pv JOIN user u ON (pv.userid = u.userid); 
age 
1 25 
2 25 
1 32 
X = 
page_view 
user 
pv_users
Aggregations and Multi-Table Inserts 
FROM pv_users 
INSERT INTO TABLE pv_gender_uu 
SELECT pv_users.gender, count(DISTINCT 
pv_users.userid) 
GROUP BY(pv_users.gender) 
INSERT INTO TABLE pv_ip_uu 
SELECT pv_users.ip, count(DISTINCT pv_users.id) 
GROUP BY(pv_users.ip);
Running Custom Map/Reduce Scripts 
FROM ( 
FROM pv_users 
SELECT TRANSFORM(pv_users.userid, 
pv_users.date) USING 'map_script' 
AS(dt, uid) 
CLUSTER BY(dt)) map 
INSERT INTO TABLE pv_users_reduced 
SELECT TRANSFORM(map.dt, map.uid) USING 
'reduce_script' AS (date, count);
Inserts into Files, Tables and Local Files 
FROM pv_users 
INSERT INTO TABLE pv_gender_sum 
SELECT pv_users.gender, 
count_distinct(pv_users.userid) 
GROUP BY(pv_users.gender) 
INSERT INTO DIRECTORY 
‘/user/facebook/tmp/pv_age_sum.dir’ 
SELECT pv_users.age, count_distinct(pv_users.userid) 
GROUP BY(pv_users.age) 
INSERT INTO LOCAL DIRECTORY 
‘/home/me/pv_age_sum.dir’ 
FIELDS TERMINATED BY ‘,’ LINES TERMINATED BY 013 
SELECT pv_users.age, count_distinct(pv_users.userid) 
GROUP BY(pv_users.age);
Wordcount in Hive 
FROM ( 
MAP doctext USING 'python wc_mapper.py' AS (word, 
cnt) 
FROM docs 
CLUSTER BY word 
) map_output 
REDUCE word, cnt USING 'pythonwc_reduce.py';
Performance 
 GROUP BY operation 
 Efficient execution plans based on: 
 Data skew: 
 how evenly distributed data across a number of 
physical nodes 
 bottleneck VS load balance 
 Partial aggregation: 
 Group the data with the same group by value as 
soon as possible 
 In memory hash-table for mapper 
 Earlier than combiner
Performance 
 JOIN operation 
 Traditional Map-Reduce Join 
 Early Map-side Join 
 very efficient for joining a small table with a large table 
 Keep smaller table data in memory first 
 Join with a chunk of larger table data each time 
 Space complexity for time complexity
Performance 
 Ser/De 
 Describe how to load the data from the file into a 
representation that make it looks like a table; 
 Lazy load 
 Create the field object when necessary 
 Reduce the overhead to create unnecessary objects in Hive 
 Java is expensive to create objects 
 Increase performance
Pros 
 Pros 
 A easy way to process large scale data 
 Support SQL-based queries 
 Provide more user defined interfaces to extend 
 Programmability 
 Efficient execution plans for performance 
 Interoperability with other database tools
Cons 
 Cons 
 No easy way to append data 
 Files in HDFS are immutable
Application 
 Log processing 
 Daily Report 
 User Activity Measurement 
 Data/Text mining 
 Machine learning (Training Data) 
 Business intelligence 
 Advertising Delivery 
 Spam Detection
End of session 
Day – 4: Data Warehouse applications with Hive

Contenu connexe

Tendances

Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandasPurna Chander K
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Rohit Agrawal
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 

Tendances (20)

Apache hive
Apache hiveApache hive
Apache hive
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandas
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop
HadoopHadoop
Hadoop
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 

Similaire à 02 data warehouse applications with hive

Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Facebook hadoop-summit
Facebook hadoop-summitFacebook hadoop-summit
Facebook hadoop-summitS S
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalknzhang
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010John Sichi
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxvishwasgarade1
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008athusoo
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainYahoo Developer Network
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoopChirag Ahuja
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalMichael Rainey
 

Similaire à 02 data warehouse applications with hive (20)

Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Facebook hadoop-summit
Facebook hadoop-summitFacebook hadoop-summit
Facebook hadoop-summit
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalk
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
 
Hive Hadoop
Hive HadoopHive Hadoop
Hive Hadoop
 
20080529dublinpt3
20080529dublinpt320080529dublinpt3
20080529dublinpt3
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Unit 5-lecture4
Unit 5-lecture4Unit 5-lecture4
Unit 5-lecture4
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 

Plus de Subhas Kumar Ghosh

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descentSubhas Kumar Ghosh
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)Subhas Kumar Ghosh
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysisSubhas Kumar Ghosh
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitionerSubhas Kumar Ghosh
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepSubhas Kumar Ghosh
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operationSubhas Kumar Ghosh
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configurationSubhas Kumar Ghosh
 

Plus de Subhas Kumar Ghosh (20)

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
04 pig data operations
04 pig data operations04 pig data operations
04 pig data operations
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Hadoop Day 3
Hadoop Day 3Hadoop Day 3
Hadoop Day 3
 
Hadoop exercise
Hadoop exerciseHadoop exercise
Hadoop exercise
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
Hadoop job chaining
Hadoop job chainingHadoop job chaining
Hadoop job chaining
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitioner
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Hadoop availability
Hadoop availabilityHadoop availability
Hadoop availability
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 

Dernier

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 

Dernier (20)

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 

02 data warehouse applications with hive

  • 2. Motivation  Limitation of MR  Have to use M/R model  Not Reusable  Error prone  For complex jobs:  Multiple stage of Map/Reduce functions  write specify physical execution plan in the database
  • 3. What is HIVE? • A system for querying and managing structured data built on top of Map/Reduce and Hadoop • Example: – Structured logs with rich data types (structs, lists and maps) – A user base wanting to access this data in the language of their choice – A lot of traditional SQL workloads on this data (filters, joins and aggregations) – Other non SQL workloads
  • 4. Overview  Intuitive  Make the unstructured data looks like tables regardless how it really lay out  SQL based query can be directly against these tables  Generate specify execution plan for this query  What’s Hive  A data warehousing system to store structured data on Hadoop file system  Provide an easy query these data by execution Hadoop MapReduce plans
  • 5. Dealing with Structured Data • Type system – Primitive types – Recursively build up using Composition/Maps/Lists • Generic (De)Serialization Interface (SerDe) – To recursively list schema – To recursively access fields within a row object • Serialization families implement interface – Thrift DDL based SerDe – Delimited text based SerDe – You can write your own SerDe • Schema Evolution
  • 6. MetaStore • Stores Table/Partition properties: – Table schema and SerDe library – Table Location on HDFS – Logical Partitioning keys and types – Other information • Thrift API – Current clients in Php (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests) • Metadata can be stored as text files or even in a SQL backend
  • 7. Data Model  Tables  Basic type columns (int, float, boolean)  Complex type: List / Map ( associate array)  Partitions  Buckets  CREATE TABLE sales( id INT, items ARRAY<STRUCT<id:INT,name:STRING> ) PARITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;  SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
  • 8. Metadata  Database namespace  Table definitions  schema info, physical location In HDFS  Partition data  ORM Framework  All the metadata can be stored in Derby by default  Any database with JDBC can be configed
  • 9. Hive CLI • DDL: – create table/drop table/rename table – alter table add column • Browsing: – show tables – describe table – cat table • Loading Data • Queries
  • 10. Architecture HDFS Hive CLI Browsing Queries DDL Map Reduce Thrift API MetaStore Execution SerDe Thrift Jute JSON.. Hive QL Parser Planner Mgmt. Web UI
  • 11. Data Model Hash Partitioning Logical Partitioning Schema Library clicks /hive/clicks /hive/clicks/ds=2008-03-25 /hive/clicks/ds=2008-03-25/0 … Tables HDFS MetaStore #Buckets=32 Bucketing Info Partitioning Cols
  • 12. Hive Query Language • Philosophy – SQL like constructs + Hadoop Streaming • Query Operators in initial version – Projections – Equijoins and Cogroups – Group by – Sampling • Output of these operators can be: – passed to Streaming mappers/reducers – can be stored in another Hive Table – can be output to HDFS files – can be output to local files
  • 13. Hive Query Language • Package these capabilities into a more formal SQL like query language in next version • Introduce other important constructs: – Ability to stream data thru custom mappers/reducers – Multi table inserts – Multiple group bys – SQL like column expressions and some XPath like expressions – Etc..
  • 14. Joins • Joins FROM page_view pv JOIN user u ON (pv.userid = u.id) INSERT INTO TABLE pv_users SELECT pv.*, u.gender, u.age WHERE pv.date = 2008-03-03; • Outer Joins FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id) INSERT INTO TABLE pv_users SELECT pv.*, u.gender, u.age WHERE pv.date = 2008-03-03;
  • 15. Hive QL – Join pag eid use rid 1 111 9:08:01 2 111 9:08:13 1 222 9:08:14 • QL: time use rid age gende r 111 25 femal e 222 32 male pag eid INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid); age 1 25 2 25 1 32 X = page_view user pv_users
  • 16. Aggregations and Multi-Table Inserts FROM pv_users INSERT INTO TABLE pv_gender_uu SELECT pv_users.gender, count(DISTINCT pv_users.userid) GROUP BY(pv_users.gender) INSERT INTO TABLE pv_ip_uu SELECT pv_users.ip, count(DISTINCT pv_users.id) GROUP BY(pv_users.ip);
  • 17. Running Custom Map/Reduce Scripts FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS(dt, uid) CLUSTER BY(dt)) map INSERT INTO TABLE pv_users_reduced SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count);
  • 18. Inserts into Files, Tables and Local Files FROM pv_users INSERT INTO TABLE pv_gender_sum SELECT pv_users.gender, count_distinct(pv_users.userid) GROUP BY(pv_users.gender) INSERT INTO DIRECTORY ‘/user/facebook/tmp/pv_age_sum.dir’ SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY(pv_users.age) INSERT INTO LOCAL DIRECTORY ‘/home/me/pv_age_sum.dir’ FIELDS TERMINATED BY ‘,’ LINES TERMINATED BY 013 SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY(pv_users.age);
  • 19. Wordcount in Hive FROM ( MAP doctext USING 'python wc_mapper.py' AS (word, cnt) FROM docs CLUSTER BY word ) map_output REDUCE word, cnt USING 'pythonwc_reduce.py';
  • 20. Performance  GROUP BY operation  Efficient execution plans based on:  Data skew:  how evenly distributed data across a number of physical nodes  bottleneck VS load balance  Partial aggregation:  Group the data with the same group by value as soon as possible  In memory hash-table for mapper  Earlier than combiner
  • 21. Performance  JOIN operation  Traditional Map-Reduce Join  Early Map-side Join  very efficient for joining a small table with a large table  Keep smaller table data in memory first  Join with a chunk of larger table data each time  Space complexity for time complexity
  • 22. Performance  Ser/De  Describe how to load the data from the file into a representation that make it looks like a table;  Lazy load  Create the field object when necessary  Reduce the overhead to create unnecessary objects in Hive  Java is expensive to create objects  Increase performance
  • 23. Pros  Pros  A easy way to process large scale data  Support SQL-based queries  Provide more user defined interfaces to extend  Programmability  Efficient execution plans for performance  Interoperability with other database tools
  • 24. Cons  Cons  No easy way to append data  Files in HDFS are immutable
  • 25. Application  Log processing  Daily Report  User Activity Measurement  Data/Text mining  Machine learning (Training Data)  Business intelligence  Advertising Delivery  Spam Detection
  • 26. End of session Day – 4: Data Warehouse applications with Hive