Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011

Introduction to Data Analysis with Hadoop and Hive
Jonathan Seidman
ChicagoDB
February 21, 2011

About Me

•  Lead Engineer on Business Intelligence/Data Infrastructure team at
Orbitz, former member of Machine Learning team

•  Co-organizer/founder of Chicago Hadoop User Group
(http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/)

•  Recovering Java developer

•  jseidman@orbitz.com

•  @jseidman

•  @OrbitzTalent

page 2

Why Hadoop and Hive?

page 3

Some Hadoop “Clichés” (Which are still true…)

Hadoop allows you to store and process data that was
previously impractical because of cost, technical issues,
etc.

page 4

Utterly redonkulous amounts of money

$ per managed TB

page 5

Utterly redonkulous amounts of money

More reasonable amounts of money

$ per managed TB

page 6

Adding data to our data warehouse also requires a lengthy
plan/implement/deploy cycle.

Because of the expense and time involved, our data teams need to be
very judicious about which data gets added. This means
that potentially valuable data may not be saved.

page 7

Hadoop brings our cost per TB down to $1500 (or even less)

page 8

Hadoop Distributed File System

HDFS provides economical, reliable, fault-tolerant, and
scalable storage of very large datasets across machines in
a cluster.

page 9


Hadoop places no constraints on how data is processed.

page 10


Hadoop makes it relatively easy to efficiently process all the
data stored in HDFS.

MapReduce is a programming model for efficient
distributed processing, designed to reliably perform
computations on large volumes of data in parallel.

MapReduce removes much of the burden of writing
distributed computations.
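
To make the programming model concrete, here is a minimal single-process sketch of the three phases (map, shuffle, reduce) applied to word counting. It is not Hadoop code: the function names are illustrative, and a real cluster would run the map and reduce steps in parallel across machines and handle the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello hadoop", "hello hive"]))
```

The developer writes only the map and reduce functions; the grouping step in between is what the framework takes off your hands.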

page 11

The Problem with MapReduce

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

page 12

Hive Overview

Hive is an open-source data warehousing solution built on top of
Hadoop which allows for easy data summarization, ad-hoc querying
and analysis of large datasets stored in Hadoop.

Developed at Facebook to provide a structured data model over Hadoop
data.

Simplifies Hadoop data analysis – users can use a familiar SQL model
rather than writing low-level custom code.

Hive queries are compiled into Hadoop MapReduce jobs.

Designed for scalability, not low latency.

page 13

Hive provides the basis for a new data analysis infrastructure.

We currently run Hive 0.6.0 with Cloudera CDH2 (Hadoop 0.20.1)

page 14

Hive Architecture (Simplified)

page 15

Hive Overview – Comparison to Traditional DBMSs

Although Hive uses a model familiar to database users, it does not
support a full relational model and only supports a subset of SQL.

Schema on read vs. schema on write

What Hadoop/Hive offers is highly scalable and fault-tolerant
processing of very large data sets.

However, Hive is moving more and more towards being a parallel
DBMS.

page 16

Hive - Data Model

Tables – analogous to tables in a standard RDBMS.

Partitions and buckets – Allow Hive to prune data during
query processing.
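
As a rough illustration of pruning, the toy table below stores rows per partition value and per bucket, so a query that filters on the partition column reads only the matching partition. This is a sketch, not Hive's implementation: the class, the bucket count, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count chosen for the example


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Assign a row to a bucket by hashing its key (CRC32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


class PartitionedTable:
    """Toy table: rows are stored per partition value, then per bucket."""

    def __init__(self):
        self.partitions = defaultdict(lambda: defaultdict(list))
        self.rows_scanned = 0

    def insert(self, dt, session_id, row):
        """Place a row in partition dt and in the bucket for session_id."""
        self.partitions[dt][bucket_for(session_id)].append(row)

    def scan(self, dt):
        """Read only the partition matching dt; other partitions are skipped."""
        result = []
        for rows in self.partitions.get(dt, {}).values():
            self.rows_scanned += len(rows)
            result.extend(rows)
        return result


if __name__ == "__main__":
    table = PartitionedTable()
    table.insert("2011-02-20", "s1", "search-a")
    table.insert("2011-02-21", "s2", "search-b")
    table.insert("2011-02-21", "s3", "search-c")
    print(sorted(table.scan("2011-02-21")), table.rows_scanned)
```

Only the two rows in the requested partition are touched; the row for the other date is never read.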

page 17

Not Yet, But Soon

Multiple databases

Views

Indexes

page 18

Hive – Data Types

Supports primitive types such as int, double, and string.

Also supports complex types such as structs, maps (key/value
tuples), and arrays (indexable lists).

page 19

Extensible Storage Model

Row formats determine how records are stored.

Row format is defined by a SerDe (Serializer-Deserializer).

Container format is determined by the file format.
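
The idea of a SerDe can be sketched in a few lines: one method turns a stored line into typed column values, and another turns a record back into a stored line. The class below is a hypothetical Python illustration for tab-delimited text. It does not reproduce Hive's Java SerDe interface.

```python
class TabDelimitedSerDe:
    """Toy row format: one record per line, fields separated by tabs."""

    def __init__(self, columns):
        # columns is a list of (name, cast) pairs, e.g. ("number_of_guests", int)
        self.columns = columns

    def deserialize(self, line):
        """Turn one stored line into a dict of typed column values."""
        values = line.rstrip("\n").split("\t")
        return {name: cast(value)
                for (name, cast), value in zip(self.columns, values)}

    def serialize(self, record):
        """Turn a record dict back into one stored line (no trailing newline)."""
        return "\t".join(str(record[name]) for name, _ in self.columns)


if __name__ == "__main__":
    serde = TabDelimitedSerDe([("session_id", str), ("number_of_guests", int)])
    record = serde.deserialize("abc123\t2\n")
    print(record)
    print(serde.serialize(record))
```

Because the schema is applied only when a line is read, the same stored file could be read with a different column list without rewriting the data.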

page 20

Hive – Hive Query Language

HiveQL – Supports basic SQL-like operations such as select, join,
aggregate, union, subqueries, etc.

HiveQL queries are compiled into MapReduce jobs.

Supports embedding custom MapReduce scripts.

Built-in support for standard relational, arithmetic, and boolean
operators.

Supports aggregate functions, including statistical functions (avg,
standard deviation, covariance, percentiles).

page 21

Hive – User Defined Functions

HiveQL is extensible through user-defined functions (UDFs)
implemented in Java.

Also supports user-defined aggregation functions (UDAFs).

Provides user-defined table functions (UDTFs) when more than one
value needs to be returned.

page 22

Example UDF – Find hotel’s position in an impression list:

package com.orbitz.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * returns hotel_id's position given a hotel_id and impression list
 */
public final class GetPos extends UDF {
    public Text evaluate(final Text hotel_id, final Text impressions) {
        if (hotel_id == null || impressions == null)
            return null;

        String[] hotels = impressions.toString().split(";");
        String position;
        String id = hotel_id.toString();
        int begin = 0, end = 0;

        for (int i = 0; i < hotels.length; i++) {
            begin = hotels[i].indexOf(",");
            end = hotels[i].lastIndexOf(",");
            position = hotels[i].substring(begin + 1, end);
            if (id.equals(hotels[i].substring(0, begin)))
                return new Text(position);
        }
        return null;
    }
}

page 23


hive> add jar path-to-jar/pos.jar;
hive> create temporary function getpos as
    > 'com.orbitz.hive.GetPos';
hive> select getpos('1',
    > '1,3,100.00;2,1,100.00');
…
3
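
The lookup performed by the UDF can be checked outside Hive with a small Python sketch. It assumes, based only on the sample input above, that impressions are separated by ';' and that each one has three comma-separated fields with the hotel id first and the position second. The function name is illustrative.

```python
def get_pos(hotel_id, impressions):
    """Return the position of hotel_id in an impression list, or None.

    Assumes each impression has three comma-separated fields
    (hotel id, position, one further value) and that impressions
    are separated by ';'.
    """
    if hotel_id is None or impressions is None:
        return None
    for impression in impressions.split(";"):
        fields = impression.split(",")
        if len(fields) >= 3 and fields[0] == hotel_id:
            return fields[1]
    return None


if __name__ == "__main__":
    print(get_pos("1", "1,3,100.00;2,1,100.00"))  # prints 3
```

Unlike the Java version, this sketch skips malformed entries instead of raising an exception on them, which is a design choice for the example rather than a description of the original behavior.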

page 24

Hive MapReduce

Allows analysis not possible through standard HiveQL
queries.

Can be implemented in any language.

page 25

Hive MapReduce

#!/usr/bin/python

import sys

for line in sys.stdin:
    # Normalize both impression separators to '|', then split.
    impressions = line.strip().replace(';', '|').split('|')
    for impression in impressions:
        fields = impression.split(',')
        # Emit tab-separated hotel id and position for Hive to read back.
        print("%s\t%s" % (fields[0], fields[1]))

hive> ADD FILE /home/jseidman/parse_impressions.py;
hive> FROM
    >   hotel_searches
    > SELECT
    >   TRANSFORM(impressions)
    > USING
    >   'parse_impressions.py'
    > AS
    >   hotel, pos;
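
A streaming script like this can be exercised locally before it is shipped to the cluster, since all it does is read lines from a stream and write tab-separated lines back. The sketch below wraps the same parsing logic in a function and feeds it a sample line. The function name and the sample data are assumptions for illustration.

```python
import io


def parse_impressions(stream):
    """Yield one 'hotel<TAB>pos' line per impression read from stream."""
    for line in stream:
        for impression in line.strip().replace(";", "|").split("|"):
            fields = impression.split(",")
            # Skip empty or malformed impressions instead of failing.
            if len(fields) >= 2:
                yield "%s\t%s" % (fields[0], fields[1])


if __name__ == "__main__":
    sample = io.StringIO(u"1,3,100.00;2,1,100.00\n")
    for row in parse_impressions(sample):
        print(row)
```

Each output line carries two tab-separated values, which matches the two column names (hotel, pos) given in the AS clause of the query.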

page 26

Processing Web Analytics Logs

Hive provides the infrastructure to support analysis of web
analytics logs stored in Hadoop

Used to support analysis for machine learning tasks, cache
optimization, keyword performance, etc.

page 27

Processing Flow – Step 1

page 28



Importing Prepared Data to Hive

$HIVE_HOME/bin/hive -e "LOAD DATA INPATH
'/output/part-00000' OVERWRITE INTO
TABLE hotel_searches PARTITION(dt='$YEAR-$MONTH-$DAY')"

CREATE TABLE hotel_searches(
  session_id STRING, host STRING, visitors_ip STRING,
  search_date STRING, search_time STRING, dept_date STRING,
  ret_date STRING, destination STRING, location_id STRING,
  number_of_guests INT, number_of_rooms INT,
  impressions STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
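
A daily load like the one above is easy to script. The sketch below only builds the argument list for the `hive -e` call and does not run it. The function name is hypothetical, and the fallback install path `/usr/lib/hive` is an assumption used when `HIVE_HOME` is not set.

```python
import datetime
import os


def build_load_command(data_path, table, day, hive_home=None):
    """Return the argument list for a `hive -e` call that loads one partition."""
    # Assumed fallback location; adjust to wherever Hive is installed.
    hive_home = hive_home or os.environ.get("HIVE_HOME", "/usr/lib/hive")
    statement = (
        "LOAD DATA INPATH '%s' OVERWRITE INTO TABLE %s PARTITION(dt='%s')"
        % (data_path, table, day.strftime("%Y-%m-%d"))
    )
    return [os.path.join(hive_home, "bin", "hive"), "-e", statement]


if __name__ == "__main__":
    command = build_load_command(
        "/output/part-00000", "hotel_searches", datetime.date(2011, 2, 21))
    print(command)
```

Passing the resulting list to `subprocess.check_call` would run the load. Building it separately keeps the date formatting easy to test.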

page 34

Exporting Data from Hive Tables

hive> INSERT OVERWRITE LOCAL DIRECTORY
    > '/tmp/searches.dat'
    > SELECT * FROM hotel_searches;

page 35

Analyzing Prepared Data

Example - Find the Position of Each Booked Hotel in Search Results:

CREATE TABLE positions(
  session_id STRING,
  booked_hotel_id STRING,
  position INT);

INSERT OVERWRITE TABLE
  positions
SELECT
  h.session_id, h.booked_hotel_id, i.position
FROM
  hotel_impressions i JOIN hotel_bookings h
ON
  (h.booked_hotel_id = i.hotel_id and h.session_id = i.session_id);

page 36

Analyzing Prepared Data

Example - Aggregate Booking Position by Location by Day:

CREATE TABLE position_aggregate_by_day(
  location_id STRING,
  booking_date STRING,
  position INT,
  pcount INT);

INSERT OVERWRITE TABLE
  position_aggregate_by_day
SELECT
  h.location_id, h.booking_date, i.position, count(1)
FROM
  hotel_bookings h JOIN hotel_impressions i
ON
  (i.hotel_id = h.booked_hotel_id and i.session_id = h.session_id)
GROUP BY
  h.location_id, h.booking_date, i.position;
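
For readers less familiar with SQL, the join and GROUP BY in this query can be sketched in plain Python over in-memory rows. The column names come from the query; the function name and the sample rows are made up for illustration, and Hive would of course run this as MapReduce jobs rather than in one process.

```python
from collections import Counter


def aggregate_positions(bookings, impressions):
    """Join bookings to impressions, then count rows per group.

    bookings:    dicts with session_id, booked_hotel_id, location_id, booking_date
    impressions: dicts with session_id, hotel_id, position
    """
    # Index impressions by the join key (hotel_id, session_id).
    by_key = {}
    for imp in impressions:
        by_key.setdefault((imp["hotel_id"], imp["session_id"]), []).append(imp)

    counts = Counter()
    for b in bookings:
        for imp in by_key.get((b["booked_hotel_id"], b["session_id"]), []):
            # GROUP BY location_id, booking_date, position; count(1) per group.
            counts[(b["location_id"], b["booking_date"], imp["position"])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample rows.
    bookings = [
        {"session_id": "s1", "booked_hotel_id": "h1",
         "location_id": "CHI", "booking_date": "2011-02-21"},
        {"session_id": "s2", "booked_hotel_id": "h2",
         "location_id": "CHI", "booking_date": "2011-02-21"},
    ]
    impressions = [
        {"session_id": "s1", "hotel_id": "h1", "position": 1},
        {"session_id": "s1", "hotel_id": "h9", "position": 2},
        {"session_id": "s2", "hotel_id": "h2", "position": 1},
    ]
    print(aggregate_positions(bookings, impressions))
```

Both sample bookings were shown in position 1, so they fall into the same group. The unbooked impression for h9 matches no booking and is dropped by the join.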

page 37

Hive vs. Pig

Both are high-level languages that compile to MapReduce jobs, but
HiveQL is SQL-like and declarative, while Pig Latin is a procedural
data-flow scripting language.

Explicit schema (Hive) vs. implicit schema (Pig).

Hive metadata can be accessed by external tools.

page 38

Hive vs. HBase

HBase is a column-oriented key-value store rather than a
SQL-based system.

HBase offers lower latency and random access to data.

Hive/HBase integration was recently released, allowing
Hive queries to be executed over HBase tables.

page 39

Hive – Lessons Learned

Job scheduling – Default Hadoop scheduling is FIFO. Consider using
something like the fair scheduler.

Multi-user Hive – Default install is single user. Multi-user installs
require an external relational store.

set mapred.reduce.tasks=<n> is your friend: it overrides the number
of reduce tasks used for a query.

Migrating Hive between clusters is not fun.

Documentation is still a little sparse.

page 40

References

•  Hadoop project: http://hadoop.apache.org/

•  Hive project: http://hadoop.apache.org/hive/

•  Hive – A Petabyte Scale Data Warehouse Using Hadoop:
http://i.stanford.edu/~ragho/hive-icde2010.pdf

•  Hadoop: The Definitive Guide, Second Edition, Tom White, O’Reilly
Press, 2011

•  Hive Evolution, John Sichi, November 2010:
http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010

page 41
