Introduction to Data Analysis with Hadoop and Hive
                                   Jonathan Seidman
                                         ChicagoDB
                                   February 21, 2011
About Me


•  Lead Engineer on Business Intelligence/Data Infrastructure team at
   Orbitz, former member of Machine Learning team	

•  Co-organizer/founder of Chicago Hadoop User Group
   (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/)	

•  Recovering Java developer	

•  jseidman@orbitz.com	

•  @jseidman	

•  @OrbitzTalent	





                                                                        page 2
Why Hadoop and Hive?	





                          page 3
Some Hadoop “Clichés” (Which are still true…)


Hadoop allows you to store and process data that was
 previously impractical because of cost, technical issues,
 etc.	





                                                             page 4
[Chart: $ per managed TB. Bar label: Utterly redonkulous amounts of money]





                                                                 page 5
[Chart: $ per managed TB. Bar labels: Utterly redonkulous amounts of money; More reasonable amounts of money]





                                                                      page 6
Adding data to our data warehouse also requires a lengthy
 plan/implement/deploy cycle.	



Because of the expense and time involved, our data teams need to be
 very judicious about which data gets added. This means
 that potentially valuable data may not be saved.	





                                                            page 7
Hadoop brings our cost per TB down to $1500 (or even less)	





                                                                page 8
Hadoop Distributed File System

HDFS provides economical, reliable, fault-tolerant, and
 scalable storage of very large datasets across machines in
 a cluster.	





                                                              page 9
Some Hadoop “Clichés” (Which are still true…)


 Hadoop places no constraints on how data is processed.	





                                                             page 10
Some Hadoop “Clichés” (Which are still true…)


 Hadoop makes it relatively easy to efficiently process all the
  data stored in HDFS.	



 MapReduce is a programming model for efficient
  distributed processing. Designed to reliably perform
  computations on large volumes of data in parallel.	



 MapReduce removes much of the burden of writing
  distributed computations.	
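
The programming model described above can be illustrated without a cluster. The following is a minimal in-memory sketch in Python, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the grouping step only imitates what the framework does between the map and reduce stages.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, imitating the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["hive on hadoop", "hadoop at scale"]))
```

On a real cluster the map and reduce calls run in parallel on different machines; only the two small functions are user code, which is the "burden removed" the slide refers to.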




                                                            page 11
The Problem with MapReduce

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in an input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it. args[0] is the input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}




                                                                                                                                                                page 12
Hive Overview

Hive is an open-source data warehousing solution built on top of
  Hadoop which allows for easy data summarization, ad-hoc querying
  and analysis of large datasets stored in Hadoop.	



Developed at Facebook to provide a structured data model over Hadoop
 data.	



Simplifies Hadoop data analysis – users can use a familiar SQL model
  rather than writing low level custom code.	



Hive queries are compiled into Hadoop MapReduce jobs.	



Designed for scalability, not low latency. 	


                                                                       page 13
Hive provides the basis for a new data analysis infrastructure.	



We currently run Hive 0.6.0 with Cloudera CDH2 (Hadoop 0.20.1)	





                                                                     page 14
Hive Architecture (Simplified)




                                 page 15
Hive Overview – Comparison to Traditional DBMS Systems


Although Hive uses a model familiar to database users, it does not
  support a full relational model and only supports a subset of SQL.	



Schema on read vs. schema on write	



What Hadoop/Hive offers is highly scalable and fault-tolerant
 processing of very large data sets.	



However, Hive is moving more and more toward being a parallel
  DBMS.	
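
The schema-on-read point above can be made concrete with a small sketch. This is an illustration in Python, not Hive internals: `SCHEMA` and `read_with_schema` are hypothetical names, and the two columns are borrowed from the `hotel_searches` table shown later in the deck. The raw line is stored untouched, the schema is applied only when the line is read, and a field that does not fit its declared type comes back as a null value instead of causing the load to be rejected.

```python
# Hypothetical two-column schema: (column name, type conversion).
SCHEMA = [("session_id", str), ("number_of_guests", int)]


def read_with_schema(raw_line, schema=SCHEMA, delimiter="\t"):
    """Apply a schema to one raw text line at read time."""
    values = raw_line.rstrip("\n").split(delimiter)
    row = {}
    for index, (name, cast) in enumerate(schema):
        try:
            row[name] = cast(values[index])
        except (IndexError, ValueError):
            # Missing or malformed fields become None (NULL) instead of failing.
            row[name] = None
    return row


print(read_with_schema("abc123\t2\n"))
print(read_with_schema("abc123\ttwo\n"))
```

A schema-on-write system would have rejected the second line when it was loaded; here the problem only surfaces, as a null, when a query reads it.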





                                                                          page 16
Hive - Data Model


Tables – analogous to tables in a standard RDBMS.	



Partitions and buckets – Allow Hive to prune data during
 query processing.	
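
As a rough illustration of partition pruning: each partition of a table is stored in its own directory whose name encodes the partition value, so a query that filters on the partition column only needs to read the matching directories. The sketch below is Python, not Hive code; `prune_partitions` is a hypothetical helper and the warehouse paths are illustrative assumptions.

```python
def prune_partitions(partition_dirs, column, wanted):
    """Keep only directories with a path component equal to column=wanted."""
    token = "%s=%s" % (column, wanted)
    return [d for d in partition_dirs if token in d.rstrip("/").split("/")]


# Illustrative layout for a table partitioned by a dt column.
dirs = [
    "/user/hive/warehouse/hotel_searches/dt=2011-02-19",
    "/user/hive/warehouse/hotel_searches/dt=2011-02-20",
    "/user/hive/warehouse/hotel_searches/dt=2011-02-21",
]

print(prune_partitions(dirs, "dt", "2011-02-21"))
```

With a filter such as `WHERE dt = '2011-02-21'`, only one of the three directories has to be scanned; the other two are never opened.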





                                                           page 17
Not Yet, But Soon


Multiple databases	



Views	



Indexes	





                        page 18
Hive – Data Types

Supports primitive types such as int, double, and string.	



Also supports complex types such as structs, maps (key/value
 tuples), and arrays (indexable lists).	





                                                               page 19
Extensible Storage Model


Row formats determine how records are stored.	



Row format is defined by a SerDe (Serializer-Deserializer).	



Container format is determined by the file format.	
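
To show what a SerDe is responsible for, here is a minimal Python sketch of a delimited-text row format. It is only an analogy: real SerDes are Java classes plugged into Hive, and the class name `DelimitedSerDe` and its methods are hypothetical.

```python
class DelimitedSerDe:
    """Toy row format: one record per line, fields separated by a delimiter."""

    def __init__(self, columns, delimiter="\t"):
        self.columns = columns
        self.delimiter = delimiter

    def deserialize(self, line):
        """Turn one stored line into a column-name -> value mapping."""
        values = line.rstrip("\n").split(self.delimiter)
        return dict(zip(self.columns, values))

    def serialize(self, row):
        """Turn a row mapping back into its stored text form."""
        return self.delimiter.join(str(row[name]) for name in self.columns)


serde = DelimitedSerDe(["hotel", "pos"])
row = serde.deserialize("1\t3\n")
print(row)
print(serde.serialize(row))
```

Swapping in a different class with the same two methods changes how records are laid out without touching the queries, which is the sense in which the storage model is extensible.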





                                                                page 20
Hive – Hive Query Language

HiveQL – Supports basic SQL-like operations such as select, join,
  aggregate, union, sub-queries, etc.	



HiveQL queries are compiled into MapReduce processes.	



Supports embedding custom MapReduce scripts.	



Built-in support for standard relational, arithmetic, and boolean
 operators.	



Supports aggregate functions, including statistical functions (avg,
  standard deviation, covariance, percentiles).	



                                                                      page 21
Hive – User Defined Functions

HiveQL is extensible through user defined functions
 implemented in Java. 	



Also supports aggregation functions.	



Provides table functions when more than one value needs to
  be returned.	





                                                             page 22
Hive – User Defined Functions
Example UDF – Find hotel’s position in an impression list:

package com.orbitz.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * returns hotel_id's position given a hotel_id and impression list
 */
public final class GetPos extends UDF {
    public Text evaluate(final Text hotel_id, final Text impressions) {
        if (hotel_id == null || impressions == null)
            return null;

        String[] hotels = impressions.toString().split(";");
        String position;
        String id = hotel_id.toString();
        int begin = 0, end = 0;

        for (int i = 0; i < hotels.length; i++) {
            begin = hotels[i].indexOf(",");
            end = hotels[i].lastIndexOf(",");
            position = hotels[i].substring(begin + 1, end);
            if (id.equals(hotels[i].substring(0, begin)))
                return new Text(position);
        }
        return null;
    }
}




                                                                              page 23
Hive – User Defined Functions

hive> add jar path-to-jar/pos.jar;
hive> create temporary function getpos as 'com.orbitz.hive.GetPos';
hive> select getpos('1', '1,3,100.00;2,1,100.00');
…
3
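
The lookup performed by the UDF can be checked locally before packaging a jar. The following Python port is a sketch, not part of the original deck: `get_pos` is a hypothetical name, and it assumes each impression has the form `hotel_id,position,price`, as in the example call above. Unlike the Java version, it skips entries with fewer than two commas instead of raising an exception.

```python
def get_pos(hotel_id, impressions):
    """Return the position of hotel_id in a ';'-separated impression list."""
    if hotel_id is None or impressions is None:
        return None
    for entry in impressions.split(";"):
        begin = entry.find(",")
        end = entry.rfind(",")
        if begin == -1 or begin == end:
            continue  # malformed entry: fewer than three fields
        if entry[:begin] == hotel_id:
            # The position is the text between the first and last comma.
            return entry[begin + 1:end]
    return None


print(get_pos("1", "1,3,100.00;2,1,100.00"))  # prints 3
```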




                                            page 24
Hive MapReduce

Allows analysis not possible through standard HiveQL
 queries.	



Can be implemented in any language.	





                                                       page 25
Hive MapReduce

#!/usr/bin/python

import sys

for line in sys.stdin:
    # Treat ';' and '|' as equivalent impression separators.
    line = line.strip().replace(';', '|')
    impressions = line.split('|')
    for impression in impressions:
        fields = impression.split(',')
        if len(fields) < 2:
            continue  # skip empty or malformed impressions
        # Emit hotel id and position as tab-separated columns.
        print("%s\t%s" % (fields[0], fields[1]))

  hive>   ADD FILE /home/jseidman/parse_impressions.py;
  hive>   FROM
      >     hotel_searches
      >   SELECT
      >     TRANSFORM(impressions)
      >   USING
      >     'parse_impressions.py'
      >   AS
      >     hotel, pos;
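
A transform script is easier to debug when its parsing logic can be exercised without Hive. The sketch below pulls the per-line logic of the script above into functions; `parse_impressions` and `to_tsv` are hypothetical names, and the sample input assumes impressions of the form `hotel_id,position,price`.

```python
def parse_impressions(line):
    """Return (hotel, pos) pairs for one input line of impressions."""
    pairs = []
    # ';' and '|' are both accepted as impression separators.
    for impression in line.strip().replace(";", "|").split("|"):
        fields = impression.split(",")
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def to_tsv(pairs):
    """Format pairs the way a transform script writes them: tab-separated columns."""
    return ["%s\t%s" % pair for pair in pairs]


for row in to_tsv(parse_impressions("1,3,100.00;2,1,100.00\n")):
    print(row)
```

Each printed line becomes one output row, and the tab separates the two columns named in the `AS hotel, pos` clause.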





                                                             page 26
Processing Web Analytics Logs

Hive provides the infrastructure to support analysis of web
 analytics logs stored in Hadoop	



Used to support analysis for machine learning tasks, cache
 optimization, keyword performance, etc.	





                                                              page 27
Processing Flow – Step 1




                           page 28
Processing Flow – Step 2




                           page 29
Processing Flow – Step 3




                           page 30
Processing Flow – Step 4




                           page 31
Processing Flow – Step 5




                           page 32
Processing Flow – Step 6




                           page 33
Importing Prepared Data to Hive

$HIVE_HOME/bin/hive -e "LOAD DATA INPATH
  '/output/part-00000' OVERWRITE INTO
  TABLE hotel_searches PARTITION(dt='$YEAR-$MONTH-$DAY')"

CREATE TABLE hotel_searches(
  session_id STRING, host STRING, visitors_ip STRING,
  search_date STRING, search_time STRING, dept_date STRING,
  ret_date STRING, destination STRING, location_id STRING,
  number_of_guests INT, number_of_rooms INT,
  impressions STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
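
The load command above depends on `$YEAR`, `$MONTH`, and `$DAY` being set by the calling shell script. As one possible alternative, the sketch below builds the same command from a date value in Python. `build_load_command` and its parameters are hypothetical; the path and table name are the ones from the slide, and nothing is executed.

```python
import datetime


def build_load_command(day, source="/output/part-00000",
                       table="hotel_searches", hive_bin="hive"):
    """Build the argument list for a `hive -e` call that loads one daily partition."""
    statement = (
        "LOAD DATA INPATH '%s' OVERWRITE INTO TABLE %s PARTITION(dt='%s')"
        % (source, table, day.strftime("%Y-%m-%d"))
    )
    return [hive_bin, "-e", statement]


command = build_load_command(datetime.date(2011, 2, 21))
print(command[2])
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where the Hive CLI is installed.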



                                                             page 34
Exporting Data from Hive Tables


hive> INSERT OVERWRITE LOCAL DIRECTORY
    > '/tmp/searches.dat'
    > SELECT * FROM hotel_searches;
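
Note that the export target is a directory, despite the `.dat` name, and Hive writes one or more plain-text files into it. The following sketch reads such an export back in Python. It assumes the files use Hive's default Ctrl-A (`\x01`) field delimiter; `read_exported_rows` is a hypothetical helper.

```python
import glob
import os


def read_exported_rows(directory, delimiter="\x01"):
    """Read every data file in an export directory into lists of string fields."""
    rows = []
    # The "*" pattern skips dot-files such as checksum files.
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                rows.append(line.rstrip("\n").split(delimiter))
    return rows
```

For the export above this would be called as `read_exported_rows("/tmp/searches.dat")`. All values come back as strings, so numeric columns such as `number_of_guests` still need converting.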




                                           page 35
Analyzing Prepared Data


Example - Find the Position of Each Booked Hotel in Search Results:	


   CREATE TABLE positions(
     session_id STRING,
     booked_hotel_id STRING,
     position INT);

   INSERT OVERWRITE TABLE
     positions
   SELECT
     h.session_id, h.booked_hotel_id, i.position
   FROM
     hotel_impressions i JOIN hotel_bookings h
   ON
     (h.booked_hotel_id = i.hotel_id and h.session_id = i.session_id);
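
For readers less familiar with SQL joins, the query above can be mirrored by a small in-memory sketch in Python. The function name `booked_positions` and the dictionary-based rows are assumptions for illustration; the column names follow the query.

```python
from collections import defaultdict


def booked_positions(impressions, bookings):
    """Join bookings to impressions on (session_id, hotel id), like the query above."""
    by_key = defaultdict(list)
    for imp in impressions:
        by_key[(imp["session_id"], imp["hotel_id"])].append(imp["position"])

    rows = []
    for booking in bookings:
        key = (booking["session_id"], booking["booked_hotel_id"])
        # One output row per matching impression, as an inner join would produce.
        for position in by_key.get(key, []):
            rows.append((booking["session_id"], booking["booked_hotel_id"], position))
    return rows


impressions = [
    {"session_id": "s1", "hotel_id": "1", "position": 3},
    {"session_id": "s1", "hotel_id": "2", "position": 1},
]
bookings = [{"session_id": "s1", "booked_hotel_id": "2"}]
print(booked_positions(impressions, bookings))
```

Hive performs the same matching as a MapReduce job, with the join key deciding which reducer sees each row.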




                                                                              page 36
Analyzing Prepared Data


Example - Aggregate Booking Position by Location by Day:	

   CREATE TABLE position_aggregate_by_day(
     location_id STRING,
     booking_date STRING,
     position INT,
     pcount INT);

   INSERT OVERWRITE TABLE
     position_aggregate_by_day
   SELECT
     h.location_id, h.booking_date, i.position, count(1)
   FROM
     hotel_bookings h JOIN hotel_impressions i
   ON
     (i.hotel_id = h.booked_hotel_id and i.session_id = h.session_id)
   GROUP BY
     h.location_id, h.booking_date, i.position;
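
The `GROUP BY ... count(1)` step above amounts to counting how often each (location, date, position) combination occurs in the joined rows. A minimal Python sketch of that step, using a hypothetical `aggregate_positions` function and made-up sample rows:

```python
from collections import Counter


def aggregate_positions(joined_rows):
    """Count bookings per (location_id, booking_date, position)."""
    counts = Counter(
        (row["location_id"], row["booking_date"], row["position"])
        for row in joined_rows
    )
    # Return (location_id, booking_date, position, pcount) tuples in sorted order.
    return [key + (count,) for key, count in sorted(counts.items())]


sample = [
    {"location_id": "CHI", "booking_date": "2011-02-21", "position": 1},
    {"location_id": "CHI", "booking_date": "2011-02-21", "position": 1},
    {"location_id": "CHI", "booking_date": "2011-02-21", "position": 3},
]
print(aggregate_positions(sample))
```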




                                                                        page 37
Hive vs. Pig


Both are high-level languages that compile to MapReduce jobs, but
 HiveQL is declarative and SQL-like, while Pig Latin is a procedural
 data-flow scripting language.	



Explicit schema vs. implicit schema.	



Hive metadata can be accessed by external tools.	





                                                               page 38
Hive vs. HBase


HBase is a column-oriented key-value store, as opposed to
 Hive's SQL model.	



HBase offers lower latency and random access to data.	



Hive/HBase integration was recently released, allowing
 Hive queries to be executed over HBase tables.	





                                                           page 39
Hive – Lessons Learned


Job scheduling – Default Hadoop scheduling is FIFO. Consider using
  something like the fair scheduler.	



Multi-user Hive – Default install is single user. Multi-user installs
 require an external relational store.	



set mapred.reduce.tasks=<n> is your friend for controlling the number of reducers a query uses.	



Migrating Hive between clusters is not fun.	



Documentation is still a little sparse.	




                                                                        page 40
References


•  Hadoop project: http://hadoop.apache.org/	

•  Hive project: http://hadoop.apache.org/hive/	

•  Hive – A Petabyte Scale Data Warehouse Using Hadoop:
   http://i.stanford.edu/~ragho/hive-icde2010.pdf	

•  Hadoop The Definitive Guide, Second Edition, Tom White, O’Reilly
   Press, 2011	

•  Hive Evolution, John Sichi, November 2010:
   http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010	





                                                                     page 41

Contenu connexe

En vedette

Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013
SATOSHI TAGOMORI
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Jonathan Seidman
 
HW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopHW09 Social network analysis with Hadoop
HW09 Social network analysis with Hadoop
Cloudera, Inc.
 
Analyse des médias étrangers CNN vs CCTV
Analyse des médias étrangers CNN vs CCTVAnalyse des médias étrangers CNN vs CCTV
Analyse des médias étrangers CNN vs CCTV
Ninou Haiko
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1
Vimal Suthar
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock Analysis
Vaibhav Jain
 

En vedette (20)

Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Video Analysis in Hadoop
Video Analysis in HadoopVideo Analysis in Hadoop
Video Analysis in Hadoop
 
Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
HW09 Social network analysis with Hadoop
HW09 Social network analysis with HadoopHW09 Social network analysis with Hadoop
HW09 Social network analysis with Hadoop
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Large-scale social media analysis with Hadoop
Large-scale social media analysis with HadoopLarge-scale social media analysis with Hadoop
Large-scale social media analysis with Hadoop
 
Analyse des médias étrangers CNN vs CCTV
Analyse des médias étrangers CNN vs CCTVAnalyse des médias étrangers CNN vs CCTV
Analyse des médias étrangers CNN vs CCTV
 
IBM : Gouvernance de l\'Information - Principes &amp; Mise en oeuvre
IBM : Gouvernance de l\'Information - Principes &amp; Mise en oeuvreIBM : Gouvernance de l\'Information - Principes &amp; Mise en oeuvre
IBM : Gouvernance de l\'Information - Principes &amp; Mise en oeuvre
 
Intelligent Video Surveillance with Cloud Computing
Intelligent Video Surveillance with Cloud ComputingIntelligent Video Surveillance with Cloud Computing
Intelligent Video Surveillance with Cloud Computing
 
New trends in video analytics and surveillance systems for the mining industry
New trends in video analytics and surveillance systems for the mining industryNew trends in video analytics and surveillance systems for the mining industry
New trends in video analytics and surveillance systems for the mining industry
 
An Introduction to Video Analytics
An Introduction to Video Analytics An Introduction to Video Analytics
An Introduction to Video Analytics
 
Video Analytics on Hadoop webinar victor fang-201309
Video Analytics on Hadoop webinar victor fang-201309Video Analytics on Hadoop webinar victor fang-201309
Video Analytics on Hadoop webinar victor fang-201309
 
Hadoop data analysis
Hadoop data analysisHadoop data analysis
Hadoop data analysis
 
Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Traffic data analysis using HADOOP
Traffic data analysis using HADOOPTraffic data analysis using HADOOP
Traffic data analysis using HADOOP
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using Hive
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock Analysis
 

Similaire à Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011

Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
Jesus Rodriguez
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptx
iaeronlineexm
 

Similaire à Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011 (20)

מיכאל
מיכאלמיכאל
מיכאל
 
Windows Azure HDInsight Service
Windows Azure HDInsight ServiceWindows Azure HDInsight Service
Windows Azure HDInsight Service
 
Hive with HDInsight
Hive with HDInsightHive with HDInsight
Hive with HDInsight
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive Analytics
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptx
 

Plus de Jonathan Seidman

Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Jonathan Seidman
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
Jonathan Seidman
 

Plus de Jonathan Seidman (13)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011

  • 1. Introduction to Data Analysis with Hadoop and Hive Jonathan Seidman ChicagoDB February 21 | 2011
  • 2. About Me
      •  Lead Engineer on Business Intelligence/Data Infrastructure team at Orbitz, former member of Machine Learning team
      •  Co-organizer/founder of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/)
      •  Recovering Java developer
      •  jseidman@orbitz.com
      •  @jseidman
      •  @OrbitzTalent
  • 3. Why Hadoop and Hive?
  • 4. Some Hadoop “Clichés” (Which are still true…) Hadoop allows you to store and process data that was previously impractical because of cost, technical issues, etc.
  • 5. [Chart: $ per managed TB] Utterly redonkulous amounts of money.
  • 6. [Chart: $ per managed TB] Utterly redonkulous amounts of money vs. more reasonable amounts of money.
  • 7. Adding data to our data warehouse also requires a lengthy plan/implement/deploy cycle. Because of the expense and time, our data teams need to be very judicious about which data gets added. This means that potentially valuable data may not be saved.
  • 8. Hadoop brings our cost per TB down to $1500 (or even less).
  • 9. Hadoop Distributed File System: HDFS provides economical, reliable, fault-tolerant and scalable storage of very large datasets across machines in a cluster.
  • 10. Some Hadoop “Clichés” (Which are still true…) Hadoop places no constraints on how data is processed.
  • 11. Some Hadoop “Clichés” (Which are still true…) Hadoop makes it relatively easy to efficiently process all the data stored in HDFS. MapReduce is a programming model for efficient distributed processing, designed to reliably perform computations on large volumes of data in parallel. MapReduce removes much of the burden of writing distributed computations.
  • 12. The Problem with MapReduce
      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.*;
      import org.apache.hadoop.util.*;

      public class WordCount {

          public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                  String line = value.toString();
                  StringTokenizer tokenizer = new StringTokenizer(line);
                  while (tokenizer.hasMoreTokens()) {
                      word.set(tokenizer.nextToken());
                      output.collect(word, one);
                  }
              }
          }

          public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
              public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                  int sum = 0;
                  while (values.hasNext()) {
                      sum += values.next().get();
                  }
                  output.collect(key, new IntWritable(sum));
              }
          }

          public static void main(String[] args) throws Exception {
              JobConf conf = new JobConf(WordCount.class);
              conf.setJobName("wordcount");

              conf.setOutputKeyClass(Text.class);
              conf.setOutputValueClass(IntWritable.class);

              conf.setMapperClass(Map.class);
              conf.setCombinerClass(Reduce.class);
              conf.setReducerClass(Reduce.class);

              conf.setInputFormat(TextInputFormat.class);
              conf.setOutputFormat(TextOutputFormat.class);

              FileInputFormat.setInputPaths(conf, new Path(args[0]));
              FileOutputFormat.setOutputPath(conf, new Path(args[1]));

              JobClient.runJob(conf);
          }
      }
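To make the three phases behind the WordCount job easier to see, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the grouping step only imitates what the framework does between map and reduce on a real cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, imitating the framework's sort/shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in-process over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello hadoop", "hello hive"]))
```

The logic itself is a handful of lines; most of the Java listing is job configuration and type plumbing, which is the verbosity the slide title points at.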
  • 13. Hive Overview: Hive is an open-source data warehousing solution built on top of Hadoop which allows for easy data summarization, ad-hoc querying and analysis of large datasets stored in Hadoop. Developed at Facebook to provide a structured data model over Hadoop data. Simplifies Hadoop data analysis – users can use a familiar SQL model rather than writing low-level custom code. Hive queries are compiled into Hadoop MapReduce jobs. Designed for scalability, not low latency.
  • 14. Hive provides the basis for a new data analysis infrastructure. We currently run Hive 0.6.0 with Cloudera CDH2 (Hadoop 0.20.1).
  • 16. Hive Overview – Comparison to Traditional DBMS Systems: Although Hive uses a model familiar to database users, it does not support a full relational model and only supports a subset of SQL. Schema on read vs. schema on write. What Hadoop/Hive offers is highly scalable and fault-tolerant processing of very large data sets. However, Hive is moving more and more towards being a parallel DBMS.
  • 17. Hive - Data Model: Tables – analogous to tables in a standard RDBMS. Partitions and buckets – allow Hive to prune data during query processing.
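As a rough illustration of why partitions and buckets let a query skip data, the sketch below models a table as a dictionary keyed by partition value. Everything here is hypothetical: the sample rows, the `scan` and `bucket_for` helpers, and the use of CRC32 as a stand-in hash (Hive uses its own hash function).

```python
import zlib

# Hypothetical in-memory layout: one entry per partition (think dt=... directory).
PARTITIONS = {
    "2011-02-19": [("s1", "chicago"), ("s2", "denver")],
    "2011-02-20": [("s3", "chicago")],
    "2011-02-21": [("s4", "boston"), ("s5", "chicago")],
}


def scan(partitions, dt=None):
    """Return (rows, partitions_read).

    With a dt filter only the matching partition is read; the others are
    never touched, which is the pruning effect.
    """
    selected = partitions if dt is None else {dt: partitions.get(dt, [])}
    rows = [row for part in selected.values() for row in part]
    return rows, len(selected)


def bucket_for(key, num_buckets):
    """Assign a row to a bucket by hashing its key (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


if __name__ == "__main__":
    rows, parts_read = scan(PARTITIONS, dt="2011-02-20")
    print(rows, parts_read)
```

A filter on the partition column reads one partition instead of three; bucketing applies the same idea within a partition, since rows with the same key always land in the same bucket.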
  • 18. Not Yet, But Soon: multiple databases, views, indexes.
  • 19. Hive – Data Types: Supports primitive types such as int, double, and string. Also supports complex types such as structs, maps (key/value tuples), and arrays (indexable lists).
  • 20. Extensible Storage Model: Row formats determine how records are stored. Row format is defined by a SerDe (Serializer-Deserializer). Container format is determined by the file format.
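The role of a SerDe can be sketched as a pair of functions that convert between a stored line and a typed row. This is only an analogy in Python, not Hive's SerDe interface; the `COLUMNS` list (a subset of column names borrowed from the `hotel_searches` table later in the deck) and the two helper functions are assumptions made for the example.

```python
FIELD_DELIMITER = "\t"

# Hypothetical schema, applied when a line is read (schema on read).
COLUMNS = [
    ("session_id", str),
    ("number_of_guests", int),
    ("number_of_rooms", int),
]


def deserialize(line):
    """Turn one stored text line into a typed row (a dict keyed by column name)."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    return {name: cast(value) for (name, cast), value in zip(COLUMNS, values)}


def serialize(row):
    """Turn a typed row back into the delimited text representation."""
    return FIELD_DELIMITER.join(str(row[name]) for name, _ in COLUMNS)


if __name__ == "__main__":
    row = deserialize("abc\t2\t1\n")
    print(row)
    print(repr(serialize(row)))
```

The stored file stays plain delimited text; the types are imposed only when a row is read, which is the schema-on-read idea mentioned earlier.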
  • 21. Hive – Hive Query Language: HiveQL supports basic SQL-like operations such as select, join, aggregate, union, sub-queries, etc. HiveQL queries are compiled into MapReduce processes. Supports embedding custom MapReduce scripts. Built-in support for standard relational, arithmetic, and boolean operators. Supports aggregate functions, including statistical functions (avg, standard deviation, covariance, percentiles).
  • 22. Hive – User Defined Functions: HiveQL is extensible through user defined functions implemented in Java. Also supports user defined aggregation functions. Provides table functions when more than one value needs to be returned.
  • 23. Hive – User Defined Functions. Example UDF – Find hotel’s position in an impression list:
      package com.orbitz.hive;

      import org.apache.hadoop.hive.ql.exec.UDF;
      import org.apache.hadoop.io.Text;

      /**
       * returns hotel_id's position given a hotel_id and impression list
       */
      public final class GetPos extends UDF {
          public Text evaluate(final Text hotel_id, final Text impressions) {
              if (hotel_id == null || impressions == null)
                  return null;
              String[] hotels = impressions.toString().split(";");
              String position;
              String id = hotel_id.toString();
              int begin=0, end=0;
              for (int i=0; i<hotels.length; i++) {
                  begin = hotels[i].indexOf(",");
                  end = hotels[i].lastIndexOf(",");
                  position = hotels[i].substring(begin+1,end);
                  if (id.equals(hotels[i].substring(0,begin)))
                      return new Text(position);
              }
              return null;
          }
      }
  • 24. Hive – User Defined Functions
      hive> add jar path-to-jar/pos.jar;
      hive> create temporary function getpos as 'com.orbitz.hive.GetPos';
      hive> select getpos('1', '1,3,100.00;2,1,100.00');
      …
      3
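To check what the UDF returns without a Hive session, its lookup logic can be ported to a few lines of Python. This is a sketch for local experimentation, not part of the original deck: `get_pos` is a made-up name, and the impression layout `hotel_id,position,price` separated by `;` is inferred from the sample call above.

```python
def get_pos(hotel_id, impressions):
    """Return the position of hotel_id in a ';'-separated impression list.

    Each impression is assumed to look like 'hotel_id,position,price'.
    Returns None when either argument is None or the hotel is not listed.
    """
    if hotel_id is None or impressions is None:
        return None
    for impression in impressions.split(";"):
        begin = impression.find(",")
        end = impression.rfind(",")
        if begin == -1:
            # Entry without any comma: skipped here, a deliberate difference
            # from the Java version, whose substring call would throw.
            continue
        if impression[:begin] == hotel_id:
            return impression[begin + 1:end]
    return None


if __name__ == "__main__":
    print(get_pos("1", "1,3,100.00;2,1,100.00"))
```

For the sample call from the Hive session, hotel `1` is found in the first impression and the text between its first and last comma, `3`, is returned.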
  • 25. Hive MapReduce: Allows analysis not possible through standard HiveQL queries. Can be implemented in any language.
  • 26. Hive MapReduce
      #!/usr/bin/python
      import sys

      # Hive streams each input row to stdin; emit one tab-separated
      # (hotel, pos) line per impression on stdout.
      for line in sys.stdin:
          line = line.strip().replace(';', '|')
          impressions = line.split('|')
          for impression in impressions:
              fields = impression.split(',')
              print("%s\t%s" % (fields[0], fields[1]))

      hive> ADD FILE /home/jseidman/parse_impressions.py;
      hive> FROM
          >   hotel_searches
          > SELECT
          >   TRANSFORM(impressions)
          > USING
          >   'parse_impressions.py'
          > AS
          >   hotel, pos;
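A transform script like this can be exercised locally before it is handed to Hive, because the contract is just lines in on stdin and tab-separated lines out on stdout. The sketch below wraps the same parsing in a function and feeds it in-memory streams; `transform` and `run_locally` are hypothetical helper names, and skipping entries with fewer than two fields is an added safeguard rather than behaviour of the original script.

```python
import io


def transform(stdin, stdout):
    """Read impression lists line by line and write 'hotel<TAB>pos' lines."""
    for line in stdin:
        for impression in line.strip().replace(";", "|").split("|"):
            fields = impression.split(",")
            if len(fields) >= 2:  # added safeguard: ignore malformed entries
                stdout.write("%s\t%s\n" % (fields[0], fields[1]))


def run_locally(rows):
    """Feed sample rows through transform() using in-memory streams."""
    out = io.StringIO()
    transform(io.StringIO("".join(row + "\n" for row in rows)), out)
    return out.getvalue().splitlines()


if __name__ == "__main__":
    for emitted in run_locally(["1,3,100.00;2,1,100.00"]):
        print(emitted)
```

Each impression in an input row becomes its own output row, so one search with two impressions yields two `(hotel, pos)` records.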
  • 27. Processing Web Analytics Logs: Hive provides the infrastructure to support analysis of web analytics logs stored in Hadoop. Used to support analysis for machine learning tasks, cache optimization, keyword performance, etc.
  • 28. Processing Flow – Step 1
  • 29. Processing Flow – Step 2
  • 30. Processing Flow – Step 3
  • 31. Processing Flow – Step 4
  • 32. Processing Flow – Step 5
  • 33. Processing Flow – Step 6
  • 34. Importing Prepared Data to Hive
      $HIVE_HOME/bin/hive -e "LOAD DATA INPATH
          '/output/part-00000' OVERWRITE INTO
          TABLE hotel_searches PARTITION(dt='$YEAR-$MONTH-$DAY')"

      CREATE TABLE hotel_searches(
          session_id STRING, host STRING, visitors_ip STRING, search_date STRING,
          search_time STRING, dept_date STRING, ret_date STRING, destination STRING,
          location_id STRING, number_of_guests INT, number_of_rooms INT,
          impressions STRING)
      PARTITIONED BY (dt STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;
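When the daily load is driven from a script rather than typed by hand, the statement and its partition value can be assembled programmatically. The following is a minimal sketch under stated assumptions: the helper names and the default `hive_home` path are hypothetical, and only the `hive -e` invocation, the input path and the table name come from the slide.

```python
from datetime import date


def build_load_statement(day, input_path="/output/part-00000",
                         table="hotel_searches"):
    """Build the LOAD DATA statement for one day's partition (dt=YYYY-MM-DD)."""
    return ("LOAD DATA INPATH '%s' OVERWRITE INTO TABLE %s PARTITION(dt='%s')"
            % (input_path, table, day.strftime("%Y-%m-%d")))


def build_hive_command(day, hive_home="/usr/lib/hive"):
    """Return an argument list for `hive -e`; hive_home is a placeholder path."""
    return [hive_home + "/bin/hive", "-e", build_load_statement(day)]


if __name__ == "__main__":
    print(build_hive_command(date(2011, 2, 21)))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where Hive is installed; passing the statement as a single argument avoids the shell quoting that the one-line version depends on.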
  • 35. Exporting Data from Hive Tables
      hive> INSERT OVERWRITE LOCAL DIRECTORY
          > '/tmp/searches.dat'
          > SELECT * FROM hotel_searches;
  • 36. Analyzing Prepared Data. Example - Find the Position of Each Booked Hotel in Search Results:
      CREATE TABLE positions(
          session_id STRING,
          booked_hotel_id STRING,
          position INT);

      INSERT OVERWRITE TABLE
          positions
      SELECT
          h.session_id, h.booked_hotel_id, i.position
      FROM
          hotel_impressions i JOIN hotel_bookings h
      ON
          (h.booked_hotel_id = i.hotel_id and h.session_id = i.session_id);
  • 37. Analyzing Prepared Data. Example - Aggregate Booking Position by Location by Day:
      CREATE TABLE position_aggregate_by_day(
          location_id STRING,
          booking_date STRING,
          position INT,
          pcount INT);

      INSERT OVERWRITE TABLE
          position_aggregate_by_day
      SELECT
          h.location_id, h.booking_date, i.position, count(1)
      FROM
          hotel_bookings h JOIN hotel_impressions i
      ON
          (i.hotel_id = h.booked_hotel_id and i.session_id = h.session_id)
      GROUP BY
          h.location_id, h.booking_date, i.position;
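The join and aggregation in the two queries above can be traced on a few rows in plain Python. This sketch uses invented sample data and helper names, and it assumes each (session, hotel) pair appears at most once in the impressions, which a real join would not require.

```python
from collections import Counter

# Invented sample rows: (session_id, hotel_id, position)
impressions = [
    ("s1", "h1", 1), ("s1", "h2", 2), ("s2", "h2", 1), ("s3", "h2", 1),
]
# Invented sample rows: (session_id, booked_hotel_id, location_id, booking_date)
bookings = [
    ("s1", "h2", "chi", "2011-02-20"),
    ("s2", "h2", "chi", "2011-02-20"),
    ("s3", "h2", "chi", "2011-02-21"),
]


def booked_positions(bookings, impressions):
    """Join bookings to impressions on (session_id, hotel_id)."""
    index = {(session, hotel): pos for session, hotel, pos in impressions}
    for session_id, hotel_id, location_id, booking_date in bookings:
        key = (session_id, hotel_id)
        if key in index:  # inner join: unmatched bookings are dropped
            yield session_id, hotel_id, location_id, booking_date, index[key]


def aggregate_by_day(bookings, impressions):
    """Count bookings per (location_id, booking_date, position)."""
    return Counter(
        (loc, day, pos)
        for _, _, loc, day, pos in booked_positions(bookings, impressions)
    )


if __name__ == "__main__":
    for key, pcount in sorted(aggregate_by_day(bookings, impressions).items()):
        print(key, pcount)
```

Each counter key corresponds to one output row of `position_aggregate_by_day`, with the count playing the role of `pcount`.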
  • 38. Hive vs. Pig: Both are declarative languages, but Hive is SQL-like, while Pig is a scripting language. Explicit schema vs. implicit schema. Hive metadata can be accessed by external tools.
  • 39. Hive vs. HBase: HBase is a column-based key-value store as opposed to an SQL model. HBase offers lower latency and random access to data. Hive/HBase integration was recently released, allowing Hive queries to be executed over HBase tables.
  • 40. Hive – Lessons Learned: Job scheduling – default Hadoop scheduling is FIFO. Consider using something like the fair scheduler. Multi-user Hive – default install is single user. Multi-user installs require an external relational store. "set mapred.reduce.tasks" is your friend. Migrating Hive between clusters is not fun. Documentation is still a little sparse.
  • 41. References
      •  Hadoop project: http://hadoop.apache.org/
      •  Hive project: http://hadoop.apache.org/hive/
      •  Hive – A Petabyte Scale Data Warehouse Using Hadoop: http://i.stanford.edu/~ragho/hive-icde2010.pdf
      •  Hadoop: The Definitive Guide, Second Edition, Tom White, O’Reilly Press, 2011
      •  Hive Evolution, John Sichi, November 2010: http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010