Indic threads pune12-comparing hadoop data storage

Comparing Hadoop Data Storage
(HDFS, HBase, Hive and Pig)

Rakesh Jadhav
SAS

Agenda

• Hadoop Ecosystem
• HDFS
• HBase
• Hive
• Pig

Hadoop Ecosystem Components
• HDFS: Hadoop Distributed File System
• MapReduce: Hadoop Distributed Programming Paradigm
• HBase: Hadoop Column-Oriented Database for Random-Access Read/Write of Smaller Data
• Hive: Hadoop Petabyte-Scalable Data Warehousing Infrastructure
• Pig: Hadoop Data Flow/Analysis Infrastructure
• ZooKeeper: Hadoop Coordination and Configuration Service Infrastructure
• Chukwa: Hadoop Monitoring Service
• Avro: Hadoop Data Serialization/Deserialization Infrastructure
• Mahout: Hadoop Scalable Machine Learning Library

HDFS (Data Storage)
Design Features

• Failure Is the Norm
• Designed for Large Datasets Rather Than Small Ones
• Designed for Batch Processing Rather Than Interactive Use
• Supports Write Once, Read Many
• Provides Interfaces to Move Processing Closer to Data

HDFS

APPLICATION AREAS
• Large Log Processing
• Web search indexing
LIMITATIONS
• Small Files Problem
• Single Point of Failure (NameNode)
• No Low-Latency Random Access
• No Update Support (files are write-once)
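
The small files problem is mostly arithmetic: the NameNode holds metadata for every file and block in memory. A rough sketch, assuming the commonly quoted rule of thumb of about 150 bytes per file or block object and a 128 MB block size (both are assumptions for illustration, not figures from these slides):

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb per file/block object
BLOCK_SIZE = 128 * 1024 ** 2    # assumed HDFS block size (128 MB)

def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimate NameNode memory used by num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    # one metadata object for the file itself plus one per block
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

one_tb = 1024 ** 4
small = namenode_bytes(one_tb // 1024 ** 2, 1024 ** 2)   # 1 TB stored as 1 MB files
large = namenode_bytes(one_tb // 1024 ** 3, 1024 ** 3)   # 1 TB stored as 1 GB files
print(small, large, round(small / large))
```

Under these assumptions the same terabyte costs the NameNode a few hundred times more memory when it is stored as 1 MB files than as 1 GB files.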

HBase (Data Storage)
Design Features
• Key-Value Store (Like Map)
• Semi Structured Data
• Column Family, Time Stamp
• Key = RowKey.ColumnFamily.ColumnName.TimeStamp
• De-normalized Data
• Faster Data Retrieval Using Column Families
• Static Column Families, Dynamic Columns

RDBMS v/s HBase: Example
RDBMS
ID  Name  Age  Birth-Place  Marital Status  Location  Weight  Employer
1   Sam   35   Mumbai       Married         Pune      76      XYZ
2   Bob   56   Chicago      Married         New York  79      PQR
HBase
Row Key  Personal Information (Column Family)       Other Information (Column Family)
1        Name: T1=Sam                               Weight: T2=76, T1=65
         Age: T2=35, T1=25                          Location: T2=Pune, T1=Mumbai
         Birth-Place: T1=Mumbai                     Employer: T1=XYZ
         Marital Status: T2=Married, T1=Unmarried
2        …                                          …
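
The key layout above (row key, column family, column name, timestamp) can be sketched as a nested map. This is a toy in-memory model for illustration only, not the HBase client API; the class name `TinyTable` and the family names `personal` and `other` are hypothetical.

```python
from collections import defaultdict

class TinyTable:
    """Toy in-memory model of an HBase-style table (not the HBase API)."""

    def __init__(self, column_families):
        self.column_families = set(column_families)   # static, fixed at creation
        # (row, family, column) -> {timestamp: value}; columns are created on write
        self.cells = defaultdict(dict)

    def put(self, row, family, column, timestamp, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.cells[(row, family, column)][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        versions = self.cells.get((row, family, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)   # default to the latest version
        return versions.get(timestamp)

table = TinyTable(["personal", "other"])
table.put("1", "personal", "Age", 1, 25)          # Age at T1
table.put("1", "personal", "Age", 2, 35)          # Age at T2
table.put("1", "other", "Location", 1, "Mumbai")
table.put("1", "other", "Location", 2, "Pune")
print(table.get("1", "personal", "Age"), table.get("1", "personal", "Age", timestamp=1))
```

Reading without a timestamp returns the newest version of a cell, while older versions stay addressable by timestamp; writing to an undeclared column family fails, but new columns inside a declared family need no schema change.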

HBase: Application Areas

• Applications that need to store, access, and search data by key
• Applications that need fast random access and updates on scalable structured data
• Applications needing a flexible table schema
• Applications needing range-search capabilities supported by key ordering
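
Range search follows directly from key ordering: once row keys are sorted, a scan is two binary searches and a slice. A minimal sketch of the idea, with hypothetical row keys and a hypothetical helper `range_scan` (this is not how an HBase scan is invoked):

```python
import bisect

# Hypothetical row keys, kept in sorted order as an ordered key-value store would.
row_keys = sorted(["user#0007", "user#0012", "user#0100", "order#0001", "user#0042"])

def range_scan(keys, start, stop):
    """Return keys with start <= key < stop using binary search on sorted keys."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return keys[lo:hi]

users_up_to_50 = range_scan(row_keys, "user#0000", "user#0050")
print(users_up_to_50)
```

Keys compare as strings, which is why the numeric part is zero-padded in this sketch; without padding, lexical order and numeric order would disagree.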

HBase: Limitations

• Expensive Full Row Read
• No Secondary Keys
• No SQL Support
• Not Efficient for Big Cell Values

Hive (Data Access)
Design Features

• Scalable data warehouse on top of Hadoop
developed by Facebook
• SQL-like query language (HiveQL)
• Limited JDBC support
• Support for rich data types
• Ability to insert custom map-reduce jobs

Hive: Application Areas

• Ad hoc analysis of huge structured datasets with no low-latency requirement
• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (e.g., Google Analytics)
• Predictive modeling, hypothesis testing

Hive: Limitations

• No Support To Update Data
• Only Bulk Load Support
• Not Efficient For Small Data

Hive: Example

• CREATE TABLE employee (id BIGINT, name STRING, age INT…) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
• LOAD DATA LOCAL INPATH '/sas/employee.txt' OVERWRITE INTO TABLE employee;
• INSERT OVERWRITE TABLE oldest_employee SELECT * FROM employee SORT BY age DESC LIMIT 100;
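
For readers new to HiveQL, the final statement amounts to "keep the rows with the largest age". A small Python sketch of that result on in-memory rows; the rows for Ann and Raj are invented for illustration, and this says nothing about how Hive executes the query as MapReduce jobs:

```python
import heapq

# Rows shaped like the (id, name, age) columns of the employee table.
# Sam and Bob come from the earlier example; Ann and Raj are hypothetical.
employees = [(1, "Sam", 35), (2, "Bob", 56), (3, "Ann", 41), (4, "Raj", 29)]

def oldest(rows, limit):
    """Rows with the largest age first, truncated to `limit` rows."""
    return heapq.nlargest(limit, rows, key=lambda row: row[2])

oldest_employee = oldest(employees, 2)
print(oldest_employee)
```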

Pig (Data Access)

• Pig Latin: a high-level data flow language
• Client-side library; no server-side deployment needed
• Batch processing of large unstructured data
• Procedural language
• Runtime schema creation, checkpointing, support for splits in the pipeline
• Custom code support
• Rich data types
• Support for joins

Pig: Application Areas

• Extract Transform Load (ETL)
• Unstructured Data Analysis

Pig: Limitations

• Not efficient for processing small datasets

Pig: Example

Load employee data from a text file, filter it by age and joining year, group it by joining year, and report the maximum age in each group.
records = LOAD 'sas/input/files/employee.txt'
    AS (joiningYear:int, employeeId:int, age:int);
-- keep employees older than 30 who joined between 2000 and 2012
filtered_records = FILTER records BY age > 30 AND
    (joiningYear >= 2000 AND joiningYear <= 2012);
grouped_records = GROUP filtered_records BY joiningYear;
-- maximum age per joining year
max_age = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.age);
DUMP max_age;
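
The same load, filter, group, and aggregate flow can be traced in plain Python to see what each relation holds. The records below are invented, and the year filter is read as a range check from 2000 to 2012, which is an assumption about the intent of the script:

```python
from collections import defaultdict

# Hypothetical (joiningYear, employeeId, age) records.
records = [(1998, 1, 52), (2005, 2, 44), (2005, 3, 38), (2010, 4, 28), (2011, 5, 33)]

# FILTER: older than 30 and joined between 2000 and 2012 (assumed intent)
filtered = [r for r in records if r[2] > 30 and 2000 <= r[0] <= 2012]

# GROUP BY joiningYear: collect the ages seen in each year
grouped = defaultdict(list)
for year, _employee_id, age in filtered:
    grouped[year].append(age)

# FOREACH ... GENERATE group, MAX(age)
max_age = {year: max(ages) for year, ages in grouped.items()}
print(max_age)
```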

Conclusion

Organizations
•Revisit data strategy
•Evaluate Hadoop Ecosystem
•Build economical, scalable solutions for Big Data problems
