HCatalog & Templeton

Youngwoo Kim (brandon.kim@nexr.com, kt.com)
Daegeun Kim (dani.kim@geekple.com)
Data Analytics Platform, KTCloudware (NexR)

Wednesday, July 18, 2012

HCatalog


Hadoop Ecosystem
(Many data processing tools)

[Diagram: MapReduce, Hive, and Pig each sit on top of InputFormat / OutputFormat and the filesystem. Pig reads and writes through LoadFunc / StoreFunc; Hive uses SerDe and a metastore backed by an RDBMS.]


Problems

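• No metastore outside of Hive

• Integrating multiple tools on a single cluster is not easy.

• Communication overhead is incurred every time

• Where? How? What?

• M/R and Pig users have a lot of information to remember

• Changes to the schema, data path, or format heavily affect M/R and Pig jobs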


HCatalog

• Apache Incubator

• Based on the Hive metastore

• Provides M/R and Pig users with programming interfaces for reading and writing data

• Provides every DDL command that does not require a MapReduce job (CLI commands)

• Excludes import/export, CREATE TABLE AS SELECT, etc.

• Provides data exploration features

• Provides SHOW TABLES and DESCRIBE

• http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html

• Developed by Hortonworks, Yahoo, Twitter, and others


Table abstraction

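• Metadata

• Data location, schema, compression, partitions, format, etc.

• Data is abstracted through HCatalog

• Metadata is managed in one place, which makes its role all the more important

• Supported column types: primitives, map, list, struct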


HCatalog
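[Diagram: with HCatalog in the stack, MapReduce goes through HCatInputFormat / HCatOutputFormat and Pig goes through HCatLoader / HCatStorer. Both share the metastore (backed by an RDBMS) and SerDe that Hive uses, on top of InputFormat / OutputFormat and the filesystem.]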


Data types : Pig

HCatalog (= Hive)                               Pig

primitives (int, long, float, double, string)   int, long, float, double, chararray
map (key and value pairs)                       map
list (elements of the same data type)           bag
struct (elements of different data types)       tuple
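
The correspondence above can be captured as a small lookup table. The following is a minimal sketch using only the Java standard library; the class name `HCatPigTypes` and the method `toPig` are hypothetical names chosen for illustration and are not part of HCatalog or Pig. It only covers the types listed on this slide.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy lookup table for the HCatalog-to-Pig type correspondence listed on the slide. */
public class HCatPigTypes {

    // Keys are HCatalog (Hive) type names, values are the Pig types they map to.
    static final Map<String, String> HCAT_TO_PIG = new LinkedHashMap<>();
    static {
        HCAT_TO_PIG.put("int", "int");
        HCAT_TO_PIG.put("long", "long");
        HCAT_TO_PIG.put("float", "float");
        HCAT_TO_PIG.put("double", "double");
        HCAT_TO_PIG.put("string", "chararray");
        HCAT_TO_PIG.put("map", "map");
        HCAT_TO_PIG.put("list", "bag");
        HCAT_TO_PIG.put("struct", "tuple");
    }

    /** Returns the Pig type for an HCatalog type name; fails for types the slide does not list. */
    static String toPig(String hcatType) {
        String pigType = HCAT_TO_PIG.get(hcatType.toLowerCase(Locale.ROOT));
        if (pigType == null) {
            throw new IllegalArgumentException("type not covered by this sketch: " + hcatType);
        }
        return pigType;
    }

    public static void main(String[] args) {
        // Print the whole table in the order it was declared.
        for (Map.Entry<String, String> e : HCAT_TO_PIG.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```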


Examples


DDL

$HCAT_HOME/bin/hcat -e "
drop table if exists rawevents;
create external table rawevents (
url string, user string
)
partitioned by (ds string);
"

$HIVE_HOME/bin/hive -e "
LOAD DATA LOCAL INPATH '...' OVERWRITE INTO TABLE rawevents
PARTITION (ds='20120530');
"


Pig

raw = LOAD '/data/rawevents/20120530' AS (url, user);

botless = FILTER raw BY myudfs.NotABot(user);

grpd = GROUP botless by (url, user);

cntd = FOREACH grpd GENERATE FLATTEN(group), COUNT(botless);

STORE cntd INTO '/data/counted/20120530';
Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012, page 8


Pig + HCatalog
Pig
raw = LOAD '/data/rawevents/20120530' AS (url, user);

Pig + HCatalog
raw = LOAD 'rawevents' using org.apache.hcatalog.pig.HCatLoader();

Pig + HCatalog (Partition Filter)
raw_0530 = FILTER raw BY ds == '20120530';

Pig
STORE cntd INTO '/data/counted/20120530';

Pig + HCatalog
STORE cntd INTO 'counted' using org.apache.hcatalog.pig.HCatStorer();
Source: http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012, page 8


MapReduce

• Use the HCatInputFormat and HCatOutputFormat classes

• The value class is HCatRecord by default

• The key is not used

• Set OutputValueClass to HCatRecord

• As always, the Reducer is optional, not required

• Partitions can be controlled

• Easy to control through the schema


MapReduce - Job
Job job = new Job(getConf());
job.setJarByClass(HCatMRTest.class);
job.setJobName("HCatMRTest");

job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(HCatRecord.class);

job.setMapperClass(HCatMRTest.Map.class);
job.setInputFormatClass(HCatInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);

job.setNumReduceTasks(0);


MapReduce - DB, TBL, Partition
java.util.Map<String, String> partition = ...
partition.put("ds", "20120530");

InputJobInfo in = InputJobInfo.create("DB", "rawevents", "ds='20120530'");
OutputJobInfo out = OutputJobInfo.create("DB", "counted", partition);

HCatInputFormat.setInput(job, in);
HCatOutputFormat.setOutput(job, out);

HCatSchema s = HCatOutputFormat.getTableSchema(job);
HCatOutputFormat.setSchema(job, s);
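
The input side above takes a filter string while the output side takes a map of partition values, so the same partition ends up written twice in two forms. One way to keep them in sync is to derive the filter from the map. This is a minimal sketch using only the Java standard library; `PartitionFilter` and `toFilter` are hypothetical helper names, not HCatalog APIs. Joining several keys with ` and ` is an assumption about the filter syntax, and values are not escaped.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that derives an input filter string from partition key/value pairs. */
public class PartitionFilter {

    /** Turns {ds=20120530} into ds='20120530'; several keys are joined with " and " (assumed syntax). */
    static String toFilter(Map<String, String> partition) {
        StringJoiner filter = new StringJoiner(" and ");
        for (Map.Entry<String, String> e : partition.entrySet()) {
            // Values are quoted as-is; a value containing a quote would need escaping.
            filter.add(e.getKey() + "='" + e.getValue() + "'");
        }
        return filter.toString();
    }

    public static void main(String[] args) {
        Map<String, String> partition = new LinkedHashMap<>();
        partition.put("ds", "20120530");
        // The same map could be passed to the output side while the filter goes to the input side.
        System.out.println(toFilter(partition));
    }
}
```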


MapReduce - HCatRecord

• Class that represents a single record

• boolean, byte, short, integer, long, float, double, string, list, struct, map

• tinyint : HCatRecord.getByte

• smallint : HCatRecord.getShort

• Fields can be accessed by index or by column name

• Access by column name requires the HCatSchema

• Reserve space for the partition columns


MapReduce - HCatRecord
How to obtain the table schema information

HCatSchema in = HCatInputFormat.getTableSchema(context);
HCatSchema out = HCatOutputFormat.getTableSchema(context);

// HCatRecord is abstract; DefaultHCatRecord is the concrete implementation.
HCatRecord record = new DefaultHCatRecord(3);
record.set("url", out, value.get("url", in));

context.write(null, record);

The schema information is recorded (encoded) in job.xml
* mapreduce.lib.hcat.job.info
* mapreduce.lib.hcatoutput.info
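
Why name-based access needs a schema can be shown with a toy model: a record stores values by position only, and the schema is what translates a column name into a position. The sketch below uses only the Java standard library. `ToySchema` and `ToyRecord` are hypothetical stand-ins written for illustration, not the HCatalog classes, and the column names `cnt` and `user` in `main` are made up.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of positional records plus a schema that resolves column names to positions. */
public class NamedAccessSketch {

    /** Stand-in for a schema: an ordered list of column names. */
    static final class ToySchema {
        private final List<String> columns;

        ToySchema(String... columns) {
            this.columns = Arrays.asList(columns);
        }

        int positionOf(String name) {
            int position = columns.indexOf(name);
            if (position < 0) {
                throw new IllegalArgumentException("unknown column: " + name);
            }
            return position;
        }

        int size() {
            return columns.size();
        }
    }

    /** Stand-in for a record: values live in an array, so names mean nothing without a schema. */
    static final class ToyRecord {
        private final Object[] values;

        ToyRecord(int size) {
            this.values = new Object[size];
        }

        Object get(int index) {
            return values[index];
        }

        Object get(String name, ToySchema schema) {
            return values[schema.positionOf(name)];
        }

        void set(String name, ToySchema schema, Object value) {
            values[schema.positionOf(name)] = value;
        }
    }

    public static void main(String[] args) {
        ToySchema in = new ToySchema("url", "user");
        ToySchema out = new ToySchema("cnt", "url", "user");

        ToyRecord value = new ToyRecord(in.size());
        value.set("url", in, "http://example.com/");

        // The output record is sized from the output schema; "url" sits at a different position there.
        ToyRecord record = new ToyRecord(out.size());
        record.set("url", out, value.get("url", in));
        System.out.println(record.get(out.positionOf("url")));
    }
}
```

Because the input and output schemas place `url` at different positions, copying by name stays correct where copying by index would not.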


Conclusions

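• Metadata management becomes easier even when only Pig and MR are used

• Pays off most when multiple tools are used together

• Contributions are arriving quickly, so expect more in the future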


Templeton


The Templeton project is named after a
character in the award-winning children's
novel Charlotte's Web, by E. B. White. The
novel's protagonist is a pig named Wilbur.
Templeton is a rat who helps Wilbur by
running errands and making deliveries.


Templeton

• HCatalog integration

• Thrift
• Java API (HCATALOG-419)
• REST API
• Web services interface for HCatalog access and for Pig, Hive,
and MR job execution
• http://github.com/hortonworks/templeton
• HCATALOG-182
• a.k.a. 'webhcat'


Getting started

• Install
◦ Requirements
■ Hadoop 0.20.205 or Hadoop 1.x
■ Zookeeper
■ HCatalog
■ Hadoop Distributed Cache
■ To use the Hive, Pig, or hadoop/streaming
resources
• Configuration
◦ templeton-site.xml
• Security
◦ Default security (without additional authentication)
◦ Authentication via Kerberos


Templeton Resources

:version
Returns a list of supported response types.
status
Returns the Templeton server status.
version
Returns a list of supported versions and the
current version.


Templeton Resources (2)
ddl
Performs an HCatalog DDL command.
ddl/database
List HCatalog databases.
ddl/database/:db (GET)
Describe an HCatalog database.
ddl/database/:db (PUT)
Create an HCatalog database.
ddl/database/:db (DELETE)
Delete (drop) an HCatalog database.
ddl/database/:db/table
List the tables in an HCatalog database.
ddl/database/:db/table/:table (GET)
Describe an HCatalog table.
ddl/database/:db/table/:table (POST)
Rename an HCatalog table.
ddl/database/:db/table/:table/partition
List all partitions in an HCatalog table.
ddl/database/:db/table/:table/partition/:partition (GET)
Describe a single partition in an HCatalog table.
......
......
ddl/database/:db/table/:table/partition/:partition (PUT)

Templeton Resources (3)

mapreduce/streaming
Creates and queues Hadoop streaming MapReduce jobs.
mapreduce/jar
Creates and queues standard Hadoop MapReduce jobs.
pig
Creates and queues Pig jobs.
hive
Runs Hive queries and commands.
queue
Returns a list of all jobids registered for the user.
queue/:jobid (GET)
Returns the status of a job given its ID.
queue/:jobid (DELETE)
Kill a job given its ID.


Examples

$ curl -s 'http://tb080:50111/templeton/v1/status'
{"status":"ok","version":"v1"}
$ curl -s -d user.name=nexr -d 'exec=show tables;' \
'http://tb080:50111/templeton/v1/ddl'
{
"stdout": "emp\nname\nname_a29\n",
"stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... //[jar:file:/home/nexr/nexr_platforms/hadoop/hadoop-1.0.3/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\nOK\nTime taken: 0.491 seconds\n",
"exitcode": 0
}


Examples

$ curl -s 'http://tb080:50111/templeton/v1/ddl/database/default/table/emp?user.name=nexr'
{
"statement": "use default; desc emp; ",
"error": "...",
"exec": {
"stdout": "{\"columns\":[{\"name\":\"empno\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"deptno\",\"type\":\"int\"}]}\t \t \n",
"stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. ...... explanation.\nOK\nTime taken: 0.324 seconds\nOK\nTime taken: 0.398 seconds\n",
"exitcode": 0
}
}
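
The requests so far share one shape: a base of `http://host:port/templeton/v1/`, a resource path from the resource listings, and a `user.name` query parameter. The sketch below assembles such URLs with only the Java standard library and sends nothing over the network. `TempletonUrls`, `resource`, and `table` are hypothetical helper names for illustration, not part of any Templeton client library, and database and table names are assumed to need no URL escaping.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that assembles Templeton resource URLs in the shape used by the curl examples. */
public class TempletonUrls {

    /** Builds http://hostPort/templeton/v1/path?user.name=user, URL-encoding only the user name. */
    static String resource(String hostPort, String path, String user) {
        try {
            return "http://" + hostPort + "/templeton/v1/" + path
                    + "?user.name=" + URLEncoder.encode(user, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
    }

    /** URL of the ddl/database/:db/table/:table resource (names assumed to be URL-safe). */
    static String table(String hostPort, String db, String table, String user) {
        return resource(hostPort, "ddl/database/" + db + "/table/" + table, user);
    }

    public static void main(String[] args) {
        // Host, port, and user are taken from the curl examples in this deck.
        System.out.println(resource("tb080:50111", "status", "nexr"));
        System.out.println(table("tb080:50111", "default", "emp", "nexr"));
    }
}
```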


Examples
$ curl -s -X PUT -HContent-type:application/json -d '{
"comment": "Test table",
"columns": [
{ "name": "id", "type": "bigint" },
{ "name": "price", "type": "float", "comment": "The unit price" } ],
"partitionedBy": [
{ "name": "country", "type": "string" } ],
"format": { "storedAs": "rcfile" } }'
'http://tb080:50111/templeton/v1/ddl/database/default/table/test_table?user.name=nexr'
hive> show tables;
OK
emp
test_table
Time taken: 0.477 seconds
hive> describe extended test_table;
OK
id bigint
price float The unit price
country string

Detailed Table Information Table(tableName:test_table,
dbName:default, owner:nexr, createTime:1342578059, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id,
type:bigint, comment:null), FieldSchema(name:price, type:float,
comment:The unit price), FieldSchema(name:country, type:string,


Future of Templeton

• webhcat
• Java API based on REST API
• Integrate or replace existing web interfaces, e.g.,
WebHDFS

