4. Problems
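• No metastore exists outside of Hive
• Integrating several tools on a single cluster is not easy.
• Communication overhead is incurred every time
• Where? How? What?
• M/R and Pig users have a lot of information to remember
• Changes to the schema, data path, or format heavily affect M/R and Pig jobs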
Wednesday, July 18, 12
5. HCatalog
• Apache Incubator
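• Based on the Hive metastore
• Provides M/R and Pig users with a programming interface for reading and writing data
• Provides every DDL command that does not require a MapReduce job (CLI commands)
• Excludes import/export, CREATE TABLE AS SELECT, etc.
• Provides data exploration features
• Provides SHOW TABLES and DESCRIBE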
• http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html
• Developed by Hortonworks, Yahoo, Twitter, and others
6. Table abstraction
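• Metadata
• Data location, schema, compression, partitions, format, etc.
• Data is abstracted through HCatalog
• Metadata is managed in one place, which makes its role all the more important
• Supported column types: primitives, map, list, struct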
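To make the table abstraction concrete, the sketch below models the metadata listed above as a plain Python structure. It is a toy illustration, not HCatalog's API: the `TableMeta` class, its field names, and the sample values (the `/data/rawevents` location, the `TextFile` format) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TableMeta:
    """Hypothetical stand-in for what a metastore records about one table."""
    name: str
    location: str                      # where the data lives
    columns: List[Tuple[str, str]]     # (column name, type) pairs, in order
    partition_keys: List[str] = field(default_factory=list)
    file_format: str = "TextFile"      # assumed sample value
    compression: str = "none"          # assumed sample value


# A toy catalog: one place that every tool consults by table name.
catalog: Dict[str, TableMeta] = {
    "rawevents": TableMeta(
        name="rawevents",
        location="/data/rawevents",
        columns=[("url", "string"), ("user", "string")],
        partition_keys=["ds"],
    )
}


def describe(table: str) -> str:
    """Render a DESCRIBE-like summary from the catalog entry."""
    meta = catalog[table]
    cols = ", ".join(f"{n} {t}" for n, t in meta.columns)
    parts = ", ".join(meta.partition_keys) or "(none)"
    return f"{meta.name}: [{cols}] partitioned by {parts} at {meta.location}"


print(describe("rawevents"))
```

Because callers only pass the table name, a change to the location or format touches the catalog entry and nothing else.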
8. Data types : Pig
HCatalog (= Hive)                                          Pig
primitives (int, long, float, double, string)              int, long, float, double, chararray
map (contains key and value pairs)                         map
list (contains a list of elements of the same data type)   bag
struct (contains elements of different data types)         tuple
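The mapping above can be written down as a small lookup table. The sketch below is illustrative only: `HCAT_TO_PIG` and `to_pig_schema` are hypothetical names, and the output is a simplified `name:type` string that omits the element schemas a real Pig bag, tuple, or map declaration would carry.

```python
from typing import List, Tuple

# HCatalog (Hive) type -> Pig type, following the slide's table.
HCAT_TO_PIG = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
    "map": "map",
    "list": "bag",
    "struct": "tuple",
}


def to_pig_schema(columns: List[Tuple[str, str]]) -> str:
    """Render (name, hcat_type) pairs as a simplified Pig-style schema string."""
    return ", ".join(f"{name}:{HCAT_TO_PIG[hcat_type]}" for name, hcat_type in columns)


print(to_pig_schema([("url", "string"), ("hits", "long"), ("tags", "list")]))
```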
10. DDL
$HCAT_HOME/bin/hcat -e "
drop table if exists rawevents;
create external table rawevents (
  url string, user string
)
partitioned by (ds string);
"
$HIVE_HOME/bin/hive -e "
LOAD DATA LOCAL INPATH '...' OVERWRITE INTO TABLE rawevents
PARTITION (ds='20120530');
"
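When the DDL is issued from a script rather than typed at a prompt, building the command as an argument list sidesteps shell quoting problems. The sketch below only constructs the command and does not run it; `hcat_command` is a hypothetical helper and `/opt/hcatalog` is an assumed install path. The `-e` flag is the one used on the slide.

```python
import os
import shlex
from typing import List, Optional


def hcat_command(ddl: str, hcat_home: Optional[str] = None) -> List[str]:
    """Build the argv list for `hcat -e <ddl>` without running it."""
    home = hcat_home if hcat_home is not None else os.environ.get("HCAT_HOME", "")
    return [f"{home}/bin/hcat", "-e", ddl]


ddl = "create external table rawevents (url string, user string) partitioned by (ds string);"
cmd = hcat_command(ddl, hcat_home="/opt/hcatalog")  # hypothetical install path
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```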
11. Pig
raw = LOAD '/data/rawevents/20120530' AS (url, user);
botless = FILTER raw BY myudfs.NotABot(user);
grpd = GROUP botless by (url, user);
-- after GROUP, the (url, user) key is exposed as the field named group
cntd = FOREACH grpd GENERATE FLATTEN(group), COUNT(botless);
STORE cntd INTO '/data/counted/20120530';
http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012 : Page. 8
12. Pig + HCatalog
Pig
raw = LOAD '/data/rawevents/20120530' AS (url, user);
Pig + HCatalog
raw = LOAD 'rawevents' USING org.apache.hcatalog.pig.HCatLoader();
Pig + HCatalog (Partition Filter)
-- replaces the date directory in LOAD '/data/rawevents/20120530'
raw_0530 = FILTER raw BY ds == '20120530';
Pig
STORE cntd INTO '/data/counted/20120530';
Pig + HCatalog
STORE cntd INTO 'counted' USING org.apache.hcatalog.pig.HCatStorer();
http://www.slideshare.net/hortonworks/h-cat-berlinbuzzwords2012 : Page. 8
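The difference between the two styles is who knows the path. As a rough model of what the catalog does on the script's behalf, the sketch below resolves a table name and an optional `ds` value to data locations. The `PARTITIONS` registry and the `locations` function are hypothetical, the paths are sample values, and none of this reflects how HCatLoader is implemented.

```python
from typing import Dict, List, Optional, Tuple

# Toy partition registry: (table, ds) -> data location. Paths are sample values.
PARTITIONS: Dict[Tuple[str, str], str] = {
    ("rawevents", "20120529"): "/data/rawevents/20120529",
    ("rawevents", "20120530"): "/data/rawevents/20120530",
}


def locations(table: str, ds: Optional[str] = None) -> List[str]:
    """Return the locations to scan for a table, optionally narrowed to one partition."""
    return sorted(
        path
        for (tbl, part_ds), path in PARTITIONS.items()
        if tbl == table and (ds is None or part_ds == ds)
    )


print(locations("rawevents"))               # whole table
print(locations("rawevents", "20120530"))   # partition filter applied
```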
13. MapReduce
• Use the HCatInputFormat and HCatOutputFormat classes
• The value class is HCatRecord by default
• The key is not used
• Set OutputValueClass to HCatRecord
• As always, the Reducer is optional, not required
• Partitions can be controlled
• Easy to control through the schema
15. MapReduce - DB, TBL, Partition
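// Partition values for the output table
java.util.Map<String, String> partition = new java.util.HashMap<String, String>();
partition.put("ds", "20120530");
// Input: database, table, partition filter
InputJobInfo in = InputJobInfo.create("DB", "rawevents", "ds='20120530'");
// Output: database, table, partition values
OutputJobInfo out = OutputJobInfo.create("DB", "counted", partition);
HCatInputFormat.setInput(job, in);
HCatOutputFormat.setOutput(job, out);
// Reuse the output table's schema for the records being written
HCatSchema s = HCatOutputFormat.getTableSchema(job);
HCatOutputFormat.setSchema(job, s);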
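The input side takes a filter string while the output side takes a map of partition values, so the same partition is written two ways. The sketch below derives the filter string from the map. `partition_filter` is a hypothetical helper, joining several keys with `and` is an assumption about the filter syntax, and values containing quotes are not escaped.

```python
from typing import Dict


def partition_filter(spec: Dict[str, str]) -> str:
    """Turn {"ds": "20120530"} into the filter string ds='20120530'."""
    return " and ".join(f"{key}='{value}'" for key, value in spec.items())


partition = {"ds": "20120530"}
print(partition_filter(partition))
```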
16. MapReduce - HCatRecord
• The class that represents a single record
• boolean, byte, short, integer, long, float, double, string, list, struct, map
• tinyint : HCatRecord.getByte
• smallint : HCatRecord.getShort
• Fields can be accessed by index or by column name
• Access by column name requires the HCatSchema
• Reserve room in the record for the partition columns
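The reason name-based access needs a schema is that the record itself only stores positions. The sketch below shows that relationship with two toy classes. `ToySchema` and `ToyRecord` are hypothetical stand-ins written for this example and do not mirror the real HCatSchema or HCatRecord method signatures.

```python
from typing import Any, List


class ToySchema:
    """Maps column names to positions, like a much simplified HCatSchema."""

    def __init__(self, names: List[str]) -> None:
        self._positions = {name: i for i, name in enumerate(names)}

    def position(self, name: str) -> int:
        return self._positions[name]

    def size(self) -> int:
        return len(self._positions)


class ToyRecord:
    """Fixed-size list of fields, addressable by index or by column name."""

    def __init__(self, size: int) -> None:
        self._fields: List[Any] = [None] * size

    def get(self, index: int) -> Any:
        return self._fields[index]

    def set(self, index: int, value: Any) -> None:
        self._fields[index] = value

    def get_by_name(self, name: str, schema: ToySchema) -> Any:
        return self.get(schema.position(name))

    def set_by_name(self, name: str, schema: ToySchema, value: Any) -> None:
        self.set(schema.position(name), value)


# Two data columns plus one slot reserved for the partition column "ds".
schema = ToySchema(["url", "user", "ds"])
record = ToyRecord(schema.size())
record.set(0, "http://example.com/")
record.set_by_name("user", schema, "alice")
record.set_by_name("ds", schema, "20120530")
print(record.get_by_name("url", schema), record.get(1), record.get(2))
```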
17. MapReduce - HCatRecord
How to obtain the table schema:
HCatSchema in = HCatInputFormat.getTableSchema(context);
HCatSchema out = HCatOutputFormat.getTableSchema(context);
// HCatRecord is abstract; DefaultHCatRecord is its concrete implementation
HCatRecord record = new DefaultHCatRecord(3);
record.set("url", out, value.get("url", in));
context.write(null, record);
The schema information is recorded (encoded) in job.xml:
* mapreduce.lib.hcat.job.info
* mapreduce.lib.hcatoutput.info
18. Conclusions
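• Metadata management becomes easier even when only Pig and MR are used
• The benefit is greatest when a variety of tools are used together
• Contributions are arriving quickly, so more can be expected in the future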
21. The Templeton project is named after a
character in the award-winning children's
novel Charlotte's Web, by E. B. White. The
novel's protagonist is a pig named Wilbur.
Templeton is a rat who helps Wilbur by
running errands and making deliveries.
22. Templeton
• HCatalog integration
• Thrift
• Java API (HCATALOG-419)
• REST API
• Web services interface for HCatalog access and Pig, Hive
and MR job execution
• http://github.com/hortonworks/templeton
• HCATALOG-182
• a.k.a. 'webhcat'
23. Getting started
• Install
◦ Requirements
■ Hadoop 0.20.205 or Hadoop 1.x
■ Zookeeper
■ HCatalog
■ Hadoop Distributed Cache (to use the Hive, Pig, or
hadoop/streaming resources)
• Configuration
◦ templeton-site.xml
• Security
◦ Default security (without additional authentication)
◦ Authentication via Kerberos
24. Templeton Resources
:version
Returns a list of supported response types.
status
Returns the Templeton server status.
version
Returns a list of supported versions and the
current version.
25. Templeton Resources (2)
ddl
Performs an HCatalog DDL command.
ddl/database
List HCatalog databases.
ddl/database/:db (GET)
Describe an HCatalog database.
ddl/database/:db (PUT)
Create an HCatalog database.
ddl/database/:db (DELETE)
Delete (drop) an HCatalog database.
ddl/database/:db/table
List the tables in an HCatalog database.
ddl/database/:db/table/:table (GET)
Describe an HCatalog table.
ddl/database/:db/table/:table (POST)
Rename an HCatalog table.
ddl/database/:db/table/:table/partition
List all partitions in an HCatalog table.
ddl/database/:db/table/:table/partition/:partition (GET)
Describe a single partition in an HCatalog table.
......
......
ddl/database/:db/table/:table/partition/:partition (PUT)
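The DDL resources nest in a regular way, so their URLs can be composed from the parent's URL. The sketch below builds them with the standard library only. The base URL is taken from the example slide, the three helper functions are hypothetical, and the `ds='20120530'` partition name is an assumed format used for illustration.

```python
from urllib.parse import quote

BASE = "http://tb080:50111/templeton/v1"  # host and port from the example slide


def database_url(db: str) -> str:
    """URL for ddl/database/:db."""
    return f"{BASE}/ddl/database/{quote(db, safe='')}"


def table_url(db: str, table: str) -> str:
    """URL for ddl/database/:db/table/:table."""
    return f"{database_url(db)}/table/{quote(table, safe='')}"


def partition_url(db: str, table: str, partition: str) -> str:
    """URL for ddl/database/:db/table/:table/partition/:partition."""
    return f"{table_url(db, table)}/partition/{quote(partition, safe='')}"


print(table_url("default", "rawevents"))
print(partition_url("default", "rawevents", "ds='20120530'"))
```

The HTTP method then selects the operation on the same URL: GET to describe, PUT to create, DELETE to drop.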
26. Templeton Resources (3)
mapreduce/streaming
Creates and queues Hadoop streaming MapReduce jobs.
mapreduce/jar
Creates and queues standard Hadoop MapReduce jobs.
pig
Creates and queues Pig jobs.
hive
Runs Hive queries and commands.
queue
Returns a list of all jobids registered for the user.
queue/:jobid (GET)
Returns the status of a job given its ID.
queue/:jobid (DELETE)
Kill a job given its ID.
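For the queue resource, the same URL serves two operations that differ only in the HTTP method. The sketch below prepares both requests without sending them. `queue_request` is a hypothetical helper, the job ID is a made-up sample, and passing `user.name` as a query parameter is an assumption carried over from the default (non-Kerberos) security mode.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://tb080:50111/templeton/v1"  # host and port from the example slide


def queue_request(jobid: str, user: str, kill: bool = False) -> Request:
    """Prepare a GET (status) or DELETE (kill) request for queue/:jobid."""
    url = f"{BASE}/queue/{jobid}?{urlencode({'user.name': user})}"
    return Request(url, method="DELETE" if kill else "GET")


status_req = queue_request("job_201207180000_0001", "nexr")           # sample job ID
kill_req = queue_request("job_201207180000_0001", "nexr", kill=True)
print(status_req.get_method(), status_req.full_url)
print(kill_req.get_method(), kill_req.full_url)
# To send one: urllib.request.urlopen(status_req)
```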
27. Examples
$ curl -s 'http://tb080:50111/templeton/v1/status'
{"status":"ok","version":"v1"}
$ curl -s -d user.name=nexr -d 'exec=show tables;' \
    'http://tb080:50111/templeton/v1/ddl'
{
"stdout": "emp\nname\nname_a29\n",
"stderr": "WARNING: org.apache.hadoop.metrics.jvm.EventCounter
is deprecated. ......
//[jar:file:/home/nexr/nexr_platforms/hadoop/hadoop-1.0.3/
lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/
StaticLoggerBinder.class]\nSLF4J: See http://www.slf4j.org/
codes.html#multiple_bindings for an explanation.\nOK\nTime
taken: 0.491 seconds\n",
"exitcode": 0
}
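A client usually wants the table names rather than the raw reply. The sketch below parses a reply of the shape shown above, with `stdout`, `stderr`, and `exitcode` fields. The `SAMPLE` string is a shortened stand-in for the real reply, and `tables_from_ddl_response` is a hypothetical helper.

```python
import json
from typing import List

# Shortened stand-in for the reply above; the raw string keeps \n as a JSON escape.
SAMPLE = r'''{"stdout": "emp\nname\nname_a29\n",
 "stderr": "OK\nTime taken: 0.491 seconds\n",
 "exitcode": 0}'''


def tables_from_ddl_response(body: str) -> List[str]:
    """Extract one table name per stdout line, failing on a non-zero exit code."""
    reply = json.loads(body)
    if reply["exitcode"] != 0:
        raise RuntimeError(reply["stderr"])
    return [line for line in reply["stdout"].splitlines() if line]


print(tables_from_ddl_response(SAMPLE))
```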