10. After reshaping the data with Pig
Result (dp, id, nm, average hr): 北 A1 劉 12.5
Logical Plan
LOAD     (nm, dp, id)
LOAD     (id, dt, hr)
FILTER   (id, dt, hr)
JOIN     (nm, dp, id, id, dt, hr)
GROUP    (group, {(nm, dp, id, id, dt, hr)})
FOREACH  (group, …., AVG(hr))
STORE    (dp, group, nm, hr)
Pig Latin
A = LOAD 'file1.txt' using PigStorage(',') AS (nm, dp, id);
B = LOAD 'file2.txt' using PigStorage(',') AS (id, dt, hr);
C = FILTER B by hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';
Input data:
file1.txt (nm, dp, id)    file2.txt (id, dt, hr)
劉  北  A1                 A1  7/7  13
李  中  B1                 A1  7/8  12
王  中  B2                 A1  7/9  4
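Tips: case matters (relation, field, and function names such as AVG are case-sensitive, even though keywords such as LOAD are not); validate with a small data set in pig -x local mode first; after each statement, use DUMP or ILLUSTRATE to check that the result is correct.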
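To make the intermediate schemas in the logical plan concrete, here is a minimal sketch that replays the same steps in plain Python on the three sample rows. It is an emulation, not Pig itself: the variable names are made up for illustration, hr is assumed to be already parsed as an integer, the joined tuple follows the field order shown on the slide, and the FOREACH step takes the first dp and nm of each group where Pig would return bags.

```python
from collections import defaultdict

# Sample rows from the slide: file1.txt is (nm, dp, id), file2.txt is (id, dt, hr).
file1 = [("劉", "北", "A1"), ("李", "中", "B1"), ("王", "中", "B2")]
file2 = [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)]

# C = FILTER B BY hr > 8
c = [row for row in file2 if row[2] > 8]

# D = JOIN C BY id, A BY id  (inner join; tuple laid out as on the slide)
d = [(nm, dp, a_id, c_id, dt, hr)
     for (c_id, dt, hr) in c
     for (nm, dp, a_id) in file1
     if c_id == a_id]

# E = GROUP D BY A::id  (group key -> list of joined tuples)
e = defaultdict(list)
for row in d:
    e[row[2]].append(row)

# F = FOREACH E GENERATE dp, group, nm, AVG(hr)
# Simplification: first dp/nm per group instead of a bag of values.
f = []
for group, rows in e.items():
    avg_hr = sum(r[5] for r in rows) / len(rows)
    f.append((rows[0][1], group, rows[0][0], avg_hr))

print(f)  # [('北', 'A1', '劉', 12.5)]
```

Printing c, d and e after each step plays the same role as DUMP in the Pig shell: the 7/9 row is dropped by the filter, and only the A1 rows survive the join.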
13. What the Hive architecture provides
• Interfaces
– CLI
– WebUI
– API
• JDBC and ODBC
• Thrift Server (hiveserver)
– Lets remote clients run HiveQL through the API
• Metastore
– DB, table, partition…
figure Source : http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive
14. Now for programming that even a bee can do
$ hive
hive> create table A(x int, y int, z int);
hive> load data local inpath 'file1' into table A;
hive> select * from A where y>10000;
hive> insert overwrite table B select * from A where y>10000;
figure Source : http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/
15. After reshaping the data with Hive
Result (dp, id, nm, average hr): 北 A1 劉 12.5
HiveQL
> create table A (nm String, dp String, id String);
> create table B (id String, dt Date, hr int);
> create table final (dp String, id String, nm String, avg float);
> load data inpath 'file1' into table A;
> load data inpath 'file2' into table B;
> insert overwrite table final select collect_set(a.dp)[0], a.id, collect_set(a.nm)[0], avg(b.hr)
from a,b where b.hr > 8 and b.id = a.id group by a.id;
Input data:
Table A (nm, dp, id)      Table B (id, dt, hr)
劉  北  A1                 A1  7/7  13
李  中  B1                 A1  7/8  12
王  中  B2                 A1  7/9  4
Tips: for create table and load data, importing the data with a tool is recommended, since it is less error-prone.
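The join, filter, and group-by logic of the query can be checked on the sample rows without a Hive installation. The sketch below runs the equivalent statement in SQLite through Python's built-in sqlite3 module. It is an approximation under stated assumptions: SQLite has no collect_set, so MIN() stands in to pick one dp and nm per group; the average column is named avg_hr here rather than avg; and the rows are inserted directly instead of being loaded from files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same three tables as on the slide (SQLite types; avg_hr is a renamed column).
conn.executescript("""
    CREATE TABLE a (nm TEXT, dp TEXT, id TEXT);
    CREATE TABLE b (id TEXT, dt TEXT, hr INTEGER);
    CREATE TABLE final (dp TEXT, id TEXT, nm TEXT, avg_hr REAL);
""")

# Stand-in for "load data inpath ...": insert the sample rows directly.
conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                 [("劉", "北", "A1"), ("李", "中", "B1"), ("王", "中", "B2")])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                 [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)])

# MIN() replaces collect_set(): one representative value per group.
conn.execute("""
    INSERT INTO final
    SELECT MIN(a.dp), a.id, MIN(a.nm), AVG(b.hr)
    FROM a, b
    WHERE b.hr > 8 AND b.id = a.id
    GROUP BY a.id
""")

rows = conn.execute("SELECT dp, id, nm, avg_hr FROM final").fetchall()
print(rows)  # [('北', 'A1', '劉', 12.5)]
conn.close()
```

Only id A1 has rows in both tables, and only the 13-hour and 12-hour entries pass the hr > 8 filter, which gives the 12.5 average shown in the result row.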
16. Hive vs. SQL
                     Hive                     RDBMS
Query language       HQL                      SQL
Storage              HDFS                     Raw device or local FS
Execution            MapReduce                Executor
Latency              Very high                Low
Data scale           Large                    Small
Data modification    No                       Yes
Indexes              Index, Bitmap index…     Complex, robust indexing mechanisms
Source : http://sishuok.com/forum/blogPost/list/6220.html
17. Pig vs Hive
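                  Hive              Pig
Syntax            SQL-like          PigLatin
Schemas/Types     Yes (explicit)    Yes (implicit)
Partitions        Yes               No
Thrift Server     Yes               No
Web Interface     Yes               No
JDBC/ODBC         Yes (limited)     No
HDFS operations   No                Yes
Hive is better suited to data-warehouse tasks: static structures and work that needs frequent analysis.
Pig gives developers more flexibility when working with Big Data and allows them to write concise scripts.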
Source : http://f.dataguru.cn/thread-33553-1-1.html
37. Pig example result
A = LOAD '/user/waue/pig_input/file1.txt'
using PigStorage(',') AS (nm, dp, id) ;
B = LOAD '/user/waue/pig_input/file2.txt'
using PigStorage(',') AS (id, dt, hr) ;
C = FILTER B by hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
F = FOREACH E GENERATE
$1.dp,group,$1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';
38. Hive example result
INSERT OVERWRITE TABLE final
select collect_set(a.dp)[0], a.id,
collect_set(a.nm)[0], avg(b.hr)
from a,b where b.hr > 8 and b.id = a.id
group by a.id;