Optimized Join
• Verification
a : 1,000,000,000 records
b : 100,000,000 records
$ SELECT a.hoge, b.fuga FROM a JOIN b on (a.id = b.id)
121.384 s
$ SELECT a.hoge, b.fuga FROM b JOIN a on (b.id = a.id)
122.339 s
$ SELECT /*+ streamtable(a) */ a.hoge, b.fuga FROM b JOIN a on (b.id = a.id)
120.298 s
Programming Hive Reading #3 12
Local Mode
• When the data size is small, Local Mode can be faster because it
reduces overhead.
$ set mapred.job.tracker = local;
$ set mapred.tmp.dir =/tmp/masashi/sada;
$ SELECT * FROM hoge WHERE id = 'fuga'
..........
Job running in-process (local Hadoop)
..........
Local Mode
• When the data size is small, Local Mode can be faster because it
reduces overhead.
• e.g., a table with about 30,000 records
normal mode : 27s
local mode : 10s
• e.g., a table with about 100,000,000 records
normal mode : 40s
local mode : 532s
Local Mode
• To have Hive switch to Local Mode automatically, set
"hive.exec.mode.local.auto=true"
• Local Mode is used when all of the following conditions hold:
  • The total input size of the job is lower than
    "hive.exec.mode.local.auto.inputbytes.max" (128MB by default)
  • The total number of map tasks is less than
    "hive.exec.mode.local.auto.tasks.max" (4 by default)
  • The total number of reduce tasks required is 1 or 0.
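
As a minimal sketch of the settings above, a Hive session could enable automatic Local Mode as follows. The property names and defaults are the ones listed on this slide; the explicit threshold values only restate those defaults (134217728 bytes = 128 MB). Which properties are available can differ between Hive versions, so treat this as an assumption to check against your installation.

```sql
-- Let Hive decide per query whether to run in Local Mode.
SET hive.exec.mode.local.auto=true;
-- Input-size threshold in bytes; 134217728 = 128 MB, the default listed above.
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
-- Map-task threshold; 4 is the default listed above.
SET hive.exec.mode.local.auto.tasks.max=4;
```

With these set, a query that meets the conditions should print the "Job running in-process (local Hadoop)" line, while larger jobs keep running on the cluster.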
Single MR Multi Group By
• Reference: https://issues.apache.org/jira/browse/HIVE-2056
FROM T
insert overwrite table test1 select col1, count(distinct colx) group by col1
insert overwrite table test2 select col1, col2, count(distinct colx) group by col1, col2;
• In the case above, setting "hive.multigroupby.singlemr=true" is
reportedly faster.
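
Put together, the setting and the multi-insert query might look like the sketch below. It assumes `T` is the source table and that `test1` and `test2` already exist with matching columns; whether the single-MR plan is actually faster depends on the data and the Hive version, so it is worth timing both settings.

```sql
-- Ask Hive to plan the multiple GROUP BYs as a single MapReduce job where it can.
SET hive.multigroupby.singlemr=true;

-- One scan of T feeds both inserts.
FROM T
INSERT OVERWRITE TABLE test1
  SELECT col1, count(DISTINCT colx) GROUP BY col1
INSERT OVERWRITE TABLE test2
  SELECT col1, col2, count(DISTINCT colx) GROUP BY col1, col2;
```

Running the same statement with `hive.multigroupby.singlemr=false` gives the baseline to compare against.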