Hadoop ecosystem

Head First: the Hadoop ecosystem
陳威宇

Today's elephant steak set: the cuts on the menu…
figure Source : http://aryannava.com/2014/02/19/apache-hadoop-ecosystem/hadoopecosystem/

Wait, why do we need all of these things?
• We are about to meet six cases; for each one, let's think about how to solve it together
Hadoop EcoSystem

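Problem
• Scenario:
– Hundreds of services run on many different machines. Every service produces a huge volume of logs that must be analyzed. I know the logs can end up in Hadoop, but…
• Problem:
– How do I ship this never-ending stream of data into Hadoop?
• Solution:
– Hand-craft programs to do it
– Use Flume
figure Source : http://image.slidesharecdn.com/flume-120314204418-phpapp01/95/apache-flume-4-728.jpg?cb=1338404245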

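Apache Flume: log collector
• A near-real-time log collection system
• Collects logs scattered across different nodes and machines into HDFS
• No programming required: you only define a config file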
Source
• netcat
• exec
• syslog
• spooldir
• seq
• http
• avro
Sink
• logger
• hdfs
• file_roll
• hbase
• solr
• avro
Channel
• memory
• jdbc
• file
figure Source : https://flume.apache.org/FlumeUserGuide.html
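The source, channel, and sink roles listed above can be pictured with a toy model. The following is a minimal Python sketch and not Flume code: `MemoryChannel`, `run_agent`, and the sample log lines are hypothetical names invented for illustration, and a real Flume agent is wired up in a config file rather than in a program. It only shows the idea that a bounded channel buffers events between the component that reads logs and the component that writes them out.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer between a source and a sink (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        """Accept an event, or return False when the channel is full."""
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take(self, batch_size):
        """Remove and return up to batch_size events in arrival order."""
        batch = []
        while self._events and len(batch) < batch_size:
            batch.append(self._events.popleft())
        return batch


def run_agent(lines, channel, sink, batch_size=2):
    """Push every line through source -> channel -> sink."""
    for line in lines:
        # When the channel is full, let the sink drain a batch, then retry.
        while not channel.put(line):
            sink.extend(channel.take(batch_size))
    # Flush whatever is still buffered once the source is exhausted.
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        sink.extend(batch)


log_lines = ["svc-a GET /", "svc-b POST /login", "svc-a GET /health"]
delivered = []  # stands in for an HDFS sink
run_agent(log_lines, MemoryChannel(capacity=2), delivered)
print(delivered)
```

The list `delivered` plays the part of the hdfs sink; swapping the sink or the channel type is the kind of change Flume expresses as a config edit.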

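Problem
• Scenario:
– My boss wants the average working hours of every employee in the organization. So I obtained the punch-clock log files for all of Taiwan, and got the employee ID mapping table from HR. The data is large and there is a lot of it, so I thought of feeding it into Hadoop's HDFS, and then…
• Problem:
– To write MapReduce, I started learning Java, object-oriented programming, the Hadoop API, … @@
• Solution:
– Force a program together with shell scripts and give up on Hadoop
– Use Pig
– Buckle down and study day and night…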

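With Pig, MapReduce gets simpler!?
• Apache Pig is a high-level query language for processing large-scale data
• Well suited to working with large semi-structured data sets
• Claimed to be about 16 times easier than writing large-scale data processing programs in Java, C++, and similar languages, with about 20 times less code for the same result
• Pig components
– Pig Shell (Grunt)
– Pig Language (Pig Latin)
– Libraries (Piggy Bank)
– UDF: user-defined functions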
figure Source : http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing

Programming that even a pig can do
Function            Commands
Load                LOAD
Store               STORE
Data processing     REGEX_EXTRACT, FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT, …
Aggregation         AVG, COUNT, MAX, MIN, SIZE, …
Math                ABS, RANDOM, ROUND, …
String processing   INDEXOF, SUBSTRING, REGEX_EXTRACT, …
Debug               DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
HDFS                cat, ls, cp, mkdir, …
$ pig -x local
grunt> A = LOAD 'file1' AS (x, y, z);
grunt> B = FILTER A BY y > 10000;
grunt> STORE B INTO 'output';

The MapReduce code before the makeover
Input:
nm    dp       id      id  dt   hr
Liu   North    A1      A1  7/7  13
Li    Central  B1      A1  7/8  12
Wang  Central  B2      A1  7/9  4
Java code (Map-Reduce)
Output:
A1  Liu North  7/7  13
A1  Liu North  7/8  12
A1  Liu North  Jul  12.5

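After the Pig makeover
Input:
nm    dp       id      id  dt   hr
Liu   North    A1      A1  7/7  13
Li    Central  B1      A1  7/8  12
Wang  Central  B2      A1  7/9  4
Result: North A1 Liu 12.5
Logical plan (schema after each step):
LOAD     (nm, dp, id)
LOAD     (id, dt, hr)
FILTER   (id, dt, hr)
JOIN     (nm, dp, id, id, dt, hr)
GROUP    (group, {(nm, dp, id, id, dt, hr)})
FOREACH  (group, …, AVG(hr))
STORE    (dp, group, nm, hr)
Pig Latin:
-- employee table and punch-clock records, both comma-separated
A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
B = LOAD 'file2.txt' USING PigStorage(',') AS (id, dt, hr);
-- keep only records with more than 8 hours
C = FILTER B BY hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
-- $1 is the bag of joined rows in each group
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';
Tips: names are case-sensitive (relation names, field names, and function names such as AVG; keywords such as LOAD are not); validate with a small data set in pig -x local mode first; check each statement with DUMP or ILLUSTRATE before moving on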
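To make the FILTER, JOIN, GROUP, FOREACH steps of this slide concrete, here is a minimal sketch of the same dataflow in plain Python over the three sample rows. It is not how Pig executes the script (Pig compiles it to MapReduce jobs); the function name `average_hours` and the romanized sample values are assumptions made for illustration.

```python
from collections import defaultdict

# Sample rows from the slide, names romanized for this sketch.
employees = [  # (nm, dp, id)
    ("Liu", "North", "A1"),
    ("Li", "Central", "B1"),
    ("Wang", "Central", "B2"),
]
punches = [  # (id, dt, hr)
    ("A1", "7/7", 13),
    ("A1", "7/8", 12),
    ("A1", "7/9", 4),
]


def average_hours(employees, punches, min_hr=8):
    """Mirror the Pig steps: FILTER, JOIN, GROUP, then AVG per group."""
    # FILTER B BY hr > 8
    kept = [p for p in punches if p[2] > min_hr]
    # JOIN C BY id, A BY id (inner join through an id lookup table)
    by_id = {emp_id: (nm, dp) for nm, dp, emp_id in employees}
    joined = [
        (by_id[emp_id][0], by_id[emp_id][1], emp_id, hr)
        for emp_id, _dt, hr in kept
        if emp_id in by_id
    ]
    # GROUP D BY id
    groups = defaultdict(list)
    for nm, dp, emp_id, hr in joined:
        groups[(dp, emp_id, nm)].append(hr)
    # FOREACH E GENERATE dp, group, nm, AVG(hr)
    return [
        (dp, emp_id, nm, sum(hrs) / len(hrs))
        for (dp, emp_id, nm), hrs in sorted(groups.items())
    ]


print(average_hours(employees, punches))  # [('North', 'A1', 'Liu', 12.5)]
```

Only employee A1 has punch records, and the 4-hour record is filtered out, so the average is (13 + 12) / 2 = 12.5, matching the result shown on the slide.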

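Problem
• Scenario:
– The organization keeps attendance records in tables of a uniform format, spread across the databases of departments in every city and county in Taiwan. My boss wants me to gather the data for all of Taiwan and compute the average working hours of all employees. The DB tables have all been converted to CSV files and fed into Hadoop's HDFS, …
• Problem:
– I know Pig lowers the barrier to MapReduce, but I am still more comfortable implementing this in SQL. If only there were a huge, free DB…
• Solution:
– Budget for a high-performance server and install a large-capacity SQL server on it
– Use Hive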

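Hadoop has an "RDB" you can use too: Hive
• Hive = the "RDB" of Hadoop
– Maps structured data files to database tables
– Provides SQL queries (translates SQL statements into MapReduce programs)
• Suited to:
– Users with a SQL background, for work that basic SQL can compute
• Features:
– Scalable, supports user-defined functions, fault-tolerant
• Limitations:
– Long execution times
– Fixed data structure
– Data cannot be modified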

What the Hive architecture provides
• Interfaces
– CLI
– Web UI
– API
• JDBC and ODBC
• Thrift Server (hiveserver)
– Lets remote clients run HiveQL through an API
• Metastore
– DB, table, partition…
figure Source : http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive

Now, programming that even a bee can do
$ hive
hive> create table A(x int, y int, z int);
hive> load data local inpath 'file1' into table A;
hive> select * from A where y > 10000;
hive> insert overwrite table B select * from A where y > 10000;
figure Source : http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/

After the Hive makeover
Input:
nm    dp       id      id  dt   hr
Liu   North    A1      A1  7/7  13
Li    Central  B1      A1  7/8  12
Wang  Central  B2      A1  7/9  4
Result: North A1 Liu 12.5
HiveQL
> create table A (nm String, dp String, id String);
> create table B (id String, dt Date, hr int);
> create table final (dp String, id String, nm String, avg float);
> load data inpath 'file1' into table A;
> load data inpath 'file2' into table B;
> insert overwrite table final select a.dp, a.id, a.nm, avg(b.hr)
from a join b on b.id = a.id where b.hr > 8 group by a.dp, a.id, a.nm;
Tips: for create table and load data, importing the data with a tool is less error-prone
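The join, filter, and average on this slide can be checked on a laptop before touching a cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, which is an assumption made purely for local experimentation: SQLite is not Hive, the column types differ, and the query groups by the three employee columns so it runs in plain SQL. Table and column names follow the slide; the sample values are romanized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (nm TEXT, dp TEXT, id TEXT);      -- employee table
    CREATE TABLE b (id TEXT, dt TEXT, hr INTEGER);   -- punch-clock records
    """
)
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [("Liu", "North", "A1"), ("Li", "Central", "B1"), ("Wang", "Central", "B2")],
)
conn.executemany(
    "INSERT INTO b VALUES (?, ?, ?)",
    [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)],
)

# Average hours per employee, counting only records above 8 hours.
rows = conn.execute(
    """
    SELECT a.dp, a.id, a.nm, AVG(b.hr)
    FROM a JOIN b ON b.id = a.id
    WHERE b.hr > 8
    GROUP BY a.dp, a.id, a.nm
    """
).fetchall()
conn.close()

print(rows)  # [('North', 'A1', 'Liu', 12.5)]
```

Once the query shape gives the expected row on the sample data, the same SELECT can be adapted to HiveQL and pointed at the real tables.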

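Hive compared with an RDBMS
                  Hive                    RDBMS
Query language    HQL                     SQL
Storage           HDFS                    Raw device or local FS
Execution         MapReduce               Executor
Latency           Very high               Low
Data scale        Large                   Small
Data updates      No                      Yes
Indexes           Index, bitmap index…    Complex, mature indexing mechanisms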
Source : http://sishuok.com/forum/blogPost/list/6220.html

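Pig vs Hive
Feature           Hive              Pig
Language          SQL-like          Pig Latin
Schemas/Types     Yes (explicit)    Yes (implicit)
Partitions        Yes               No
Thrift Server     Yes               No
Web interface     Yes               No
JDBC/ODBC         Yes (limited)     No
HDFS operations   No                Yes
• Hive is better suited to data warehouse tasks: static structures and work that needs frequent analysis
• Pig gives developers more flexibility with Big Data and lets them write concise scripts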
Source : http://f.dataguru.cn/thread-33553-1-1.html

Having both the pig and the bee: HCatalog
• Provides:
– A read/write interface to the "metastore" for MapReduce, Pig, and Hive
– A command-line interface
figure Source : http://wiki.gurubee.net/pages/viewpage.action?pageId=26739793

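Problem
• Scenario:
– Continuing from the previous case, management says that running the statistics once a month takes too long; the frequency must change to once a day so the numbers stay current
• Problem:
– Every day, each of these many tables has to be converted to CSV, imported into HDFS, loaded into Hive, and then computed… by then it is already dark
• Solution:
– Hire a part-time student…
– Use Sqoop…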

Sqoop: the bridge between RDBs and Hadoop
• Apache Sqoop = SQL to Hadoop
• Reads data from
– RDBMS
– Data warehouses
– NoSQL
• Writes data to
– Hive
– HBase
• Integrates with Oozie
– Jobs can be scheduled
figure Source : http://bigdataanalyticsnews.com/data-transfer-mysql-cassandra-using-sqoop/

How to use Sqoop
figure Source : http://hive.3du.me/slide.html

Minimally invasive surgery with Hive + Sqoop
Input:
nm    dp       id      id  dt   hr
Liu   North    A1      A1  7/7  13
Li    Central  B1      A1  7/8  12
Wang  Central  B2      A1  7/9  4
Result: North A1 Liu 12.5
HiveQL (loading the files by hand)
> create …………
> load data inpath 'file1' into table A;
> load data inpath 'file2' into table B;
> insert overwrite table final select a.dp, a.id, a.nm, avg(b.hr)
from a join b on b.id = a.id where b.hr > 8 group by a.dp, a.id, a.nm;
HiveQL (with Sqoop importing the tables, the load steps disappear)
> create …………
> insert overwrite table final select a.dp, a.id, a.nm, avg(b.hr)
from a join b on b.id = a.id where b.hr > 8 group by a.dp, a.id, a.nm;

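Problem
• Scenario:
– Ever since I found out how handy Hive is, everything that the old RDB could not hold or could not store has gone into Hive databases and tables. Paired with Sqoop, the data flows quite smoothly, but…
• Problem:
– Even when no complex computation is involved and I only want to fetch a single row, I always have to wait a very long time for Hive
• Solution:
– Sing a 韋禮安 song and wait it out
– Use Impala
– Use HBase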

A few things about Impala
• Goal: address the latency of batch processing and the slow, inconvenient access to data
• A near-real-time SQL query tool
• Roughly 6 to 60 times faster than Hive
figure Source : http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/

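HBase, a NoSQL database
• HBase is a NoSQL database modeled after Google's BigTable
• Characteristics:
– A table-like data structure (multi-dimensional map)
– Distributed
– High availability, high performance
– Capacity and performance are easy to scale
• Why HBase:
– Random read/write access to data inside Hadoop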

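HBase is "not, not, not" a relational database
• HBase is not a relational database management system (RDBMS)
– A table has only one primary index: the row key
– No SQL syntax (no join, for example)
• Provides a Java library, plus REST and Thrift interfaces
• Data is accessed with getRow() and Scan()
– getRow() fetches the data of a single row, and can also specify a version (timestamp)
– Scan() fetches the whole table or a row range (set a start key and an end key)
• insert, update, and delete all write data
– insert in HBase is put()
– Repeating put() on the same cell => update
– Delete() = puts a deletion marker on that cell
• Row key design is the most important point of all in HBase design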
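The point that insert, update, and delete are all writes is easier to see in a toy model. The sketch below is plain Python, not the HBase client API: `ToyTable`, its method names, and the logical timestamps from a counter are assumptions made for illustration, and real HBase handles versions, delete markers, and compaction with far more detail.

```python
import itertools


class ToyTable:
    """Toy model of versioned cells: every write appends a timestamped entry."""

    _DELETED = object()  # marker value standing in for a delete tombstone

    def __init__(self):
        self._cells = {}  # (row key, column) -> list of (timestamp, value)
        self._clock = itertools.count(1)  # logical timestamps, not wall-clock time

    def put(self, row, column, value):
        """Insert or update: append a new version and return its timestamp."""
        ts = next(self._clock)
        self._cells.setdefault((row, column), []).append((ts, value))
        return ts

    def delete(self, row, column):
        """A delete is also a write: it appends a marker instead of erasing data."""
        return self.put(row, column, self._DELETED)

    def get(self, row, column, timestamp=None):
        """Newest live value at or before `timestamp` (newest overall if None)."""
        versions = self._cells.get((row, column), [])
        last_delete = max(
            (ts for ts, v in versions if v is self._DELETED), default=0
        )
        live = [
            (ts, v)
            for ts, v in versions
            if ts > last_delete and (timestamp is None or ts <= timestamp)
        ]
        return max(live, key=lambda tv: tv[0])[1] if live else None

    def scan(self, start_key, end_key):
        """Yield (row, column, value) for row keys in [start_key, end_key)."""
        for row, column in sorted(self._cells):
            if start_key <= row < end_key:
                value = self.get(row, column)
                if value is not None:
                    yield row, column, value


table = ToyTable()
first = table.put("A1", "info:hr", 13)   # insert
table.put("A1", "info:hr", 12)           # second put on the same cell acts as an update
table.put("B1", "info:hr", 9)
table.delete("B1", "info:hr")            # the marker hides the older version

print(table.get("A1", "info:hr"))                   # 12
print(table.get("A1", "info:hr", timestamp=first))  # 13
print(table.get("B1", "info:hr"))                   # None
print(list(table.scan("A", "B")))                   # [('A1', 'info:hr', 12)]
```

Because a scan walks row keys in sorted order between a start key and an end key, the choice of row key decides which rows sit next to each other, which is why row key design matters so much.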

What HBase data looks like
• "Row key", "column family", "column qualifier", "timestamp", "cell"
figure Source : http://www.slideshare.net/hanborq/h-base-introduction

Problem
• Problem:
– My work needs many statistical analysis methods, machine learning, data mining, and so on; neither Hive nor Pig is a good fit…
• Solution:
– Machine learning => Mahout
– Statistical analysis => RHadoop

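Mahout = elephant driver
• Mahout = scalable machine learning algorithms
• Implements some data mining algorithms on MapReduce
• Algorithm categories (each provides implementations of several classic algorithms):
– Recommendation engines (in Mahout this means collaborative filtering specifically)
– Dimension reduction
– Vector similarity
– Classification
– Clustering
– Pattern mining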
Figure labels: Regression, Recommenders, Clustering, Classification, Freq. Pattern Mining, Vector Similarity, Non-MR Algorithms, Examples, Dimension Reduction, Evolution
figure Source : http://www.slideshare.net/chaoyu0513/hit20130928-apache-mahout

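Good news for R users who handle big data: RHadoop
• R is a renowned language in the field of statistics
• Mainly used for statistical analysis, plotting, data mining, and matrix computation
• CRAN, the Comprehensive R Archive Network
– A free package library, like Perl's CPAN
• Revolution RHadoop
– rmr2, rhdfs, rhbase …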
figure Source : http://www.r-project.org/

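Problem
• Scenario:
– Now that I have learned all eighteen martial arts of Hadoop, I have designed many applications built on different parts of the ecosystem. But my boss wants the whole chain, src txt -> { flume => MR => hive or pig => sqoop } -> dst DB, strung together and run every day before dawn. If it lives he wants to see the result; if it dies he wants to see the error message…
• Approach:
– Glue the whole chain together with a shell script…
– Use Oozie…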

Oozie: the workflow manager for Hadoop
• Combines multiple jobs to accomplish a larger task
• Includes
– Control flow (start, end, kill, fork, join)
– Actions (mapreduce/java/pig/hive)
• No code to write: the workflow is defined in XML
figure Source : http://www.slideshare.net/martyhall/hadoop-tutorial-oozie
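The control-flow idea on this slide, where each action names where to go on success and where to go on failure, can be sketched in a few lines. This is a toy Python model and not Oozie: `run_workflow`, the step functions, and the node names are hypothetical, real workflows are declared in XML, and fork/join are left out to keep the sketch short.

```python
def run_workflow(nodes, start):
    """Walk the workflow from `start` until it reaches "end" or "kill".

    `nodes` maps a node name to (action, next node on ok, next node on error).
    Returns the list of visited nodes, including any error message.
    """
    current, trace = start, []
    while current not in ("end", "kill"):
        action, ok_to, error_to = nodes[current]
        trace.append(current)
        try:
            action()
            current = ok_to
        except Exception as exc:
            # Record the error message, then follow the error transition.
            trace.append(f"error: {exc}")
            current = error_to
    trace.append(current)
    return trace


def ok_step():
    """Stand-in for a job that succeeds."""


def failing_step():
    """Stand-in for a job that fails."""
    raise RuntimeError("hive query failed")


healthy = {
    "flume": (ok_step, "hive", "kill"),
    "hive": (ok_step, "sqoop", "kill"),
    "sqoop": (ok_step, "end", "kill"),
}
broken = dict(healthy, hive=(failing_step, "sqoop", "kill"))

print(run_workflow(healthy, "flume"))  # ['flume', 'hive', 'sqoop', 'end']
print(run_workflow(broken, "flume"))   # ['flume', 'hive', 'error: hive query failed', 'kill']
```

Either way the run finishes with something to show: the result on the "end" path, or the error message on the "kill" path, which is what the boss in the scenario asked for.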

Recap
• ETL
– Apache Flume
– Apache Sqoop
• DB
– Apache HBase
– Apache Hive
– Apache Impala
• Calculate
– Apache Pig
– Apache Mahout
– R Hadoop
• Workflow
– Apache Oozie

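Advice
• Hadoop is currently the most widely used framework in the big data field; on top of it, you can use it more intelligently
• When the data is not big enough, the benefits of Hadoop's big data analysis are hard to realize
• Big data talent: knowing computer science and statistics is not enough, you also have to tell stories
– A complete team that can take on data science ideally includes four roles: programmers who know computer science, data analysts who know statistics, graphic designers who understand visual presentation and are good at packaging and communicating, and project drivers with industry knowledge. (《遠見雜誌》 (Global Views Monthly), issue 334, April 2014)
• Be careful not to fall into the traps: eight reasons big data projects fail
– (Yahoo)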

Pig example result
A = LOAD '/user/waue/pig_input/file1.txt' USING PigStorage(',') AS (nm, dp, id);
B = LOAD '/user/waue/pig_input/file2.txt' USING PigStorage(',') AS (id, dt, hr);
C = FILTER B BY hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';

Hive example result
INSERT OVERWRITE TABLE final
select a.dp, a.id, a.nm, avg(b.hr)
from a join b on b.id = a.id
where b.hr > 8
group by a.dp, a.id, a.nm;
