Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
2. OUTLINE
● Oozie overview
● When Oozie is a good fit
● How Oozie works and its characteristics
● Oozie core components
● Oozie in practice and tips
● Introduction to the Oozie programming interfaces
● A first look at Kettle, an open-source graphical ETL tool that supports Oozie
● Summary and outlook
Berlin | 2014.01.14 | Teng Qiu
3. OOZIE OVERVIEW
● Workflow engine
● Runs a set of Hadoop jobs in sequence
● Directed acyclic graph, DAG
● Workflow 1:1 Coordinator n:1 Bundle
● A Coordinator can be triggered by events or run on a schedule like a cron job; time-based scheduling supports UTC only
● XML as the workflow description language: hPDL (Hadoop Process Definition Language)
● Similar to jPDL, used in JBoss jBPM
● Control Flow Nodes steer the execution path: start, end, fail / kill, decision, fork-join
● Action Nodes:
  ● HDFS, MapReduce, Pig, Hive, Sqoop, Java, SSH, E-Mail, Sub-Workflow
  ● (mkdir, delete, move, chmod, touchz, DistCp)
● State is stored in a database: Derby / MySQL
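5. WHEN OOZIE IS A GOOD FIT
● Tasks that must run periodically, such as ETL
cron job A, on host hdp01, starts at minute 15 of every hour and processes raw data set 1
cron job B, on host hdp05, starts at minute 20 of every hour and processes raw data set 2
cron job C, on host hdp11, starts at minute 50 of every hour, reads the results of A and B, and processes them
● Tables in an RDBMS => HBase table / Hive table
● Triggers / stored procedures in an RDBMS
=> HBase RegionObserver and Endpoint coprocessors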
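7. HOW OOZIE WORKS AND ITS CHARACTERISTICS
● How it works
  ● oozie:launcher:T=:W=:A=
  ● Based on the workflow XML, the Oozie server submits a map-only MR job
  ● The map task wraps the user-defined action and submits job.jar and job.xml to the JobTracker through JobClient
  ● The action job starts running while the map-only job waits => Oozie always occupies one extra map slot
  ● Action status is obtained via callback / polling
  ● Normally, completion is reported through the callback URL
● Characteristics
  ● Load balancing and fault tolerance / retry come from the MapReduce framework
  ● Parameterization is supported through the Java EL language
  ● DAG, no retry on failure (Error / Exception / exit code != 0)
  ● But a workflow can be rerun (oozie.wf.rerun.failnodes=true or oozie.wf.rerun.skip.nodes=xxx,yyy,zzz)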
9. OOZIE CORE COMPONENTS
Control Flow Nodes
● decision node
  ${wf:conf("etl_only_do_something") eq "yes"}
● fork-join
  ● A bug: OOZIE-1142, fixed after 3.3.2
  ● Workaround: in oozie-site.xml, set oozie.validate.ForkJoin to false
10. OOZIE CORE COMPONENTS
Action Nodes
● HDFS
  ● move, delete, mkdir, chmod, touchz, DistCp
● MapReduce
  ● job.xml specifies the M/R classes and directories
● Pig / Hive
  ● <job-xml>hive-site.xml</job-xml>
  ● <script>${hiveScript}</script>
● SSH
  ● public key!!! (sigh)
  ● <host>, <command>, <args>
● Sub Workflow
  ● <propagate-configuration/>
12. OOZIE CORE COMPONENTS
Action Nodes
● Java Action
  ● <main-class>
  ● <arg>
  ● <capture-output />
  ● ${wf:actionData('action-node-name')['property-name']}
// Oozie passes the path of the output file in this system property
String oozieProp = System.getProperty("oozie.action.output.properties");
if (oozieProp != null) {
    Properties props = new Properties();
    props.setProperty(propKey, propVal);
    // try-with-resources closes the stream even if store() fails
    try (OutputStream os = new FileOutputStream(new File(oozieProp))) {
        props.store(os, "Results from oozie task");
    }
}
13. OOZIE CORE COMPONENTS
Action Nodes
● Custom actions
  ● Extend the ActionExecutor class
    ● Constructor calls super(ACTION_TYPE)
    ● ActionExecutor.Context
    ● start / end / kill / check
  ● Modify oozie-site.xml
    ● Add the custom class name to the property
    ● oozie.service.ActionService.executor.ext.classes
● Maybe write one for Impala?
14. OOZIE IN PRACTICE AND TIPS
Scenario
● A typical DMP (Data Management Platform) ETL application
● Aggregate user behavior and classify users
User behavior tables 1..n, TTL = 30 days
User   Time   Product
A      101    XXX
B      102    YYY
A      103    ZZZ

Product category table
Product   Categories
XXX       1,2,3
YYY       4,3,2
ZZZ       7,9,8

Final result
External user ID   Classification   Generation
A1                 1,7,2,3,9,8      0
B1                 4,3,2            0
● Intermediate tables: internal user classification: A -> 1,7,2,3,9,8 | B -> 4,3,2
  Internal user ID to external user ID mapping: A -> A1 | B -> B1
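
The data flow of this scenario can be sketched in plain Java. This is only an in-memory illustration using the example rows from the slide, not the HBase coprocessor / Hive implementation the workflow itself uses; the class and method names (UserCategorySketch, classify, toExternal) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory illustration of the slide's example tables.
final class UserCategorySketch {

    // One row of a user behavior table: (user, time, product).
    static final class Event {
        final String user;
        final int time;
        final String product;

        Event(String user, int time, String product) {
            this.user = user;
            this.time = time;
            this.product = product;
        }
    }

    // Intermediate table: internal user -> union of the categories of all products seen.
    static Map<String, Set<Integer>> classify(List<Event> events,
                                              Map<String, List<Integer>> productCategories) {
        Map<String, Set<Integer>> result = new LinkedHashMap<>();
        for (Event e : events) {
            List<Integer> categories = productCategories.get(e.product);
            if (categories == null) {
                continue; // product without a category entry: nothing to add
            }
            result.computeIfAbsent(e.user, k -> new LinkedHashSet<>()).addAll(categories);
        }
        return result;
    }

    // Final result: replace internal user IDs with external user IDs.
    static Map<String, Set<Integer>> toExternal(Map<String, Set<Integer>> byInternalUser,
                                                Map<String, String> idMapping) {
        Map<String, Set<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Integer>> entry : byInternalUser.entrySet()) {
            String externalId = idMapping.get(entry.getKey());
            if (externalId != null) {
                result.put(externalId, entry.getValue());
            }
        }
        return result;
    }
}
```

With the rows from the slide, user A ends up with the category set {1, 2, 3, 7, 8, 9} and user B with {2, 3, 4}, which matches the intermediate table up to ordering.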
15. OOZIE IN PRACTICE AND TIPS
[Workflow diagram]
START
-> ZKClient getGen and checkTime: 1) get old and new generation, 2) compare lastImportedTime vs. lastExportedTime
-> decision: is there new data?
   No  -> E-Mail Client *, MSG: nothing to export -> END (Successful)
   Yes -> fork: coprocessor Client aggregate X events | aggregate Y events | aggregate Z events -> join
       -> Hive Script to generate export table
       -> fork: Hive/FTP Script to create/send export files for A | for B | for C -> join
       -> ZKClient-setGen: set new generation
Error transitions lead to:
   ZKClient-fail-after-coproc: set generation back
   E-Mail Client *, MSG: failed after coproc
   E-Mail Client *, MSG: failed by ZK Client
   KILLED with ERROR
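
The "is there new data?" decision can be sketched as a small Java helper. The slide only says that lastImportedTime and lastExportedTime are compared; the rule below (new data exists when the last import is later than the last export) and the names ExportDecisionSketch, hasNewData and decisionValue are assumptions made for illustration.

```java
import java.time.Instant;

// Hypothetical helper mirroring the "is there new data?" decision.
final class ExportDecisionSketch {

    // Assumption: new data exists when the last import is later than the last export.
    static boolean hasNewData(Instant lastImportedTime, Instant lastExportedTime) {
        if (lastImportedTime == null) {
            return false; // nothing has been imported yet
        }
        if (lastExportedTime == null) {
            return true; // data was imported but never exported
        }
        return lastImportedTime.isAfter(lastExportedTime);
    }

    // A "yes" / "no" value that a Java action could publish for a decision node to test.
    static String decisionValue(Instant lastImportedTime, Instant lastExportedTime) {
        return hasNewData(lastImportedTime, lastExportedTime) ? "yes" : "no";
    }
}
```

Publishing a plain "yes" / "no" string keeps the decision node's EL test as simple as the eq "yes" comparison shown on the control flow slide.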
17. OOZIE IN PRACTICE AND TIPS
The first step of a long march: getting it to run
● Using Oozie
  ● Command line
  ● Java Client API / REST API
  ● Hue
jobTracker=xxx:8021
nameNode=xxx:8020
oozie.coord.application.path=${workflowRoot}/coordinator.xml
oozie.wf.application.path=${workflowRoot}/workflow.xml
$ oozie job -oozie http://fxlive.de:11000/oozie -config /some/where/job.properties -run
$ oozie job -oozie http://fxlive.de:11000/oozie -info 0000001-130104191423486-oozie-oozi-W
$ oozie job -oozie http://fxlive.de:11000/oozie -log 0000001-130104191423486-oozie-oozi-W
$ oozie job -oozie http://fxlive.de:11000/oozie -kill 0000001-130104191423486-oozie-oozi-W
● ShareLib
  ● /usr/lib/oozie/oozie-sharelib.tar.gz
  ● sudo -u oozie hadoop fs -put share /user/oozie/
  ● In job.properties: oozie.use.system.libpath=true
  ● oozie.service.WorkflowAppService.system.libpath
  ● oozie.libpath=${nameNode}/xxx/xxx/jars
18. OOZIE IN PRACTICE AND TIPS
It won't run?
● Permission problems
  ● Error: E0902 : E0902: Exception occured: [org.apache.hadoop.ipc.RemoteException: User: oozie is not allowed to impersonate xxx]
  ● Set in core-site.xml:
    ● hadoop.proxyuser.oozie.groups
    ● hadoop.proxyuser.oozie.hosts
● The ForkJoin bug
  ● Error: E0735 : E0735: There was an invalid "error to" transition to node [xxx] while using fork/join
  ● OOZIE-1142
  ● In oozie-site.xml, set oozie.validate.ForkJoin to false
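20. OOZIE IN PRACTICE AND TIPS
Assorted Hive errors
● Every Hive action node must specify hive-site.xml through <job-xml>
● FAILED: Error in metadata
  ● NestedThrowables: JDOFatalInternalException or InvocationTargetException
  ● Check the driver for the database used by the MetaStore
  ● For example the MySQL Java Connector: is mysql-connector-java-xxx-bin.jar in the workflow's lib directory?
● Directory permissions
  ● Hive's warehouse and tmp directories must be writable by the user who starts the Oozie job
● If HBase is to be integrated
  ● auxpath and ZooKeeper settings in hive-site.xml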
22. OOZIE IN PRACTICE AND TIPS
TIP: global properties
● Property checking and substitution
<workflow-app name="">
    <parameters>
        <property>
            <!-- If current_month is not specified, the job fails with Error: E0738 -->
            <name>current_month</name>
        </property>
        <property>
            <!-- If current_date is not specified, this is set to '' -->
            <name>currentDate</name>
            <value>${concat(concat("'", wf:conf('current_date')), "'")}</value>
        </property>
        <property>
            <name>dateFrom</name>
            <value>${concat(concat("'", firstNotNull(wf:conf('current_date'), concat(wf:conf('current_month'), '-01'))), "'")}</value>
        </property>
        <property>
            <name>dateTo</name>
            <value>${concat(concat("'", firstNotNull(wf:conf('current_date'), concat(wf:conf('current_month'), '-31'))), "'")}</value>
        </property>
    </parameters>
    ...
24. OOZIE IN PRACTICE AND TIPS
Collecting KPI values while a workflow runs
● MapReduce action / Pig action
  ● hadoop:counters
  ● ${hadoop:counters("mr-node-name")["FileSystemCounters"]["FILE_BYTES_READ"]}
● Java / SSH action
  ● <capture-output />
  ● ${wf:actionData('java-action-node-name')['property-name']}
  ● ${wf:action:output('ssh-action-node-name')['property-name']}
● Hive: no good way
  ● hive -e -S
25. OOZIE IN PRACTICE AND TIPS
Passing Java action output back to Oozie
● Using Java output as variables
  ● <capture-output />
  ● Write Properties in the program
// Oozie passes the path of the output file in this system property
String oozieProp = System.getProperty("oozie.action.output.properties");
if (oozieProp != null) {
    Properties props = new Properties();
    props.setProperty("last.import.date", "2013-12-01T00:00:00Z"); // ISO-8601 date format
    // try-with-resources closes the stream even if store() fails
    try (OutputStream os = new FileOutputStream(new File(oozieProp))) {
        props.store(os, "Results from oozie task");
    }
}
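
For reference, the fragment above can be wrapped into a complete main class along the following lines. This is a minimal sketch: the class name CaptureOutputSketch and the publish helper are hypothetical, and the fallback message for runs outside Oozie is an addition for local testing.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

// Hypothetical main class for a Java action that uses <capture-output />.
final class CaptureOutputSketch {

    // Writes one key/value pair to the file named by the system property
    // oozie.action.output.properties. Returns false if the property is not set,
    // for example when the class is run outside of Oozie.
    static boolean publish(String key, String value) throws IOException {
        String oozieProp = System.getProperty("oozie.action.output.properties");
        if (oozieProp == null) {
            return false;
        }
        Properties props = new Properties();
        props.setProperty(key, value);
        try (OutputStream os = new FileOutputStream(new File(oozieProp))) {
            props.store(os, "Results from oozie task");
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        boolean written = publish("last.import.date", "2013-12-01T00:00:00Z");
        System.out.println(written ? "output captured" : "not running under Oozie");
    }
}
```

A later node can then read the value with ${wf:actionData('java-action-node-name')['last.import.date']}.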
27. OOZIE IN PRACTICE AND TIPS
Capturing output variables has its risks too
● The output data of an Oozie action has a default size limit of only 2K!
Failing Oozie Launcher, Output data size [4 321] exceeds maximum [2 048]
Failing Oozie Launcher, Main class [com.myactions.action.InitAction], exception invoking main(), null
org.apache.oozie.action.hadoop.LauncherException
at org.apache.oozie.action.hadoop.LauncherMapper.failLauncher(LauncherMapper.java:571)
● Modify oozie-site.xml
<property>
    <name>oozie.action.max.output.data</name>
    <value>1048576</value>
</property>
● Set it to 1M
● And then... Oozie has to be restarted
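
One way to catch this before the launcher fails is to measure the properties inside the Java action before writing them. The sketch below is an approximation: it assumes the limit applies roughly to the size of the stored properties file, which may not be exactly how Oozie measures it, and the names OutputSizeSketch, storedSize and fitsWithin are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical pre-check for the size of captured output.
final class OutputSizeSketch {

    // Default limit quoted on the slide (2K).
    static final int DEFAULT_MAX_OUTPUT_DATA = 2048;

    // Size in bytes of the properties as Properties.store would write them.
    static int storedSize(Properties props) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        props.store(buffer, "Results from oozie task");
        return buffer.size();
    }

    // True if the stored form does not exceed maxBytes.
    static boolean fitsWithin(Properties props, int maxBytes) throws IOException {
        return storedSize(props) <= maxBytes;
    }
}
```

An action could call fitsWithin(props, OutputSizeSketch.DEFAULT_MAX_OUTPUT_DATA) and fail with a clear message of its own, or trim the output, instead of relying on the LauncherException.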
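28. INTRODUCTION TO THE OOZIE PROGRAMMING INTERFACES
● Oozie Web Services API
  ● HTTP REST API
  ● curl -X POST -H "Content-Type: application/xml" -d @config.xml "http://localhost:11000/oozie/v1/jobs?action=start"
● Oozie Java client API
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

OozieClient oozieClient = new OozieClient(oozieUrl);   // oozieUrl, e.g. "http://localhost:11000/oozie"
Properties prop = oozieClient.createConfiguration();   // then set the job properties on it
String jobId = oozieClient.run(prop);                  // submit and start the job
WorkflowJob job = oozieClient.getJobInfo(jobId);       // fetch the job's current state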
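
The config.xml posted by the curl call carries the same key/value pairs as job.properties. Assuming it uses the Hadoop-style configuration layout (a configuration element containing property entries with name and value), it could be generated as in the sketch below; the class OozieConfigXmlSketch and its methods are hypothetical helpers, not part of the Oozie client library.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that renders job properties as a configuration XML document.
final class OozieConfigXmlSketch {

    // Escapes the characters that are not allowed in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Renders the properties in sorted key order so the output is stable.
    static String toConfigXml(Map<String, String> props) {
        StringBuilder xml = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : new TreeMap<>(props).entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(escape(e.getKey())).append("</name>\n")
               .append("    <value>").append(escape(e.getValue())).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }
}
```

The resulting string can be saved as config.xml and passed to curl with -d @config.xml.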
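30. SUMMARY AND OUTLOOK
● An effective replacement for cron jobs inside a Hadoop cluster
● Tightly integrated with Hadoop; user permissions can be managed in one place
● Error alerting and handling (rerun) for workflow nodes
● Flexible control of the workflow through control flow nodes
● Supports more task types than Azkaban
● But this comes at a cost: one map slot is always occupied
● Unlike Azkaban, supports variables and the EL language
● The coordinator provides an event-triggered start mode
● Rich API
● No HBase support
● Writing the XML is tedious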