Introduction to Pig & ZooKeeper

Yoyo Cheng
ISCAS

ZooKeeper

• What is ZooKeeper
• ZooKeeper's operating modes
• How it works
• API

What is ZooKeeper

• a high-performance coordination service for
distributed applications
• common services
– naming
– configuration management
– synchronization
– group services
• used by HBase, the Yahoo! Message Broker, and the Fetch
Service of the Yahoo! crawler (comparable to Google's
Chubby, which is based on Paxos)

Example
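• Suppose there are 20 search servers, each responsible for
searching one part of the full index; 15 of them are currently
serving searches and 5 are building indexes. There is also a
master server (which sends search requests to these 20 servers
and merges the result sets), a backup master (which replaces
the master when it goes down), and a web CGI (which sends
search requests to the master).
• These 20 servers change roles frequently: a server that is
serving searches stops serving and starts building an index,
or a server that has finished building its index becomes
available to serve searches.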

What can ZooKeeper do?

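• Lets the master automatically detect how many search
servers are available and send search requests to them
• Automatically activates the backup master when the
master goes down
• Lets the web CGI automatically learn of changes to the
master's network address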

How does ZooKeeper do it?

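• The master automatically detects how many search servers
are available and sends search requests to them
– Step 1: Each search server creates a znode in ZooKeeper:
zk.create("/search/nodes/node1",
"hostname".getBytes(), Ids.OPEN_ACL_UNSAFE,
CreateMode.EPHEMERAL);
– Step 2: The master fetches the list of children of a znode
from ZooKeeper: zk.getChildren("/search/nodes", true);
– Step 3: The master iterates over these children and reads
each child's data to build the list of search servers.
– Step 4: When the master receives a children-changed event,
it goes back to step 2.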
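The four steps above can be sketched without a running ensemble. The following is only an in-memory stand-in, not the ZooKeeper client: MembershipSketch, MasterSketch, and all their methods are hypothetical names made up for illustration. It is meant to show one detail the steps rely on: a ZooKeeper watch fires once, so the master has to re-read the children and set the watch again after every event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the children of /search/nodes (NOT the ZooKeeper API).
final class MembershipSketch {
    interface ChildWatcher { void childrenChanged(); }

    private final Map<String, String> children = new TreeMap<>(); // node name -> hostname
    private final List<ChildWatcher> watchers = new ArrayList<>();

    /** Step 1: a search server registers itself (stands in for an ephemeral create). */
    void register(String node, String hostname) {
        children.put(node, hostname);
        fireWatches();
    }

    /** Stands in for an ephemeral znode vanishing when its server's session ends. */
    void sessionClosed(String node) {
        if (children.remove(node) != null) fireWatches();
    }

    /** Steps 2-3: list the children, read their data, optionally leave a one-shot watch. */
    List<String> getHosts(ChildWatcher watcher) {
        if (watcher != null) watchers.add(watcher);
        return new ArrayList<>(children.values());
    }

    // Each watch fires once and is then discarded, so watchers must re-register.
    private void fireWatches() {
        List<ChildWatcher> pending = new ArrayList<>(watchers);
        watchers.clear();
        for (ChildWatcher w : pending) w.childrenChanged();
    }
}

// Hypothetical master: keeps its server list current by re-reading on every change (step 4).
final class MasterSketch implements MembershipSketch.ChildWatcher {
    private final MembershipSketch registry;
    private List<String> hosts;

    MasterSketch(MembershipSketch registry) {
        this.registry = registry;
        childrenChanged(); // initial read, which also sets the first watch
    }

    @Override public void childrenChanged() {
        hosts = registry.getHosts(this); // re-read and re-arm the watch
    }

    List<String> hosts() { return hosts; }
}
```

With a real client, the same loop would presumably be driven by the Watcher callback that the getChildren call registers.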


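• When the master goes down, the backup master takes over
automatically

– The master creates a node in ZooKeeper: zk.create("/search/master", "hostname".getBytes(),
Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

– The backup master watches the "/search/master" node in ZooKeeper. When this znode changes (it is
ephemeral, so it disappears once the master's session ends), the backup promotes itself to master
and stores its own network address in the node.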


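• The web CGI automatically learns of changes to the
master's network address
– The web CGI reads the master's network address from
the "/search/master" node in ZooKeeper and sends
search requests to it.
– The web CGI watches the "/search/master" node in
ZooKeeper. When this znode's data changes, it reads
the master's network address from the node again and
updates the address it currently uses.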

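Standalone or quorum
• Standalone
– A single ZooKeeper server
– Convenient for testing
– Cannot guarantee high performance or high availability
• Quorum:
– As long as a majority of the machines in the ensemble are
working, the service stays available and performs well
– Example: in a five-node ensemble, any two nodes can fail
and the service keeps working
– Principle: every modification to the znode tree is
replicated to a majority of the machines in the ensemble
– ZooKeeper uses the Zab protocol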
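The five-node example follows from simple majority arithmetic, sketched below. QuorumMath is a hypothetical helper written for this slide, not part of ZooKeeper; it only assumes that a quorum means a strict majority of the ensemble.

```java
// Hypothetical helper for the majority arithmetic; not part of the ZooKeeper API.
final class QuorumMath {
    private QuorumMath() {}

    /** Smallest number of servers that forms a strict majority of the ensemble. */
    static int majority(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensemble must have at least one server");
        }
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that may fail while a majority is still running. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            System.out.println(n + " servers: majority=" + majority(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

Under this rule a six-node ensemble tolerates two failures, the same as a five-node one, which is why ensembles usually have an odd number of machines.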

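Zab: two phases

• Phase 1: Leader election
– The ensemble elects a distinguished member (one
ZooKeeper server) called the leader; the other
machines are called followers.
• Phase 2: Atomic broadcast
– All write requests are forwarded to the leader, which
broadcasts the update to the followers. Once a majority
have applied the change, the leader commits the update
and the client receives a response: update succeeded.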

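• The replicated database is an in-memory database that
holds the entire data tree. Updates are logged to disk for
recoverability, and a write is persisted to disk before it
is applied to the in-memory database.
• The messaging layer takes care of replacing a failed leader
and syncing the followers.
• When the leader receives a write request, it calculates the
state the system will be in once the write is applied, and
turns the request into a transaction that captures this new state.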

Query
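• Queries read data on the server and never modify it.
• Every query is answered immediately by the server the client is connected to; no
coordination by the leader is needed.
• Queries can register a watcher to track changes at the given path (exists, getData and
getChildren accept one). Once the watched data changes (create, delete, modified,
children_changed), the server sends a notification that calls back the registered watcher.
• Query commands:

– 1. exists: checks whether a node exists at the given path (the Java client returns the node's Stat, or null if there is no such node).

– 2. getData: reads the data of the node at the given path.

– 3. getACL: reads the ACL of the node at the given path.

– 4. getChildren: lists all children of the node at the given path.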

Modify
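• Modify commands change node data, the tree structure, or permissions. Every modify command
is submitted to the leader for coordination and returns only after coordination completes.
• Coordination requires round trips between the leader and the followers. It also involves
writing the transaction log and, in the worse case, taking a snapshot, so it can be
time-consuming.
• ZooKeeper's communication is asynchronous. Under a continuous stream of requests, ZooKeeper
processes them together and sends them in batches whose size is bounded; when the request rate
is low, requests are sent immediately. This keeps throughput high under heavy load while still
keeping latency reasonably low.

• Modify commands:

– 1. createSession: asks the server to create a session

– 2. create: creates a node

– 3. delete: deletes a node

– 4. setData: changes a node's data

– 5. setACL: changes a node's ACL

– 6. closeSession: asks the server to close the session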

Pig

• What is Pig
• Why use Pig
• Where Pig is used
• How to use Pig

What is Pig

• A SQL-like, high-level query language
built on top of MapReduce

Motivation

• MapReduce is very powerful, but:
– It requires a Java programmer.
– Users re-invent the wheel (join, filter, etc.).

Pig Latin

• Pig provides a higher level language, Pig
Latin, that:
– Increases productivity. In one test
• 10 lines of Pig Latin ≈ 200 lines of Java.
• What took 4 hours to write in Java took 15
minutes in Pig Latin.
– Opens the system to non-Java programmers.
– Provides common operations like join, group,
filter, sort.

Why a New Language?

• Pig Latin is a data flow language.
• User code and existing binaries can be
included almost anywhere.
• Metadata not required, but used when
available.
• Nested types (map, list, collection...) are
supported as first-class types.

• Operates on files in HDFS

Background

• Yahoo! was the first big adopter of Hadoop.
• Hadoop gained popularity in the company
quickly.
• Yahoo! Research developed Pig to
address the need for a higher level
language.
• Roughly 30% of Hadoop jobs run at Yahoo!
are Pig jobs.

How Pig is Being Used

• Web log processing
• Data processing for web search platforms
• Ad hoc queries across large data sets.
• Rapid prototyping of algorithms for
processing large data sets

Accessing Pig

• Submit a script directly.
• Grunt, the Pig shell.
• PigServer Java class, a JDBC-like
interface.
• PigPen, an Eclipse plugin
– Allows textual and graphical scripting.
– Samples data and shows example data flow.

Data Types

• Scalar types: int, long, double, chararray,
bytearray.
• Complex types:
– map: associative array.
– tuple: ordered list of data, elements may be of
any scalar or complex type.
– bag: unordered collection of tuples.

How to use

• No need to install anything extra on your
Hadoop cluster
• Start a terminal and run
$ cd /usr/share/cloudera/pig/
$ bin/pig -x local
You should see a prompt like:
grunt>

Load Data
Users = LOAD 'users.txt'
    USING PigStorage(',') AS (name, age);

• LOAD … AS …
• PigStorage(',') to specify separator

users.txt      Users
               name  age
John,18        John  18
Mary,20        Mary  20
Bob,30         Bob   30

Filter
Fltrd = FILTER Users
    BY age >= 18 AND age <= 25;

• FILTER … BY …
• constraints can be composite

Users          Fltrd
name  age      name  age
John  18       John  18
Mary  20       Mary  20
Bob   30

Generate / Project
Names = FOREACH Fltrd GENERATE name;

• FOREACH … GENERATE

Fltrd          Names
name  age      name
John  18       John
Mary  20       Mary

Store Data
STORE Names INTO 'names.out';

• STORE … INTO …
• PigStorage(',') to specify separator if multiple
fields
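For comparison with the Motivation slide, the Load, Filter and Generate steps above could look roughly like this in plain Java. This is an in-memory sketch only: PigPipelineSketch and its method are made-up names, the input lines are passed in rather than read from users.txt, and nothing here runs on Hadoop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical plain-Java equivalent of the Load / Filter / Generate slides.
final class PigPipelineSketch {
    /** Parses "name,age" lines, keeps ages 18 to 25 inclusive, and returns the names. */
    static List<String> namesOfUsersAged18To25(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(","))           // LOAD ... USING PigStorage(',')
                .filter(fields -> {                      // FILTER ... BY age >= 18 AND age <= 25
                    int age = Integer.parseInt(fields[1].trim());
                    return age >= 18 && age <= 25;
                })
                .map(fields -> fields[0].trim())         // FOREACH ... GENERATE name
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("John,18", "Mary,20", "Bob,30");
        System.out.println(namesOfUsersAged18To25(users)); // prints [John, Mary]
    }
}
```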

Command - JOIN
Users = LOAD 'users' AS (name, age);
Pages = LOAD 'pages' AS (user, url);
Jnd = JOIN Users BY name, Pages BY user;

Users          Pages
name  age      user  url
John  18       John  yaho
Mary  20       Mary  goog
Bob   30       Bob   bing

Jnd
name  age  user  url
John  18   John  yaho
Mary  20   Mary  goog
Bob   30   Bob   bing
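What the JOIN computes can be sketched as an in-memory hash join. This illustrates the result only and is not how Pig executes a join on MapReduce; JoinSketch is a hypothetical name, and rows are plain String arrays in the column order shown on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of JOIN Users BY name, Pages BY user.
final class JoinSketch {
    /** Inner equi-join of users (name, age) with pages (user, url) on name == user. */
    static List<String[]> join(List<String[]> users, List<String[]> pages) {
        // Build side: index the pages by the join key (the user column).
        Map<String, List<String[]>> pagesByUser = new HashMap<>();
        for (String[] page : pages) {
            pagesByUser.computeIfAbsent(page[0], k -> new ArrayList<>()).add(page);
        }
        // Probe side: emit one output row per matching (user, page) pair.
        List<String[]> joined = new ArrayList<>();
        for (String[] user : users) {
            for (String[] page : pagesByUser.getOrDefault(user[0], new ArrayList<>())) {
                joined.add(new String[] {user[0], user[1], page[0], page[1]});
            }
        }
        return joined;
    }
}
```

Users without a matching page, and pages without a matching user, produce no output row, as in an inner join.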

Command - GROUP
Grpd = GROUP Jnd BY url;
DESCRIBE Grpd;

Jnd
name  age  url
John  18   yhoo
Mary  20   goog
Dee   25   yhoo
Kim   40   bing
Bob   30   bing

Grpd
yhoo  (John, 18, yhoo)
      (Dee, 25, yhoo)
goog  (Mary, 20, goog)
bing  (Kim, 40, bing)
      (Bob, 30, bing)
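The shape of the grouped result, one key with a bag of whole rows, can be mimicked in Java with a map of lists. GroupSketch is a hypothetical name, and the rows are assumed to be (name, age, url) as on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of GROUP Jnd BY url.
final class GroupSketch {
    /** Groups (name, age, url) rows by url: each key maps to the list of complete rows. */
    static Map<String, List<List<String>>> groupByUrl(List<List<String>> rows) {
        Map<String, List<List<String>>> groups = new LinkedHashMap<>(); // keeps first-seen key order
        for (List<String> row : rows) {
            groups.computeIfAbsent(row.get(2), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("John", "18", "yhoo"),
                Arrays.asList("Mary", "20", "goog"),
                Arrays.asList("Dee", "25", "yhoo"),
                Arrays.asList("Kim", "40", "bing"),
                Arrays.asList("Bob", "30", "bing"));
        System.out.println(groupByUrl(rows));
    }
}
```

Note that each grouped row still carries its url field, just as the tuples on the slide do.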

Other Commands

• ORDER – sort by a field
• COUNT – eval function: counts the number of elements
• COGROUP – structured JOIN
• More at http://hadoop.apache.org/pig/

Reference
• A First Look at ZooKeeper, http://bbs.hadoopor.com/thread-533-1-1.html
• ZooKeeper Distributed Installation Manual, http://bbs.hadoopor.com/thread-1541-1-1.html
• Installing ZooKeeper, http://bbs.hadoopor.com/thread-836-1-1.html
• Common Use Cases of Paxos in Large-Scale Systems, http://timyang.net/tag/zookeeper/
• Introduction to Pig programming, Yiwei Chen, Yahoo Search Engineering,
http://www.docstoc.com/docs/27501834/Introduction-to-Pig-programming
• Introduction to Pig, Alan Gates, Yahoo!,
http://www.cloudera.com/videos/introduction_to_pig ,
http://www.cloudera.com/videos/pig_tutorial
• Pig Latin: Language for Large Data Processing,
http://www.hadoop.tw/2010/04/pig.html
• Pig Installation and Configuration Tutorial, http://www.hadoopor.com/thread-236-1-1.html
• Learning Hadoop, Part 9: Running Pig,
http://sunjun041640.blog.163.com/blog/static/2562683220106240117330/
• http://hadoop.apache.org/pig/
• http://hadoop.apache.org/zookeeper/
• http://wiki.apache.org/hadoop/ZooKeeper
