2. 1. ZooKeeper architecture and features
2. ZooKeeper node roles
3. ZooKeeper configuration
4. ZooKeeper data model (introducing znodes, zxids, etc.)
5. ZooKeeper data reads and writes
6. Key mechanisms, including leader election, the roles of the log and
snapshots, why an ensemble should have an odd number of nodes, why a write
can only complete once more than half of the followers agree, and so on
3. What is ZooKeeper?
ZooKeeper is a distributed, open-source coordination service for distributed applications.
It exposes a simple set of primitives that distributed applications can build upon to
implement higher-level services for synchronization, configuration maintenance, and
groups and naming. It is designed to be easy to program against, and uses a data model
styled after the familiar directory-tree structure of file systems.
The motivation behind ZooKeeper is to relieve distributed applications of the
responsibility of implementing coordination services from scratch.
It is an open-source implementation of Chubby.
4. 12/11/14
Zookeeper architecture
ZooKeeper consists of multiple servers:
one leader and multiple followers.
High performance: it can be used in large, distributed systems.
Highly available: its reliability aspects keep it from being a single point of failure.
Strictly ordered access: sophisticated synchronization primitives can be implemented at the client.
5. The servers that make up the ZooKeeper service must all know about each other.
ZooKeeper uses a configuration file so that the servers know each other, and PING
messages are exchanged between followers and the leader to determine liveness.
Note: a PING here is a packet sent to a specified port.
6. ZooKeeper achieves high availability through replication, and can provide a
service as long as a majority of the machines in the ensemble are up.
For example, in a five-node ensemble, any two machines can fail and the
service will still work because a majority of three remain. Note that a six-node
ensemble can also tolerate only two machines failing, since with three
failures the remaining three do not constitute a majority of the six. For this
reason, it is usual to have an odd number of machines in an ensemble.
Another reason: with an even split, neither side can form a majority, so no value can be approved.
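The majority rule above reduces to a few lines of arithmetic. A small sketch (the function name is ours, not ZooKeeper's):

```python
def tolerable_failures(ensemble_size: int) -> int:
    """A ZooKeeper ensemble stays available while a strict majority of its
    servers is up, so it tolerates ensemble_size - majority failures."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority

# A five-node ensemble tolerates two failures (majority of 3 remains)...
print(tolerable_failures(5))  # 2
# ...and a six-node ensemble tolerates no more, because the extra server
# only raises the majority threshold from 3 to 4.
print(tolerable_failures(6))  # 2
```

This is exactly why even ensemble sizes buy no extra fault tolerance over the next-smaller odd size.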
7. Features
1. It is especially fast in "read-dominant" workloads.
2. ZooKeeper is replicated. Like the distributed processes it
coordinates, ZooKeeper itself is intended to be replicated over a set of
hosts called an ensemble.
3. Every update made to the znode tree is given a globally unique
identifier, called a zxid (which stands for "ZooKeeper transaction ID").
…………
8. Zookeeper Data Model
A shared hierarchical namespace, similar to a standard file system.
Each node in the tree is called a znode.
ZooKeeper is designed to store coordination data, so each znode is very small:
status information (version numbers for data changes, ACL changes, and
timestamps), configuration, and location information.
10. Znode data structure
czxid: the zxid of the change that caused this znode to be created.
mzxid: the zxid of the change that last modified this znode.
ctime: the time in milliseconds from epoch when this znode was created.
mtime: the time in milliseconds from epoch when this znode was last modified.
version: the number of changes to the data of this znode.
cversion: the number of changes to the children of this znode.
aversion: the number of changes to the ACL of this znode.
ephemeralOwner: the session id of the owner of this znode if the znode is an
ephemeral node; otherwise zero.
dataLength: the length of the data field of this znode. The maximum allowable
size of the data array is 1 MB.
numChildren: the number of children of this znode.
pzxid: the zxid of the change that last modified the children of this znode.
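The metadata fields above can be summarized as a plain record. An illustrative sketch only; the real client libraries expose their own Stat type, and the sample values below are made up:

```python
from dataclasses import dataclass

@dataclass
class ZnodeStat:
    """The znode metadata fields listed above."""
    czxid: int           # zxid of the create
    mzxid: int           # zxid of the last modification
    ctime: int           # creation time, ms since epoch
    mtime: int           # last-modified time, ms since epoch
    version: int         # number of data changes
    cversion: int        # number of child changes
    aversion: int        # number of ACL changes
    ephemeralOwner: int  # owning session id; 0 for persistent znodes
    dataLength: int      # bytes of data, at most 1 MB
    numChildren: int     # number of children
    pzxid: int           # zxid of the last change to this znode's children

# Hypothetical values for a persistent znode:
stat = ZnodeStat(czxid=1, mzxid=5, ctime=0, mtime=1000, version=3,
                 cversion=0, aversion=0, ephemeralOwner=0,
                 dataLength=12, numChildren=0, pzxid=1)
print(stat.ephemeralOwner)  # 0: not an ephemeral node
```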
11. ZooKeeper is replicated.
In theory, a client will see the same view of the system regardless of the server it
connects to.
Like the distributed processes it coordinates, ZooKeeper itself is intended to be
replicated over a set of hosts called an ensemble.
All of the servers hold the same data, which is guaranteed by a Fast Paxos-style
agreement algorithm.
12. Roles in ZooKeeper
• Leader: responsible for initiating and resolving the
final vote, and for applying the status update in the end.
Note: ZooKeeper can be configured so that
the leader does not accept client connections, by setting
zookeeper.leaderServes to "no".
• Follower: receives client requests and returns
results to the client. Participates in the votes
sponsored by the leader. The server synchronizes
with the leader and replicates its transactions.
13. • Observer: observers improve the read performance of the
cluster without affecting write performance: an observer only
serves read requests, and forwards write requests to the leader.
The problem is that as we add more voting members, the write
performance drops. This is because a write operation requires
the agreement of (in general) at least half the nodes in an ensemble, and
therefore the cost of a vote can increase significantly as more voters are
added.
peerType=observer
server.1:localhost:2181:3181:observer
detail: http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html
14. Read data from the connected server
Read requests are serviced from the local replica of each
server's database.
15. Write data: Paxos
• N senators make decisions on Paxos Island.
• Each proposal has an increasing PID.
• A proposal passes once more than half of the
senators approve it.
• Each senator only agrees to a proposal whose
PID is bigger than the current PID.
In ZooKeeper terms:
senator -> server
proposal -> znode change
PID -> ZooKeeper transaction id (zxid)
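The island analogy maps onto a toy simulation (class and function names are ours, not ZooKeeper's):

```python
class Senator:
    """Accepts only proposals whose PID is higher than any PID seen so far."""
    def __init__(self):
        self.current_pid = 0

    def vote(self, pid: int) -> bool:
        if pid > self.current_pid:
            self.current_pid = pid
            return True
        return False

def propose(senators, pid: int) -> bool:
    """A proposal passes once more than half of the senators agree."""
    yes_votes = sum(s.vote(pid) for s in senators)
    return yes_votes > len(senators) // 2

island = [Senator() for _ in range(5)]
print(propose(island, pid=1))  # True: all five accept a fresh PID
print(propose(island, pid=1))  # False: PID 1 is no longer greater than current
print(propose(island, pid=2))  # True: a higher PID is accepted again
```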
17. Write data: client to ZooKeeper
Write requests are processed by an agreement protocol:
a leader proposes a request, collects votes, and finally commits.
1. The client sends a write request to a server.
2. The server forwards the write request to the leader.
3. The leader sends a PROPOSAL message to all the followers (sent asynchronously).
4. Followers agree or deny (an ACK is sent by a follower after it has synced a proposal).
5. Commit.
6. The response is sent to the client.
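Steps 3 to 6 can be sketched as a quorum check. A toy model under our own names (real ZooKeeper servers exchange PROPOSAL/ACK/COMMIT messages over the network):

```python
def handle_write(servers, leader_index, value):
    """Toy write path: the leader proposes, live followers ACK, and the
    update commits once a quorum (leader included) agrees."""
    followers = [s for i, s in enumerate(servers) if i != leader_index]
    # Steps 3-4: send PROPOSAL, count ACKs; the leader counts itself.
    acks = 1 + sum(1 for f in followers if f["up"])
    if acks <= len(servers) // 2:
        return "rejected: no quorum"
    # Step 5: commit on every live server.
    for s in servers:
        if s["up"]:
            s["data"] = value
    return "ok"  # Step 6: response goes back to the client

ensemble = [{"up": True, "data": None} for _ in range(5)]
ensemble[3]["up"] = ensemble[4]["up"] = False  # two followers are down
print(handle_write(ensemble, leader_index=0, value="x"))  # ok: 3 of 5 agree
```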
18. Note:
All machines in the ensemble write updates to disk before updating their
in-memory copy of the znode tree.
Updates are logged to disk for recoverability, and writes are serialized to
disk before they are applied to the in-memory database.
http://zookeeper.apache.org/doc/r3.2.2/zookeeperOver.html
SyncRequestProcessor
ZkDatabase
On restart, ZkDatabase loads the database from disk into memory
at boot time.
This class maintains the in-memory database of ZooKeeper server state,
which includes the sessions, the data tree, and the committed logs.
It is booted up after reading the logs and snapshots from the disk.
19. Log and snapshot:
SyncRequestProcessor
When is a snapshot taken?
1. When the leader changes.
2. When a new server joins.
3. When logCount > (snapCount / 2 + randRoll).
Together with the logs, snapshots are used for recoverability.
Detail: http://rdc.taobao.com/team/jm/archives/947
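Condition 3 above can be written out directly. A sketch (function name is ours; 100,000 is the default value of zookeeper.snapCount):

```python
import random

SNAP_COUNT = 100_000  # default zookeeper.snapCount

def should_snapshot(log_count: int, rand_roll: int) -> bool:
    """A snapshot is taken once the number of logged transactions exceeds
    snapCount/2 plus a random offset. The random part staggers snapshots
    so the whole ensemble does not pause to snapshot at the same moment."""
    return log_count > SNAP_COUNT // 2 + rand_roll

rand_roll = random.randint(0, SNAP_COUNT // 2)  # rolled once per cycle
print(should_snapshot(10_000, rand_roll))   # False early in the cycle
```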
20. Questions
1. What happens when the leader crashes?
2. A follower may lag behind the leader,
so a client may read outdated data.
3. Why is agreement from more than half of the nodes enough for an update?
21. Leader Selection
Each server sends:
- the id of the leader it has selected
- its zxid
- its logic clock (initial value 0)
- its status: LOOKING, FOLLOWING, OBSERVING, or LEADING
Example with five servers starting one by one:
Step 1: Server1 starts; no response from the others, so it stays LOOKING.
Step 2: Server2 starts; Server2 would be leader, but fewer than half of the
servers agree, so it stays LOOKING.
Step 3: Server3 starts; Server3 is chosen as leader and more than half of the
servers agree, so it becomes LEADING.
Step 4: Server4 starts; there is a leader already, so it becomes FOLLOWING.
Step 5: Server5 starts; there is a leader already, so it becomes FOLLOWING.
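The rule each server uses to pick between two votes can be sketched as a lexicographic comparison. A simplified sketch of the FastLeaderElection ordering (field names are ours): a vote wins on a higher election epoch (logic clock), then a higher zxid (more up-to-date data), then a higher server id as the tie-breaker.

```python
def better_vote(a: dict, b: dict) -> dict:
    """Return the vote that should win: compare election epoch,
    then zxid, then server id, preferring the higher value."""
    key = lambda v: (v["epoch"], v["zxid"], v["sid"])
    return a if key(a) > key(b) else b

v1 = {"epoch": 0, "zxid": 40, "sid": 1}
v2 = {"epoch": 0, "zxid": 40, "sid": 3}
print(better_vote(v1, v2)["sid"])  # 3: same epoch and zxid, higher id wins

v3 = {"epoch": 0, "zxid": 50, "sid": 1}
print(better_vote(v2, v3)["sid"])  # 1: a higher zxid beats a higher id
```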
23. This phase is finished once a majority (or quorum) of followers have
synchronized their state with the leader.
24. The sync operation forces the ZooKeeper server to which a
client is connected to "catch up" with the leader.
Answers to question 2:
1. Use sync.
2. Use a watcher.
3. Limit the application scenarios.
25. Watcher
1. ZooKeeper supports the concept of watches.
2. Clients can set a watch on a znode.
3. A watch is triggered and removed when the znode changes.
4. When a watch is triggered, the client receives a packet saying that the znode
has changed.
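Points 3 and 4 describe one-shot semantics: a watch fires once, on the next change, and is then gone. A toy model of that behavior (class and method names are ours, not the ZooKeeper API):

```python
class Znode:
    """Toy znode illustrating one-shot watches."""
    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def get_data(self, watcher=None):
        if watcher is not None:
            self._watchers.append(watcher)  # client sets a watch on a read
        return self.data

    def set_data(self, data):
        self.data = data
        fired, self._watchers = self._watchers, []  # removed once triggered
        for w in fired:
            w("NodeDataChanged")  # the client is notified of the change

events = []
node = Znode(b"v1")
node.get_data(watcher=events.append)
node.set_data(b"v2")  # the watch fires and is removed
node.set_data(b"v3")  # no watch is registered any more
print(events)         # ['NodeDataChanged'] -- exactly one notification
```

This is why a real client that wants continuous notifications must re-register the watch each time it is triggered.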
27. ZooKeeper Performance
It is especially high performance in applications where reads outnumber
writes, since writes involve synchronizing the state of all servers. (Reads
outnumbering writes is typically the case for a coordination service.)
28. Uses of ZooKeeper
1. Master election
/currentMaster/{sessionId}-1 ,
/currentMaster/{sessionId}-2 ,
/currentMaster/{sessionId}-3
EPHEMERAL_SEQUENTIAL node
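The recipe above works because ZooKeeper appends a monotonically increasing sequence number to each EPHEMERAL_SEQUENTIAL znode, and the candidate holding the lowest number is the master. A sketch of the selection step over hypothetical child names (helper name is ours):

```python
def elect_master(children):
    """Given the child names under /currentMaster, each of the form
    {sessionId}-{sequence}, the lowest sequence number wins."""
    return min(children, key=lambda name: int(name.rsplit("-", 1)[1]))

children = ["sessA-2", "sessB-1", "sessC-3"]
print(elect_master(children))  # sessB-1 holds the lowest sequence number

# Because the znodes are ephemeral, a crashed master's entry disappears
# with its session, and re-running the selection promotes the next one:
children.remove("sessB-1")
print(elect_master(children))  # sessA-2 becomes the new master
```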
29. 2. HBase uses ZooKeeper to:
Select a master.
Discover which master controls which servers.
Help the client to find its master.
30. Configuration Management (push)
1. Every server corresponds to a znode in ZooKeeper (Client1 P1, C2 P2,
…).
2. Multiple servers in one cluster may share one configuration.
3. When the configuration changes, they should receive a notification.
31. Cluster Management
1. When one machine dies, the other machines should receive a notification.
2. When one server dies, its znode is automatically removed (C1 P1, C2 P2, …).
3. When the master machine dies, how is a new master selected? Paxos!
33. Configuration
Each server in a ZooKeeper ensemble has a numeric
identifier that is unique within the ensemble, and must fall between 1 and
255.
From this we can see that an ensemble holds at most 255 servers.
A ZooKeeper service usually consists of three to seven machines. The
implementation supports more machines, but three to seven machines
provide more than enough performance and resilience.
35. initLimit is the amount of time to allow for followers to connect to
and sync with the leader. If a majority of followers fail to sync
within this period, then the leader renounces its leadership status
and another leader election takes place. If this happens often (and
you can discover if this is the case because it is logged), it is a sign
that the setting is too low. (10s)
syncLimit is the amount of time to allow a follower to sync with
the leader. If a follower fails to sync within this period, it will
restart itself. Clients that were attached to this follower will connect
to another one.(4s)
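Putting these timeouts together, a minimal three-server zoo.cfg might look like the following sketch (hostnames are placeholders; with tickTime=2000, the values initLimit=5 and syncLimit=2 yield the 10 s and 4 s figures quoted above):

```
tickTime=2000                       # one tick = 2 s
initLimit=5                         # 5 ticks = 10 s for followers to connect and sync
syncLimit=2                         # 2 ticks = 4 s for a follower to stay in sync
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```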
36. Servers listen on three ports:
2181 for client connections;
2888 for follower connections, if they are the leader;
3888 for other server connections during the leader
election phase.
37. FAQ
How do I size a ZooKeeper ensemble (cluster)?
In general when determining the number of ZooKeeper serving nodes to deploy (the size
of an ensemble) you need to think in terms of reliability, and not performance.
Reliability:
A single ZooKeeper server (standalone) is essentially a coordinator with no reliability (a
single serving node failure brings down the ZK service).
A 3 server ensemble (you need to jump to 3 and not 2 because ZK works based on
simple majority voting) allows for a single server to fail and the service will still be
available.
So if you want reliability go with at least 3. We typically recommend having 5 servers in
"online" production serving environments. This allows you to take 1 server out of service
(say planned maintenance) and still be able to sustain an unexpected outage of one of the
remaining servers w/o interruption of the service.
Performance:
Write performance actually decreases as you add ZK servers, while read performance
increases modestly: http://bit.ly/9JEUju
42. Summary
• Hadoop ZooKeeper:
an open-source implementation of Chubby.
• Data model:
a shared hierarchical namespace, similar to a standard file system.
• One-leader, multiple-followers architecture:
each follower has the same data model;
a Paxos-style algorithm is used to implement consistency.
• Watcher:
clients can monitor znode changes with watchers.
We are able to simplify the
two-phase commit protocol because we do not have aborts;
followers either acknowledge the leader's proposal or they
abandon the leader. The lack of aborts also means that we
can commit once a quorum of servers acknowledges the proposal rather
than waiting for all servers to respond. This simplified two-phase
commit by itself cannot handle leader failures, so we
add a recovery mode to handle leader failures.
If a Zab server comes online while a leader is actively broadcasting messages, the
server will start in recovery mode, discover and synchronize with the leader, and start
participating in the message broadcasts. (Zab)
For example, a Zab service made up of three servers where one is a leader and the two other servers are followers will move to broadcast mode. If one of the followers die, there will be no interruption in service since the leader will still have a quorum. If the follower recovers and the other dies, there will still be no service interruption.
When a server starts up it initiates an election, whose outcome depends on startup order; after the election it synchronizes its data with the leader. Once more than half of the servers have completed data synchronization, this phase ends.
Election algorithm: http://zookeeper.apache.org/doc/r3.2.2/zookeeperInternals.html#sc_leaderElection
http://rdc.taobao.com/blog/cs/?p=162
Source code: http://blog.sina.com.cn/s/blog_3fe961ae01012dkk.html
ZooKeeper has three classes for leader election: FastLeaderElection, LeaderElection, and AuthFastLeaderElection. FastLeaderElection is used by default. The FastLeaderElection algorithm is implemented by the FastLeaderElection class, in the org.apache.zookeeper.server.quorum package.
The core of ZooKeeper is atomic broadcast, which keeps the servers in sync. The protocol that implements this mechanism is called the Zab protocol. Zab has two modes: recovery mode and broadcast mode. When the service starts, or after the leader crashes, Zab enters recovery mode; recovery mode ends once a leader has been elected and a majority of servers have synchronized their state with the leader. State synchronization guarantees that the leader and the servers hold the same system state.
Once the leader has synchronized state with a majority of the followers, it can start broadcasting messages, i.e. it enters broadcast mode. When a server joins the ZooKeeper service at this point, it starts in recovery mode, discovers the leader, and synchronizes state with it. When synchronization finishes, it too takes part in message broadcasting. The ZooKeeper service stays in broadcast mode until the leader crashes or loses the support of a majority of its followers.
Broadcast mode closely resembles two-phase commit (2PC) in distributed transactions: the leader raises a proposal, the followers vote on it, and the leader tallies the votes to decide whether the proposal passes; if it passes, the transaction is executed, otherwise nothing is done.