2. 1. ZooKeeper architecture and features
2. ZooKeeper node roles
3. ZooKeeper configuration
4. ZooKeeper data model (introducing znodes, zxids, etc.)
5. ZooKeeper data reads and writes
6. Key mechanisms, including leader election, the roles of the log and
snapshots, why an ensemble should have an odd number of nodes, why a write
can only complete once more than half of the followers agree, and so on
3. What is ZooKeeper?
ZooKeeper is a distributed, open-source coordination service for distributed applications.
It exposes a simple set of primitives that distributed applications can build upon to
implement higher-level services for synchronization, configuration maintenance, and
groups and naming. It is designed to be easy to program against, and uses a data model
styled after the familiar directory-tree structure of file systems.
The motivation behind ZooKeeper is to relieve distributed applications of the
responsibility of implementing coordination services from scratch.
It is an open-source implementation of Chubby.
4. 12/11/14
Zookeeper architecture
ZooKeeper consists of multiple servers:
one leader and multiple followers.
High performance: it can be used in large, distributed systems.
Highly available: its reliability aspects keep it from being a single point of failure.
Strictly ordered access: sophisticated synchronization primitives can be implemented at the client.
5. The servers that make up the ZooKeeper service must all know about each other.
ZooKeeper uses a configuration file so that the servers know each other, and PING
messages are exchanged between followers and the leader to determine liveness.
Note: a PING here is a packet sent to a specified port.
6. ZooKeeper achieves high availability through replication, and can provide a
service as long as a majority of the machines in the ensemble are up.
For example, in a five-node ensemble, any two machines can fail and the
service will still work because a majority of three remain. Note that a six-node
ensemble can also tolerate only two machines failing, since with three
failures the remaining three do not constitute a majority of the six. For this
reason, it is usual to have an odd number of machines in an ensemble.
Another reason: with an even split, neither side can form a majority, so no value can be approved.
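The majority rule above reduces to a few lines of arithmetic. A small sketch (the function name is ours, not ZooKeeper's):

```python
def tolerable_failures(ensemble_size: int) -> int:
    """A ZooKeeper ensemble stays available while a strict majority of its
    servers is up, so it tolerates ensemble_size - majority failures."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority

# A five-node ensemble tolerates two failures (majority of 3 remains)...
print(tolerable_failures(5))  # 2
# ...and a six-node ensemble tolerates no more, because the extra server
# only raises the majority threshold from 3 to 4.
print(tolerable_failures(6))  # 2
```

This is exactly why even ensemble sizes buy no extra fault tolerance over the next-smaller odd size.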
7. Features
1. It is especially fast in "read-dominant" workloads.
2. ZooKeeper is replicated. Like the distributed processes it
coordinates, ZooKeeper itself is intended to be replicated over a set of
hosts called an ensemble.
3. Every update made to the znode tree is given a globally unique
identifier, called a zxid (which stands for "ZooKeeper transaction ID").
…………
8. Zookeeper Data Model
A shared hierarchical namespace, similar to a standard file system.
Each node in the tree is called a znode.
ZooKeeper is designed to store coordination data, so each znode is very small:
status information (version numbers for data changes, ACL changes, and
timestamps), configuration, and location information.
10. Znode data structure
czxid: the zxid of the change that caused this znode to be created.
mzxid: the zxid of the change that last modified this znode.
ctime: the time in milliseconds from epoch when this znode was created.
mtime: the time in milliseconds from epoch when this znode was last modified.
version: the number of changes to the data of this znode.
cversion: the number of changes to the children of this znode.
aversion: the number of changes to the ACL of this znode.
ephemeralOwner: the session id of the owner of this znode if the znode is an
ephemeral node; otherwise zero.
dataLength: the length of the data field of this znode. The maximum allowable
size of the data array is 1 MB.
numChildren: the number of children of this znode.
pzxid: the zxid of the change that last modified the children of this znode.
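The metadata fields above can be summarized as a plain record. An illustrative sketch only; the real client libraries expose their own Stat type, and the sample values below are made up:

```python
from dataclasses import dataclass

@dataclass
class ZnodeStat:
    """The znode metadata fields listed above."""
    czxid: int           # zxid of the create
    mzxid: int           # zxid of the last modification
    ctime: int           # creation time, ms since epoch
    mtime: int           # last-modified time, ms since epoch
    version: int         # number of data changes
    cversion: int        # number of child changes
    aversion: int        # number of ACL changes
    ephemeralOwner: int  # owning session id; 0 for persistent znodes
    dataLength: int      # bytes of data, at most 1 MB
    numChildren: int     # number of children
    pzxid: int           # zxid of the last change to this znode's children

# Hypothetical values for a persistent znode:
stat = ZnodeStat(czxid=1, mzxid=5, ctime=0, mtime=1000, version=3,
                 cversion=0, aversion=0, ephemeralOwner=0,
                 dataLength=12, numChildren=0, pzxid=1)
print(stat.ephemeralOwner)  # 0: not an ephemeral node
```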
11. ZooKeeper is replicated.
In theory, a client will see the same view of the system regardless of the server it
connects to.
Like the distributed processes it coordinates, ZooKeeper itself is intended to be
replicated over a set of hosts called an ensemble.
All of the servers hold the same data, which is guaranteed by a Fast Paxos-style
agreement algorithm.
12. Roles in ZooKeeper
• Leader: responsible for initiating and resolving the
final vote, and for applying the status update in the end.
Note: ZooKeeper can be configured so that
the leader does not accept client connections, by setting
zookeeper.leaderServes to "no".
• Follower: receives client requests and returns
results to the client. Participates in the votes
sponsored by the leader. The server synchronizes
with the leader and replicates its transactions.
13. • Observer: observers improve the read performance of the
cluster without affecting write performance: an observer only
serves read requests, and forwards write requests to the leader.
The problem is that as we add more voting members, the write
performance drops. This is because a write operation requires
the agreement of (in general) at least half the nodes in an ensemble, and
therefore the cost of a vote can increase significantly as more voters are
added.
peerType=observer
server.1:localhost:2181:3181:observer
detail: http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html
14. Read data from the connected server
Read requests are serviced from the local replica of each
server's database.
15. Write data: Paxos
• N senators make decisions on Paxos Island.
• Each proposal has an increasing PID.
• A proposal passes once more than half of the
senators approve it.
• Each senator only agrees to a proposal whose
PID is bigger than the current PID.
In ZooKeeper terms:
senator -> server
proposal -> znode change
PID -> ZooKeeper transaction id (zxid)
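The island analogy maps onto a toy simulation (class and function names are ours, not ZooKeeper's):

```python
class Senator:
    """Accepts only proposals whose PID is higher than any PID seen so far."""
    def __init__(self):
        self.current_pid = 0

    def vote(self, pid: int) -> bool:
        if pid > self.current_pid:
            self.current_pid = pid
            return True
        return False

def propose(senators, pid: int) -> bool:
    """A proposal passes once more than half of the senators agree."""
    yes_votes = sum(s.vote(pid) for s in senators)
    return yes_votes > len(senators) // 2

island = [Senator() for _ in range(5)]
print(propose(island, pid=1))  # True: all five accept a fresh PID
print(propose(island, pid=1))  # False: PID 1 is no longer greater than current
print(propose(island, pid=2))  # True: a higher PID is accepted again
```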
17. Write data: client to ZooKeeper
Write requests are processed by an agreement protocol:
a leader proposes a request, collects votes, and finally commits.
1. The client sends a write request to a server.
2. The server forwards the write request to the leader.
3. The leader sends a PROPOSAL message to all the followers (sent asynchronously).
4. Followers agree or deny (an ACK is sent by a follower after it has synced a proposal).
5. Commit.
6. The response is sent to the client.
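Steps 3 to 6 can be sketched as a quorum check. A toy model under our own names (real ZooKeeper servers exchange PROPOSAL/ACK/COMMIT messages over the network):

```python
def handle_write(servers, leader_index, value):
    """Toy write path: the leader proposes, live followers ACK, and the
    update commits once a quorum (leader included) agrees."""
    followers = [s for i, s in enumerate(servers) if i != leader_index]
    # Steps 3-4: send PROPOSAL, count ACKs; the leader counts itself.
    acks = 1 + sum(1 for f in followers if f["up"])
    if acks <= len(servers) // 2:
        return "rejected: no quorum"
    # Step 5: commit on every live server.
    for s in servers:
        if s["up"]:
            s["data"] = value
    return "ok"  # Step 6: response goes back to the client

ensemble = [{"up": True, "data": None} for _ in range(5)]
ensemble[3]["up"] = ensemble[4]["up"] = False  # two followers are down
print(handle_write(ensemble, leader_index=0, value="x"))  # ok: 3 of 5 agree
```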
18. Note:
All machines in the ensemble write updates to disk before updating their
in-memory copy of the znode tree.
Updates are logged to disk for recoverability, and writes are serialized to
disk before they are applied to the in-memory database.
http://zookeeper.apache.org/doc/r3.2.2/zookeeperOver.html
SyncRequestProcessor
ZkDatabase
On restart, ZkDatabase loads the database from disk into memory
at boot time.
This class maintains the in-memory database of ZooKeeper server state,
which includes the sessions, the data tree, and the committed logs.
It is booted up after reading the logs and snapshots from the disk.
19. Log and snapshot:
SyncRequestProcessor
When is a snapshot taken?
1. When the leader changes.
2. When a new server joins.
3. When logCount > (snapCount / 2 + randRoll).
Together with the logs, snapshots are used for recoverability.
Detail: http://rdc.taobao.com/team/jm/archives/947
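Condition 3 above can be written out directly. A sketch (function name is ours; 100,000 is the default value of zookeeper.snapCount):

```python
import random

SNAP_COUNT = 100_000  # default zookeeper.snapCount

def should_snapshot(log_count: int, rand_roll: int) -> bool:
    """A snapshot is taken once the number of logged transactions exceeds
    snapCount/2 plus a random offset. The random part staggers snapshots
    so the whole ensemble does not pause to snapshot at the same moment."""
    return log_count > SNAP_COUNT // 2 + rand_roll

rand_roll = random.randint(0, SNAP_COUNT // 2)  # rolled once per cycle
print(should_snapshot(10_000, rand_roll))   # False early in the cycle
```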
20. Questions
1. What happens when the leader crashes?
2. A follower may lag behind the leader,
so a client may read outdated data.
3. Why is agreement from more than half of the nodes enough for an update?
21. Leader Selection
Each server sends:
- the id of the leader it has selected
- its zxid
- its logic clock (initial value 0)
- its status: LOOKING, FOLLOWING, OBSERVING, or LEADING
Example with five servers starting one by one:
Step 1: Server1 starts; no response from the others, so it stays LOOKING.
Step 2: Server2 starts; Server2 would be leader, but fewer than half of the
servers agree, so it stays LOOKING.
Step 3: Server3 starts; Server3 is chosen as leader and more than half of the
servers agree, so it becomes LEADING.
Step 4: Server4 starts; there is a leader already, so it becomes FOLLOWING.
Step 5: Server5 starts; there is a leader already, so it becomes FOLLOWING.
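The rule each server uses to pick between two votes can be sketched as a lexicographic comparison. A simplified sketch of the FastLeaderElection ordering (field names are ours): a vote wins on a higher election epoch (logic clock), then a higher zxid (more up-to-date data), then a higher server id as the tie-breaker.

```python
def better_vote(a: dict, b: dict) -> dict:
    """Return the vote that should win: compare election epoch,
    then zxid, then server id, preferring the higher value."""
    key = lambda v: (v["epoch"], v["zxid"], v["sid"])
    return a if key(a) > key(b) else b

v1 = {"epoch": 0, "zxid": 40, "sid": 1}
v2 = {"epoch": 0, "zxid": 40, "sid": 3}
print(better_vote(v1, v2)["sid"])  # 3: same epoch and zxid, higher id wins

v3 = {"epoch": 0, "zxid": 50, "sid": 1}
print(better_vote(v2, v3)["sid"])  # 1: a higher zxid beats a higher id
```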
23. This phase is finished once a majority (or quorum) of followers have
synchronized their state with the leader.
24. The sync operation forces the ZooKeeper server to which a
client is connected to "catch up" with the leader.
Answers to question 2:
1. Use sync.
2. Use a watcher.
3. Limit the application scenarios.
25. Watcher
1. ZooKeeper supports the concept of watches.
2. Clients can set a watch on a znode.
3. A watch is triggered and removed when the znode changes.
4. When a watch is triggered, the client receives a packet saying that the znode
has changed.
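Points 3 and 4 describe one-shot semantics: a watch fires once, on the next change, and is then gone. A toy model of that behavior (class and method names are ours, not the ZooKeeper API):

```python
class Znode:
    """Toy znode illustrating one-shot watches."""
    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def get_data(self, watcher=None):
        if watcher is not None:
            self._watchers.append(watcher)  # client sets a watch on a read
        return self.data

    def set_data(self, data):
        self.data = data
        fired, self._watchers = self._watchers, []  # removed once triggered
        for w in fired:
            w("NodeDataChanged")  # the client is notified of the change

events = []
node = Znode(b"v1")
node.get_data(watcher=events.append)
node.set_data(b"v2")  # the watch fires and is removed
node.set_data(b"v3")  # no watch is registered any more
print(events)         # ['NodeDataChanged'] -- exactly one notification
```

This is why a real client that wants continuous notifications must re-register the watch each time it is triggered.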
27. ZooKeeper Performance
It is especially high performance in applications where reads outnumber
writes, since writes involve synchronizing the state of all servers. (Reads
outnumbering writes is typically the case for a coordination service.)
28. Uses of ZooKeeper
1. Master election
/currentMaster/{sessionId}-1 ,
/currentMaster/{sessionId}-2 ,
/currentMaster/{sessionId}-3
EPHEMERAL_SEQUENTIAL node
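The recipe above works because ZooKeeper appends a monotonically increasing sequence number to each EPHEMERAL_SEQUENTIAL znode, and the candidate holding the lowest number is the master. A sketch of the selection step over hypothetical child names (helper name is ours):

```python
def elect_master(children):
    """Given the child names under /currentMaster, each of the form
    {sessionId}-{sequence}, the lowest sequence number wins."""
    return min(children, key=lambda name: int(name.rsplit("-", 1)[1]))

children = ["sessA-2", "sessB-1", "sessC-3"]
print(elect_master(children))  # sessB-1 holds the lowest sequence number

# Because the znodes are ephemeral, a crashed master's entry disappears
# with its session, and re-running the selection promotes the next one:
children.remove("sessB-1")
print(elect_master(children))  # sessA-2 becomes the new master
```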
29. 2. HBase uses ZooKeeper to:
Select a master.
Discover which master controls which servers.
Help the client to find its master.
30. Configuration Management (push)
1. Every server corresponds to a znode in ZooKeeper (Client1 P1, C2 P2,
…).
2. Multiple servers in one cluster may share one configuration.
3. When the configuration changes, they should receive a notification.
31. Cluster Management
1. When one machine dies, the other machines should receive a notification.
2. When one server dies, its znode is automatically removed (C1 P1, C2 P2, …).
3. When the master machine dies, how is a new master selected? Paxos!
33. Configuration
Each server in a ZooKeeper ensemble has a numeric
identifier that is unique within the ensemble, and must fall between 1 and
255.
From this we can see that an ensemble holds at most 255 servers.
A ZooKeeper service usually consists of three to seven machines. The
implementation supports more machines, but three to seven machines
provide more than enough performance and resilience.
35. initLimit is the amount of time to allow for followers to connect to
and sync with the leader. If a majority of followers fail to sync
within this period, then the leader renounces its leadership status
and another leader election takes place. If this happens often (and
you can discover if this is the case because it is logged), it is a sign
that the setting is too low. (10s)
syncLimit is the amount of time to allow a follower to sync with
the leader. If a follower fails to sync within this period, it will
restart itself. Clients that were attached to this follower will connect
to another one.(4s)
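Putting these timeouts together, a minimal three-server zoo.cfg might look like the following sketch (hostnames are placeholders; with tickTime=2000, the values initLimit=5 and syncLimit=2 yield the 10 s and 4 s figures quoted above):

```
tickTime=2000                       # one tick = 2 s
initLimit=5                         # 5 ticks = 10 s for followers to connect and sync
syncLimit=2                         # 2 ticks = 4 s for a follower to stay in sync
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```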
36. Servers listen on three ports:
2181 for client connections;
2888 for follower connections, if they are the leader;
3888 for other server connections during the leader
election phase.
37. FAQ
How do I size a ZooKeeper ensemble (cluster)?
In general when determining the number of ZooKeeper serving nodes to deploy (the size
of an ensemble) you need to think in terms of reliability, and not performance.
Reliability:
A single ZooKeeper server (standalone) is essentially a coordinator with no reliability (a
single serving node failure brings down the ZK service).
A 3 server ensemble (you need to jump to 3 and not 2 because ZK works based on
simple majority voting) allows for a single server to fail and the service will still be
available.
So if you want reliability go with at least 3. We typically recommend having 5 servers in
"online" production serving environments. This allows you to take 1 server out of service
(say planned maintenance) and still be able to sustain an unexpected outage of one of the
remaining servers w/o interruption of the service.
Performance:
Write performance actually decreases as you add ZK servers, while read performance
increases modestly: http://bit.ly/9JEUju
42. Summary
• Hadoop ZooKeeper:
an open-source implementation of Chubby.
• Data model:
a shared hierarchical namespace, similar to a standard file system.
• One-leader, multiple-followers architecture:
each follower has the same data model;
a Paxos-style algorithm is used to implement consistency.
• Watcher:
clients can monitor znode changes with watchers.
We are able to simplify the
two-phase commit protocol because we do not have aborts;
followers either acknowledge the leader's proposal or they
abandon the leader. The lack of aborts also means that we
can commit once a quorum of servers acknowledges the proposal rather
than waiting for all servers to respond. This simplified two-phase
commit by itself cannot handle leader failures, so we
add a recovery mode to handle leader failures.
If a Zab server comes online while a leader is actively broadcasting messages, the
server will start in recovery mode, discover and synchronize with the leader, and start
participating in the message broadcasts. (Zab)
For example, a Zab service made up of three servers where one is a leader and the two other servers are followers will move to broadcast mode. If one of the followers die, there will be no interruption in service since the leader will still have a quorum. If the follower recovers and the other dies, there will still be no service interruption.
When a server starts up it initiates an election, whose outcome depends on startup order; after the election it synchronizes its data with the leader. Once more than half of the servers have completed data synchronization, this phase ends.
Election algorithm: http://zookeeper.apache.org/doc/r3.2.2/zookeeperInternals.html#sc_leaderElection
http://rdc.taobao.com/blog/cs/?p=162
Source code: http://blog.sina.com.cn/s/blog_3fe961ae01012dkk.html
ZooKeeper has three classes for leader election: FastLeaderElection, LeaderElection, and AuthFastLeaderElection. FastLeaderElection is used by default. The FastLeaderElection algorithm is implemented by the FastLeaderElection class, in the org.apache.zookeeper.server.quorum package.
The core of ZooKeeper is atomic broadcast, which keeps the servers in sync. The protocol that implements this mechanism is called the Zab protocol. Zab has two modes: recovery mode and broadcast mode. When the service starts, or after the leader crashes, Zab enters recovery mode; recovery mode ends once a leader has been elected and a majority of servers have synchronized their state with the leader. State synchronization guarantees that the leader and the servers hold the same system state.
Once the leader has synchronized state with a majority of the followers, it can start broadcasting messages, i.e. it enters broadcast mode. When a server joins the ZooKeeper service at this point, it starts in recovery mode, discovers the leader, and synchronizes state with it. When synchronization finishes, it too takes part in message broadcasting. The ZooKeeper service stays in broadcast mode until the leader crashes or loses the support of a majority of its followers.
Broadcast mode closely resembles two-phase commit (2PC) in distributed transactions: the leader raises a proposal, the followers vote on it, and the leader tallies the votes to decide whether the proposal passes; if it passes, the transaction is executed, otherwise nothing is done.