Presentation given at TriHUG (Triangle Hadoop User Group) on May 22, 2012. Gives a basic overview of Apache ZooKeeper as well as some common use cases, 3rd-party libraries, and "gotchas".
Demo code available at https://github.com/mumrah/trihug-zookeeper-demo
2. Who am I
● David Arthur
● Engineer at Lucid Imagination
● Hadoop user
● Python enthusiast
● Father
● Gardener
3. Play along!
Grab the source for this presentation at GitHub
github.com/mumrah/trihug-zookeeper-demo
You'll need Java, Ant, and bash.
4. Apache ZooKeeper
● Formerly a Hadoop sub-project
● ASF TLP (top level project) since Nov 2010
● 7 PMC members, 8 committers - most from
Yahoo! and Cloudera
● Ugly logo
5. One liner
"ZooKeeper allows distributed processes to
coordinate with each other through a shared
hierarchical name space of data registers"
- ZooKeeper wiki
6. Who uses it?
Everyone*
● Yahoo!
● HBase
● Solr
● LinkedIn (Kafka, Hedwig)
● Many more
* https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy
7. What is it good for?
● Configuration management - machines
bootstrap config from a centralized source,
facilitates simpler deployment/provisioning
● Naming service - like DNS, mappings of names
to addresses
● Distributed synchronization - locks, barriers,
queues
● Leader election - a common problem in
distributed coordination
● Centralized and highly reliable (simple) data
registry
8. Namespace (ZNodes)
parent : "foo"
|-- child1 : "bar"
|-- child2 : "spam"
`-- child3 : "eggs"
`-- grandchild1 : "42"
Every znode has data (given as byte[]) and can
optionally have children.
9. Sequential znode
Nodes created in "sequential" mode have a
10-digit, zero-padded, monotonically
increasing number appended to their name.
create("/demo/seq-", ..., ..., PERSISTENT_SEQUENTIAL) x4
/demo
|-- seq-0000000000
|-- seq-0000000001
|-- seq-0000000002
`-- seq-0000000003
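The numbering above can be sketched in Python (a toy illustration of the naming scheme only, not the real server logic, which tracks the counter in the parent znode):

```python
def sequential_name(prefix, counter):
    """Mimic the suffix ZooKeeper appends to a sequential znode:
    a 10-digit, zero-padded, monotonically increasing number."""
    return "%s%010d" % (prefix, counter)

# Four creates of "/demo/seq-" in sequential mode:
names = [sequential_name("seq-", i) for i in range(4)]
# names[0] == "seq-0000000000", names[3] == "seq-0000000003"
```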
10. Ephemeral znode
Nodes created in "ephemeral" mode will be
deleted when the originating client goes away.
create("/demo/foo", ..., ..., PERSISTENT);
create("/demo/bar", ..., ..., EPHEMERAL);
Connected:          Disconnected:
/demo               /demo
|-- foo             `-- foo
`-- bar
11. Simple API
Pretty much everything lives under the
ZooKeeper class
● create
● exists
● delete
● getData
● setData
● getChildren
12. Synchronicity
Synchronous and asynchronous versions of the API methods:
exists("/demo", null);
exists("/demo", null, new StatCallback() {
    @Override
    public void processResult(int rc,
                              String path,
                              Object ctx,
                              Stat stat) {
        ...
    }
}, null);
13. Watches
Watches are a one-shot callback mechanism
for changes in connection and znode state
● Client connects/disconnects
● ZNode data changes
● ZNode children change
14. Demo time!
For those playing along, you'll need to get
ZooKeeper running. Using the default port
(2181), run:
ant zk
Or specify a port like:
ant zk -Dzk.port=2181
15. Things to "watch" out for
● Watches are one-shot - if you want continuous
monitoring of a znode, you have to reset the
watch after each event
● Too many client watches on a single znode
create a "herd effect" - lots of clients get
notified at the same time, causing spikes
in load
● Potential for missing changes - events that
occur between a watch firing and the watch
being reset are never delivered
● All watches are executed in a single, separate
thread (be careful about synchronization)
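The one-shot behavior is easy to model in plain Python. This toy class (not the real client) shows why you must re-register inside your callback to keep watching:

```python
class OneShotWatches:
    """Toy model of ZooKeeper's one-shot watch semantics:
    a watcher fires at most once per registration."""
    def __init__(self):
        self.watchers = {}  # path -> list of callbacks

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def fire(self, path, event):
        # Deliver to the current watchers, then forget them (one-shot).
        for cb in self.watchers.pop(path, []):
            cb(event)

watches = OneShotWatches()
events = []

def on_change(event):
    events.append(event)
    # Re-register to keep watching -- without this line the
    # second fire below would go unnoticed.
    watches.watch("/demo", on_change)

watches.watch("/demo", on_change)
watches.fire("/demo", "changed-1")
watches.fire("/demo", "changed-2")   # seen only because we re-registered

missed = []
watches.watch("/other", missed.append)  # no re-registration
watches.fire("/other", "a")
watches.fire("/other", "b")             # dropped: watch already consumed
```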
16. Building blocks
● Hierarchical nodes
● Parent and leaf nodes can have data
● Two special types of nodes - ephemeral and
sequential
● Watch mechanism
● Consistency guarantees
○ Order of updates is maintained
○ Updates are atomic
○ Znodes are versioned for MVCC
○ Many more
17. The Fun Stuff
Recipes:
● Lock
● Barrier
● Queue
● Two-phase commit
● Leader election
● Group membership
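The leader election recipe in the list above boils down to one comparison: each candidate creates an ephemeral sequential znode, and the lowest sequence number wins. A sketch of just that selection step (candidate names are made up for illustration):

```python
def elect_leader(children):
    """Among candidates' ephemeral sequential znode names,
    the lowest sequence number is the leader."""
    return min(children, key=lambda name: int(name.rsplit("-", 1)[1]))

candidates = ["n-0000000002", "n-0000000000", "n-0000000001"]
leader = elect_leader(candidates)  # "n-0000000000"
```

If the leader's ephemeral node disappears, the remaining candidates simply re-run the same selection.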
18. Demo Time!
Group membership (i.e., the easy one)
Recipe:
● Members register a sequential ephemeral
node under the group node
● Everyone keeps a watch on the group node
for new children
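The watcher's job in this recipe is just to diff successive snapshots of the group node's children. A minimal sketch of that bookkeeping (the member names are hypothetical):

```python
def membership_change(before, after):
    """Given two snapshots of the group node's children
    (ephemeral sequential names), report joins and leaves."""
    joined = sorted(set(after) - set(before))
    left = sorted(set(before) - set(after))
    return joined, left

snapshot1 = ["member-0000000000", "member-0000000001"]
snapshot2 = ["member-0000000001", "member-0000000002"]  # 0 left, 2 joined
joined, left = membership_change(snapshot1, snapshot2)
```

Because the nodes are ephemeral, a member that crashes disappears from the child list automatically, so "leaves" need no explicit deregistration.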
19. Lots of boilerplate
● Synchronize the asynchronous connection
(e.g., block on a CountDownLatch until connected)
● Handling disconnects/reconnects
● Exception handling
● Ensuring paths exist (nothing like mkdir -p)
● Resetting watches
● Cleaning up
20. What happens?
● Everyone writes their own high level
wrapper/connection manager
○ ZooKeeperWrapper
○ ZooKeeperSession
○ (\w+)ZooKeeper
○ ZooKeeper(\w+)
21. Open Source, FTW!
Luckily, some smart people have open sourced
their ZooKeeper utilities/wrappers
● Netflix Curator - Netflix/curator
● LinkedIn - linkedin/linkedin-zookeeper
● Many others
22. Netflix Curator
● Handles the connection management
● Implements many recipes
○ leader election
○ locks, queues, and barriers
○ counters
○ path cache
● Bonus: service discovery implementation
(we use this)
23. Demo Time!
Group membership refactored with Curator
● EnsurePath is nice
● Robust connection management is
awesome
● Exceptions are more sane
24. Thoughts on Curator
i.e., my non-expert subjective opinions
● Good level of abstraction - doesn't do
anything "magical"
● Doesn't hide ZooKeeper
● Weird API design (builder soup)
● Extensive, well tested recipe support
● It works!
26. Use case: Solr 4.0
Used in "Solr cloud" mode for:
● Cluster management - what machines are
available and where are they located
● Leader election - used for picking a shard as
the "leader"
● Consolidated config storage
● Watches allow for very non-chatty steady-
state
● Herd effect not really an issue
27. Use case: Kafka
● LinkedIn's distributed pub/sub system
● Queues are persistent
● Clients request a slice of a queue (offset,
length)
● Brokers are registered in ZooKeeper, clients
load balance requests among live brokers
● Client state (last consumed offset) is stored
in ZooKeeper
● Client rebalancing algorithm, similar to
leader election
28. Use case:
LucidWorks Big Data
● We use Curator's service discovery to
register REST services
● Nice for SOA
● Took 1 dev (me) 1 day to get something
functional (mostly reading Curator docs)
● So far, so good!
29. Review of "gotchas"
● Watch execution is single threaded and synchronized
● Can't reliably get every change for a znode
● Excessive watchers on the same znode (herd effect)
Some new ones
● GC pauses: if your application is prone to long GC
pauses, make sure your session timeout is sufficiently
long
● Catch-all watches: if you use one Watcher for
everything, it can be tedious to infer exactly what
happened
30. Four letter words
The ZooKeeper server responds to a few "four
letter word" commands via TCP or Telnet*
> echo ruok | nc localhost 2181
imok
I'm glad you're OK, ZooKeeper - really I am.
* http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_zkCommands
31. Quorums
In a multi-node deployment (aka, ZooKeeper
Quorum), it is best to use an odd number of
machines.
ZooKeeper uses majority voting, so it can
tolerate ceil(N/2)-1 machine failures and
still function properly.
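That formula is worth checking numerically, since it also explains the odd-number advice:

```python
import math

def tolerated_failures(n):
    """A quorum needs a strict majority of the ensemble alive,
    so N nodes survive ceil(N/2) - 1 failures."""
    return math.ceil(n / 2) - 1

# 5 and 6 nodes both tolerate 2 failures, so the 6th machine
# buys no extra fault tolerance -- hence odd-sized ensembles.
```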
32. Multi-tenancy
ZooKeeper supports "chroot" at the session level. You can
add a path to the connection string that will be implicitly
prefixed to everything you do:
new ZooKeeper("localhost:2181/my/app");
Curator also supports this, but at the application level:
CuratorFrameworkFactory.builder()
.namespace("/my/app");
33. Python client
Dumb wrapper around C client, not very
Pythonic
import zookeeper
zk_handle = zookeeper.init("localhost:2181")
zookeeper.exists(zk_handle, "/demo")
zookeeper.get_children(zk_handle, "/demo")
Stuff in contrib didn't work for me, I used a
statically linked version: zc-zookeeper-static
34. Other clients
Included in ZooKeeper under src/contrib:
● C (this is what the Python client uses)
● Perl (again, using the C client)
● REST (JAX-RS via Jersey)
● FUSE? (strange)
3rd-party client implementations:
● Scala, courtesy of Twitter
● Several others
35. Overview
● Basics of ZooKeeper (znode types, watches)
● High-level recipes (group membership, et
al.)
● Lots of boilerplate for basic functionality
● 3rd party helpers (Curator, et al.)
● Gotchas and other miscellany