Kraken is a P2P Docker image distribution system. It is loosely based on the BitTorrent protocol, is fully compatible with the Docker registry API, and supports pluggable storage backends such as S3 and HDFS. It solved the scaling problems we saw across a range of scenarios and greatly sped up container deployment.
2. Agenda
● History of the Docker registry at Uber
● Evolution of a P2P solution
● Kraken architecture
● Performance
● Optimizations
3. Docker Registry at Uber 2015
● 400 services, hundreds of compute hosts
● Static placement
● One registry host in each zone
● Local filesystem storage
○ No deletion
● Periodically sync across zones
4. Docker Registry at Uber 2017
● 3000+ services, thousands of compute hosts, multiple zones
● Static placement → Mesos
● 3-5 registry hosts in each zone
○ Sharded by image names
● Local filesystem storage
○ Custom image GC tool
● Fronted by 3-10 nginx cache hosts
● Async replication with 30s delay
7. Problems
● Bandwidth and Disk IO limit
○ Image size p50 ~ 1G
○ 10 - 25Gbps NIC limit on registry and cache machines
○ 1000s of concurrent requests for each image
○ Projected to grow >10x per year (see the back-of-envelope after this list)
● Replication within and across zones
○ More expensive and complex as Uber adds more zones
● Storage management
○ Maintenance cost of the in-house image GC solution
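A rough back-of-envelope using the numbers above shows why this breaks: serving a p50 1GB image to 1000 hosts moves ~8,000 Gbit out of the registry tier, and even a single host saturating a 25Gbps NIC needs ~320 seconds for that, already past Docker's 5-minute default pull timeout, before any of the projected 10x growth.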
8. Ideas
● Drastically reduce image size
● Deploy one more layer of cache servers
● Explore Docker registry storage driver options
○ Ceph
○ HDFS
○ P2P?
■ Same blobs being downloaded at the same time
9. Similarities
Docker image / Docker registry
● Immutable blobs
○ Content addressable (digest sketch after this slide)
● Image manifest
● Tag resolution and manifest distribution are decoupled from layer distribution
BitTorrent
● Immutable blobs
○ Identified by infohash (piece hashes)
● Torrent file
● Torrent file lookup and distribution are decoupled from the P2P protocol
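Content addressability is the core shared property: a layer's identity is just the SHA-256 digest of its bytes, much as a torrent is identified by the hash of its metainfo. A minimal Go sketch of computing a registry-style digest (the file path is illustrative):

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// digest streams a blob through SHA-256 and returns the
// "sha256:<hex>" form used by the Docker registry API.
func digest(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("sha256:%x", h.Sum(nil)), nil
}

func main() {
	d, err := digest("layer.tar.gz") // illustrative path
	if err != nil {
		panic(err)
	}
	fmt.Println(d)
}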
10. Differences
Docker image / Docker registry
● Needs to handle bursty load with a deadline (5 min default timeout)
● Client behaviors are controlled and reliable
BitTorrent
● Prioritizes preserving complete copies in the network
● Defends against selfish or unreliable peers
11. POC
● Model each layer as a torrent
○ Each layer is divided into 4MB pieces (metainfo sketch after this slide)
● Registry agent
○ Uses Docker registry code, keeps all APIs
○ New storage driver backed by a 3rd-party P2P library
● Tracker
○ Peer store
○ Tag → metainfo(s) store
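"Model each layer as a torrent" boils down to splitting the blob into fixed 4MB pieces and hashing each one into a torrent-style metainfo. A Go sketch under that assumption; the Metainfo type is illustrative, not Kraken's actual struct:

package main

import (
	"crypto/sha1"
	"fmt"
	"io"
	"os"
)

const pieceSize = 4 * 1024 * 1024 // 4MB pieces, as in the POC

// Metainfo is an illustrative stand-in for a torrent-style descriptor.
type Metainfo struct {
	Length      int64
	PieceHashes [][]byte
}

func buildMetainfo(path string) (*Metainfo, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	mi := &Metainfo{}
	buf := make([]byte, pieceSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sum := sha1.Sum(buf[:n]) // BitTorrent uses SHA-1 piece hashes
			mi.PieceHashes = append(mi.PieceHashes, sum[:])
			mi.Length += int64(n)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // last (possibly partial) piece consumed
		}
		if err != nil {
			return nil, err
		}
	}
	return mi, nil
}

func main() {
	mi, err := buildMetainfo("layer.tar.gz") // illustrative path
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes in %d pieces\n", mi.Length, len(mi.PieceHashes))
}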
13. POC
● Docker pull = a series of requests from the local docker daemon
● Resolve tag to metainfo of layers first
14. POC
● Announce to the tracker for each layer, get a list of peers (announce sketch after this slide)
● Hold the connection from the local docker daemon open
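The announce step is a simple HTTP exchange with the tracker. A Go sketch; the /announce endpoint and JSON response shape here are hypothetical, for illustration only:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// announce registers this peer for a layer's infohash and returns
// the peers the tracker knows about. URL and JSON shape are
// hypothetical, not Kraken's actual wire format.
func announce(tracker, infohash, peerAddr string) ([]string, error) {
	url := fmt.Sprintf("http://%s/announce?infohash=%s&peer=%s",
		tracker, infohash, peerAddr)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var peers []string
	if err := json.NewDecoder(resp.Body).Decode(&peers); err != nil {
		return nil, err
	}
	return peers, nil
}

func main() {
	peers, err := announce("tracker:8080", "abc123", "10.0.0.5:7001")
	if err != nil {
		panic(err)
	}
	fmt.Println("peers:", peers)
}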
15. POC
● Locate each other and the seeder through the tracker
● ???
● Download succeeds
16. Production Considerations
In-house library, optimized for data-center-internal usage
● Peer connection
○ Central decision vs local decisions
○ Topology
■ Tree
● Rack-aware?
■ Graph
● Piece selection
○ Central decision vs local decisions
○ Selection algorithm
○ Piece size
18. Piece Selection
Random
● Easy to implement (selection sketch after this slide)
Rarest first
● “Rarest First and Choke Algorithms Are Enough” (Legout et al.)
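To make the trade-off concrete, both policies fit in a few lines over a per-piece availability count. A Go sketch, not Kraken's implementation:

package main

import (
	"fmt"
	"math/rand"
)

// pickRandom picks any piece we still need, uniformly at random.
func pickRandom(needed []int) int {
	return needed[rand.Intn(len(needed))]
}

// pickRarestFirst picks the needed piece with the fewest copies
// among connected neighbors (availability[piece] = copy count).
func pickRarestFirst(needed []int, availability map[int]int) int {
	best, bestCount := needed[0], availability[needed[0]]
	for _, p := range needed[1:] {
		if availability[p] < bestCount {
			best, bestCount = p, availability[p]
		}
	}
	return best
}

func main() {
	needed := []int{0, 2, 5}
	avail := map[int]int{0: 3, 2: 1, 5: 4}
	fmt.Println("random:", pickRandom(needed))
	fmt.Println("rarest:", pickRarestFirst(needed, avail)) // piece 2
}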
19. Piece Selection
Smaller piece size
● Faster downloads
Bigger piece size
● Less communication overhead
● Required if piece selection is decided by a central component
20. Peer Connection
Central decision
● Debuggability
● Easier to shut down to avoid disasters
● Easier to apply optimizations and migrations
Local decisions
● Scalability
● Still need a few well known nodes
22. Peer Connection
Optimal graph
● Regular graph
● <= log(m*n) ramp-up time to place the initial pieces (copies of each piece can roughly double every round, so ramp-up is logarithmic)
● All nodes upload/download at the max speed, if piece selection is also optimal
● Need to manage each piece; hard to scale
23. Peer Connection
Random k-regular graph
● k-connected => k paths to seeders
● Diameter ~ log(n) => close to seeders (simulation sketch after this slide)
● Every peer downloads at > 75% of max speed with random piece selection
● Hard to keep it k-regular
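A quick simulation illustrates the diameter ~ log(n) claim: give each of n peers k random neighbors and measure the farthest hop distance from a seeder with BFS. A sketch under those assumptions, not a proof:

package main

import (
	"fmt"
	"math/rand"
)

// buildGraph gives each of n peers k random undirected neighbors,
// approximating the tracker handing out random peer lists.
func buildGraph(n, k int) [][]int {
	adj := make([][]int, n)
	for u := 0; u < n; u++ {
		for len(adj[u]) < k {
			v := rand.Intn(n)
			if v != u {
				adj[u] = append(adj[u], v)
				adj[v] = append(adj[v], u)
			}
		}
	}
	return adj
}

// maxHops runs BFS from the seeder (node 0) and returns the
// farthest reachable peer's hop count.
func maxHops(adj [][]int) int {
	dist := make([]int, len(adj))
	for i := range dist {
		dist[i] = -1
	}
	dist[0] = 0
	queue, far := []int{0}, 0
	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		for _, v := range adj[u] {
			if dist[v] == -1 {
				dist[v] = dist[u] + 1
				if dist[v] > far {
					far = dist[v]
				}
				queue = append(queue, v)
			}
		}
	}
	return far
}

func main() {
	for _, n := range []int{100, 1000, 10000} {
		fmt.Printf("n=%5d k=10 max hops=%d\n", n, maxHops(buildGraph(n, 10)))
	}
}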
24. Decisions
● Peer connection
○ Central decision by the tracker (mostly), random selection
■ Tracker returns 100 random completed peers, dedicated seeders, and incomplete peers
■ Peer iterates through the 100 until it has 10 connections (connection sketch after this slide)
● Piece selection
○ Local decision
○ Random selection
■ Evaluate rarest first later
○ 4MB piece size
■ Configurable, evaluate other choices later
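The agent-side connection logic above is a shuffle-and-dial loop: try the ~100 tracker-provided candidates in random order until 10 connections are open. A Go sketch with illustrative names:

package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

const maxConns = 10 // target connection count from the decision above

// connect dials tracker-provided candidates in random order until
// maxConns connections are established or candidates run out.
func connect(candidates []string) []net.Conn {
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	var conns []net.Conn
	for _, addr := range candidates {
		if len(conns) >= maxConns {
			break
		}
		c, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			continue // skip unreachable peers, try the next one
		}
		conns = append(conns, c)
	}
	return conns
}

func main() {
	peers := []string{"10.0.0.1:7001", "10.0.0.2:7001"} // normally ~100 from the tracker
	fmt.Println("connected to", len(connect(peers)), "peers")
}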
26. Kraken Architecture
Kraken core (cont’d)
● Agent
○ Implement registry interface
○ On every host
● Origin
○ Dedicated seeders
○ Pluggable storage backend
○ Self-healing hash ring (hash-ring sketch after this slide)
○ Ephemeral
● Tracker
○ Metainfo and peers
○ Self-healing hash ring (WIP)
○ Ephemeral
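One way to build a self-healing hash ring over ephemeral origins is rendezvous (highest-random-weight) hashing: every blob digest scores every live origin and the top-k scorers own the blob, so removing a dead origin only remaps its own blobs. A sketch of that idea, not necessarily Kraken's exact scheme:

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// score mixes a blob digest with an origin name; higher wins.
func score(digest, origin string) uint64 {
	h := sha256.Sum256([]byte(digest + "|" + origin))
	return binary.BigEndian.Uint64(h[:8])
}

// ownersOf returns the k live origins with the highest scores for
// the digest. Dropping a dead origin only moves its own blobs.
func ownersOf(digest string, origins []string, k int) []string {
	sorted := append([]string(nil), origins...)
	sort.Slice(sorted, func(i, j int) bool {
		return score(digest, sorted[i]) > score(digest, sorted[j])
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	origins := []string{"origin1", "origin2", "origin3", "origin4"}
	fmt.Println(ownersOf("sha256:9f86d0", origins, 2))
}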
27. Kraken Architecture
Kraken index
● Zone local
● Resolves human-readable tags
● Handles async replication to other clusters (replication sketch after this slide)
● k copies with staggered delay
● No consistency guarantee => no need for consensus protocols
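"k copies with staggered delay" can be as simple as launching one async replication per destination cluster, each offset a bit later than the last, so a bad image can be caught before it reaches every zone. A Go sketch; the destinations, delay, and replicate stand-in are illustrative:

package main

import (
	"fmt"
	"sync"
	"time"
)

// replicateStaggered kicks off one async replication per destination,
// each delayed a further `step` beyond the previous one.
func replicateStaggered(tag string, dests []string, step time.Duration) {
	var wg sync.WaitGroup
	for i, d := range dests {
		wg.Add(1)
		go func(dest string, delay time.Duration) {
			defer wg.Done()
			time.Sleep(delay)
			fmt.Printf("replicating %s to %s\n", tag, dest) // stand-in for the real push
		}(d, time.Duration(i)*step)
	}
	wg.Wait()
}

func main() {
	// A short step for the demo; a real deployment would use a much larger one.
	replicateStaggered("svc:v42", []string{"zoneB", "zoneC", "zoneD"}, time.Second)
}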
29. Download a 100MB blob onto 100 hosts in under 3 seconds
[Animation: origin shown in blue; peers in grey, downloading peers in yellow, completed peers in green.]
30. Performance in Test
Setup
● 3GB image with 2 layers
● 2600 hosts (5200 downloads)
● 300Mbps speed limit
Result
● P50 10s (at speed limit)
● P99 20s
● Max 32s
31. Performance in Production
Blobs distributed per day in busiest zone:
● 500k 0-100MB blobs
● 400k 100MB-1G blobs
● 3k 1G+ blobs
Peak
● 20k 100MB-1G blobs within 30 sec
32. Optimizations
● Low connection limit, aggressive disconnect
○ Less overhead
○ Less likely to have complete graphs
● Pipelining
○ Maintain a request queue of size n for each connection (pipelining sketch after this list)
● Endgame mode
○ For the last few pieces, request from all connected neighbors
● TTI/TTL-based deletion
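The pipelining bullet maps naturally onto a bounded in-flight window: a buffered channel of size n lets the next request go out before the previous piece has fully arrived. A Go sketch with loopback stand-ins for the peer connection:

package main

import "fmt"

const queueDepth = 4 // n outstanding requests per connection

// fetchPipelined keeps up to queueDepth piece requests in flight on a
// single connection instead of strict request/response lockstep.
func fetchPipelined(pieces []int, request func(int), receive func() int) {
	window := make(chan struct{}, queueDepth) // in-flight slots
	done := make(chan struct{})
	go func() { // receiver drains completed pieces, freeing slots
		for range pieces {
			p := receive()
			<-window
			fmt.Println("got piece", p)
		}
		close(done)
	}()
	for _, p := range pieces {
		window <- struct{}{} // blocks once queueDepth requests are outstanding
		request(p)
	}
	<-done
}

func main() {
	// Loopback stand-ins for a real peer connection.
	wire := make(chan int, 64)
	fetchPipelined([]int{0, 1, 2, 3, 4, 5},
		func(p int) { wire <- p },
		func() int { return <-wire },
	)
}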
33. Unsuccessful Optimizations
● Prefer peers on the same rack
○ Reduced download speed by half
● Reject incoming requests based on the number of mutual connections
○ Intended to avoid highly-connected subgraphs, but doesn't work against bipartite graphs
○ Haven’t seen issues caused by graph density problems
● Rarest first piece selection
○ All peers decided to download the same piece at the same time, negatively impacted speed
34. Takeaways
● Docker images are just tar files
● P2P solutions can work within data centers
● Randomization works
● Get something working before optimizing
35. Future Plan
● Open source
● Tighter integration with Mesos agent
● Other use cases
● Debuggability