Kraken is a P2P Docker image distribution system. It is loosely based on the BitTorrent protocol, is fully compatible with the Docker registry API, and supports pluggable storage backends such as S3 and HDFS. It solved the scaling problems we saw across a range of scenarios and greatly sped up container deployment.
2. Agenda
● History of the Docker registry at Uber
● Evolution of a P2P solution
● Kraken architecture
● Performance
● Optimizations
3. Docker Registry at Uber 2015
● 400 services, hundreds of compute hosts
● Static placement
● One registry host in each zone
● Local filesystem storage
○ No deletion
● Periodically sync across zones
4. Docker Registry at Uber 2017
● 3000+ services, thousands of compute hosts, multiple zones
● Static placement → Mesos
● 3-5 registry hosts in each zone
○ Sharded by image names
● Local filesystem storage
○ Custom image GC tool
● Fronted by 3-10 nginx cache hosts
● Async replication with 30s delay
7. Problems
● Bandwidth and Disk IO limit
○ Image size p50 ~ 1G
○ 10 - 25Gbps NIC limit on registry and cache machines
○ 1000s of concurrent requests for each image
○ Projected to grow >10x per year (see the back-of-envelope after this list)
● Replication within and across zones
○ More expensive and complex as Uber adds more zones
● Storage management
○ Maintenance cost of the in-house image GC solution
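A rough back-of-envelope using the numbers above shows why this breaks: serving a p50 1GB image to 1000 hosts moves ~8,000 Gbit out of the registry tier, and even a single host saturating a 25Gbps NIC needs ~320 seconds for that, already past Docker's 5-minute default pull timeout, before any of the projected 10x growth.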
8. Ideas
● Drastically reduce image size
● Deploy one more layer of cache servers
● Explore Docker registry storage driver options
○ Ceph
○ HDFS
○ P2P?
■ Same blobs being downloaded at the same time
9. Similarities
Docker image / Docker registry
● Immutable blobs
○ Content addressable (digest sketch after this slide)
● Image manifest
● Tag resolution and manifest distribution are decoupled from layer distribution
BitTorrent
● Immutable blobs
○ Identified by infohash (piece hashes)
● Torrent file
● Torrent file lookup and distribution are decoupled from the P2P protocol
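Content addressability is the core shared property: a layer's identity is just the SHA-256 digest of its bytes, much as a torrent is identified by the hash of its metainfo. A minimal Go sketch of computing a registry-style digest (the file path is illustrative):

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// digest streams a blob through SHA-256 and returns the
// "sha256:<hex>" form used by the Docker registry API.
func digest(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("sha256:%x", h.Sum(nil)), nil
}

func main() {
	d, err := digest("layer.tar.gz") // illustrative path
	if err != nil {
		panic(err)
	}
	fmt.Println(d)
}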
10. Differences
Docker image / Docker registry
● Needs to handle bursty load with a deadline (5 min default timeout)
● Client behaviors are controlled and reliable
BitTorrent
● Prioritizes preserving complete copies in the network
● Defends against selfish or unreliable peers
11. POC
● Model each layer as a torrent
○ Each layer is divided into 4MB pieces (metainfo sketch after this slide)
● Registry agent
○ Uses Docker registry code, keeps all APIs
○ New storage driver backed by a 3rd-party P2P library
● Tracker
○ Peer store
○ Tag → metainfo(s) store
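"Model each layer as a torrent" boils down to splitting the blob into fixed 4MB pieces and hashing each one into a torrent-style metainfo. A Go sketch under that assumption; the Metainfo type is illustrative, not Kraken's actual struct:

package main

import (
	"crypto/sha1"
	"fmt"
	"io"
	"os"
)

const pieceSize = 4 * 1024 * 1024 // 4MB pieces, as in the POC

// Metainfo is an illustrative stand-in for a torrent-style descriptor.
type Metainfo struct {
	Length      int64
	PieceHashes [][]byte
}

func buildMetainfo(path string) (*Metainfo, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	mi := &Metainfo{}
	buf := make([]byte, pieceSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sum := sha1.Sum(buf[:n]) // BitTorrent uses SHA-1 piece hashes
			mi.PieceHashes = append(mi.PieceHashes, sum[:])
			mi.Length += int64(n)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // last (possibly partial) piece consumed
		}
		if err != nil {
			return nil, err
		}
	}
	return mi, nil
}

func main() {
	mi, err := buildMetainfo("layer.tar.gz") // illustrative path
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes in %d pieces\n", mi.Length, len(mi.PieceHashes))
}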
13. POC
● Docker pull = a series of requests from the local docker daemon
● Resolve tag to metainfo of layers first
14. POC
● Announce to the tracker for each layer, get a list of peers (announce sketch after this slide)
● Hold the connection from the local docker daemon open
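The announce step is a simple HTTP exchange with the tracker. A Go sketch; the /announce endpoint and JSON response shape here are hypothetical, for illustration only:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// announce registers this peer for a layer's infohash and returns
// the peers the tracker knows about. URL and JSON shape are
// hypothetical, not Kraken's actual wire format.
func announce(tracker, infohash, peerAddr string) ([]string, error) {
	url := fmt.Sprintf("http://%s/announce?infohash=%s&peer=%s",
		tracker, infohash, peerAddr)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var peers []string
	if err := json.NewDecoder(resp.Body).Decode(&peers); err != nil {
		return nil, err
	}
	return peers, nil
}

func main() {
	peers, err := announce("tracker:8080", "abc123", "10.0.0.5:7001")
	if err != nil {
		panic(err)
	}
	fmt.Println("peers:", peers)
}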
15. POC
● Locate each other and the seeder through the tracker
● ???
● Download succeeds
16. Production Considerations
In-house library, optimized for data-center-internal usage
● Peer connection
○ Central decision vs local decisions
○ Topology
■ Tree
● Rack-aware?
■ Graph
● Piece selection
○ Central decision vs local decisions
○ Selection algorithm
○ Piece size
18. Piece Selection
Random
● Easy to implement (selection sketch after this slide)
Rarest first
● “Rarest First and Choke Algorithms Are Enough” (Legout et al.)
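To make the trade-off concrete, both policies fit in a few lines over a per-piece availability count. A Go sketch, not Kraken's implementation:

package main

import (
	"fmt"
	"math/rand"
)

// pickRandom picks any piece we still need, uniformly at random.
func pickRandom(needed []int) int {
	return needed[rand.Intn(len(needed))]
}

// pickRarestFirst picks the needed piece with the fewest copies
// among connected neighbors (availability[piece] = copy count).
func pickRarestFirst(needed []int, availability map[int]int) int {
	best, bestCount := needed[0], availability[needed[0]]
	for _, p := range needed[1:] {
		if availability[p] < bestCount {
			best, bestCount = p, availability[p]
		}
	}
	return best
}

func main() {
	needed := []int{0, 2, 5}
	avail := map[int]int{0: 3, 2: 1, 5: 4}
	fmt.Println("random:", pickRandom(needed))
	fmt.Println("rarest:", pickRarestFirst(needed, avail)) // piece 2
}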
19. Piece Selection
Smaller piece size
● Faster downloads
Bigger piece size
● Less communication overhead
● Required if piece selection is decided by a central component
20. Peer Connection
Central decision
● Debuggability
● Easier to shut down to avoid disasters
● Easier to apply optimizations and migrations
Local decisions
● Scalability
● Still need a few well known nodes
22. Peer Connection
Optimal graph
● Regular graph
● <= log(m*n) ramp-up time to place the initial pieces (copies of each piece can roughly double every round, so ramp-up is logarithmic)
● All nodes upload/download at the max speed, if piece selection is also optimal
● Need to manage each piece; hard to scale
23. Peer Connection
Random k-regular graph
● k-connected => k paths to seeders
● Diameter ~ log(n) => close to seeders (simulation sketch after this slide)
● Every peer downloads at > 75% of max speed with random piece selection
● Hard to keep it k-regular
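A quick simulation illustrates the diameter ~ log(n) claim: give each of n peers k random neighbors and measure the farthest hop distance from a seeder with BFS. A sketch under those assumptions, not a proof:

package main

import (
	"fmt"
	"math/rand"
)

// buildGraph gives each of n peers k random undirected neighbors,
// approximating the tracker handing out random peer lists.
func buildGraph(n, k int) [][]int {
	adj := make([][]int, n)
	for u := 0; u < n; u++ {
		for len(adj[u]) < k {
			v := rand.Intn(n)
			if v != u {
				adj[u] = append(adj[u], v)
				adj[v] = append(adj[v], u)
			}
		}
	}
	return adj
}

// maxHops runs BFS from the seeder (node 0) and returns the
// farthest reachable peer's hop count.
func maxHops(adj [][]int) int {
	dist := make([]int, len(adj))
	for i := range dist {
		dist[i] = -1
	}
	dist[0] = 0
	queue, far := []int{0}, 0
	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		for _, v := range adj[u] {
			if dist[v] == -1 {
				dist[v] = dist[u] + 1
				if dist[v] > far {
					far = dist[v]
				}
				queue = append(queue, v)
			}
		}
	}
	return far
}

func main() {
	for _, n := range []int{100, 1000, 10000} {
		fmt.Printf("n=%5d k=10 max hops=%d\n", n, maxHops(buildGraph(n, 10)))
	}
}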
24. Decisions
● Peer connection
○ Central decision by the tracker (mostly), random selection
■ Tracker returns 100 random completed peers, dedicated seeders, and incomplete peers
■ Peer iterates through the 100 until it has 10 connections (connection sketch after this slide)
● Piece selection
○ Local decision
○ Random selection
■ Evaluate rarest first later
○ 4MB piece size
■ Configurable, evaluate other choices later
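The agent-side connection logic above is a shuffle-and-dial loop: try the ~100 tracker-provided candidates in random order until 10 connections are open. A Go sketch with illustrative names:

package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

const maxConns = 10 // target connection count from the decision above

// connect dials tracker-provided candidates in random order until
// maxConns connections are established or candidates run out.
func connect(candidates []string) []net.Conn {
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	var conns []net.Conn
	for _, addr := range candidates {
		if len(conns) >= maxConns {
			break
		}
		c, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			continue // skip unreachable peers, try the next one
		}
		conns = append(conns, c)
	}
	return conns
}

func main() {
	peers := []string{"10.0.0.1:7001", "10.0.0.2:7001"} // normally ~100 from the tracker
	fmt.Println("connected to", len(connect(peers)), "peers")
}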
26. Kraken Architecture
Kraken core (cont’d)
● Agent
○ Implement registry interface
○ On every host
● Origin
○ Dedicated seeders
○ Pluggable storage backend
○ Self-healing hash ring (hash-ring sketch after this slide)
○ Ephemeral
● Tracker
○ Metainfo and peers
○ Self-healing hash ring (WIP)
○ Ephemeral
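One way to build a self-healing hash ring over ephemeral origins is rendezvous (highest-random-weight) hashing: every blob digest scores every live origin and the top-k scorers own the blob, so removing a dead origin only remaps its own blobs. A sketch of that idea, not necessarily Kraken's exact scheme:

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// score mixes a blob digest with an origin name; higher wins.
func score(digest, origin string) uint64 {
	h := sha256.Sum256([]byte(digest + "|" + origin))
	return binary.BigEndian.Uint64(h[:8])
}

// ownersOf returns the k live origins with the highest scores for
// the digest. Dropping a dead origin only moves its own blobs.
func ownersOf(digest string, origins []string, k int) []string {
	sorted := append([]string(nil), origins...)
	sort.Slice(sorted, func(i, j int) bool {
		return score(digest, sorted[i]) > score(digest, sorted[j])
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	origins := []string{"origin1", "origin2", "origin3", "origin4"}
	fmt.Println(ownersOf("sha256:9f86d0", origins, 2))
}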
27. Kraken Architecture
Kraken index
● Zone local
● Resolves human-readable tags
● Handles async replication to other clusters (replication sketch after this slide)
● k copies with staggered delay
● No consistency guarantee => no need for consensus protocols
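"k copies with staggered delay" can be as simple as launching one async replication per destination cluster, each offset a bit later than the last, so a bad image can be caught before it reaches every zone. A Go sketch; the destinations, delay, and replicate stand-in are illustrative:

package main

import (
	"fmt"
	"sync"
	"time"
)

// replicateStaggered kicks off one async replication per destination,
// each delayed a further `step` beyond the previous one.
func replicateStaggered(tag string, dests []string, step time.Duration) {
	var wg sync.WaitGroup
	for i, d := range dests {
		wg.Add(1)
		go func(dest string, delay time.Duration) {
			defer wg.Done()
			time.Sleep(delay)
			fmt.Printf("replicating %s to %s\n", tag, dest) // stand-in for the real push
		}(d, time.Duration(i)*step)
	}
	wg.Wait()
}

func main() {
	// A short step for the demo; a real deployment would use a much larger one.
	replicateStaggered("svc:v42", []string{"zoneB", "zoneC", "zoneD"}, time.Second)
}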
29. Download a 100MB blob onto 100 hosts in under 3 seconds
[Animation: origin shown in blue; peers in grey, downloading peers in yellow, completed peers in green.]
30. Performance in Test
Setup
● 3GB image with 2 layers
● 2600 hosts (5200 downloads)
● 300Mbps speed limit
Result
● P50 10s (at speed limit)
● P99 20s
● Max 32s
31. Performance in Production
Blobs distributed per day in busiest zone:
● 500k 0-100MB blobs
● 400k 100MB-1G blobs
● 3k 1G+ blobs
Peak
● 20k 100MB-1G blobs within 30 sec
32. Optimizations
● Low connection limit, aggressive disconnect
○ Less overhead
○ Less likely to have complete graphs
● Pipelining
○ Maintain a request queue of size n for each connection (pipelining sketch after this list)
● Endgame mode
○ For the last few pieces, request from all connected neighbors
● TTI/TTL-based deletion
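The pipelining bullet maps naturally onto a bounded in-flight window: a buffered channel of size n lets the next request go out before the previous piece has fully arrived. A Go sketch with loopback stand-ins for the peer connection:

package main

import "fmt"

const queueDepth = 4 // n outstanding requests per connection

// fetchPipelined keeps up to queueDepth piece requests in flight on a
// single connection instead of strict request/response lockstep.
func fetchPipelined(pieces []int, request func(int), receive func() int) {
	window := make(chan struct{}, queueDepth) // in-flight slots
	done := make(chan struct{})
	go func() { // receiver drains completed pieces, freeing slots
		for range pieces {
			p := receive()
			<-window
			fmt.Println("got piece", p)
		}
		close(done)
	}()
	for _, p := range pieces {
		window <- struct{}{} // blocks once queueDepth requests are outstanding
		request(p)
	}
	<-done
}

func main() {
	// Loopback stand-ins for a real peer connection.
	wire := make(chan int, 64)
	fetchPipelined([]int{0, 1, 2, 3, 4, 5},
		func(p int) { wire <- p },
		func() int { return <-wire },
	)
}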
33. Unsuccessful Optimizations
● Prefer peers on the same rack
○ Reduced download speed by half
● Reject incoming requests based on the number of mutual connections
○ Intended to avoid highly-connected subgraphs, but doesn't work against bipartite graphs
○ Haven’t seen issues caused by graph density problems
● Rarest first piece selection
○ All peers decided to download the same piece at the same time, negatively impacted speed
34. Takeaways
● Docker images are just tar files
● P2P solutions can work within data centers
● Randomization works
● Get something working before optimizing
35. Future Plan
● Open source
● Tighter integration with Mesos agent
● Other use cases
● Debuggability