Kraken
P2P Docker Registry
Cody Gibb <codyg@uber.com>, Evelyn Liu <evelynl@uber.com>, Yiran Wang <yiran@uber.com>
Agenda
● History of docker registry at Uber
● Evolution of a P2P solution
● Kraken architecture
● Performance
● Optimizations
Docker Registry at Uber 2015
● 400 services, hundreds of
compute hosts
● Static placement
● One registry host in each zone
● Local filesystem storage
○ No deletion
● Periodically sync across zones
Docker Registry at Uber 2017
● 3000+ services, thousands of
compute hosts, multiple zones
● Static placement → Mesos
● 3-5 registry hosts in each zone
○ Sharded by image names
● Local filesystem storage
○ Customized image GC tool
● Fronted by 3-10 nginx cache hosts
● Async replication with 30s delay
Problems
● Bandwidth and Disk IO limit
○ Image size p50 ~ 1G
○ 10 - 25Gbps NIC limit on registry and cache machines
○ 1000s of concurrent requests for each image
○ Projected to grow >10x per year
Network Utilization
[Chart: registry and nginx cache hosts both at 100% network utilization]
● Both registry and cache are at limit
● Worse during outages, cluster maintenance, and base image upgrades
Problems
● Bandwidth and Disk IO limit
○ Image size p50 ~ 1G
○ 10 - 25Gbps NIC limit on registry and cache machines
○ 1000s of concurrent requests for each image
○ Projected to grow >10x per year
● Replication within and across zones
○ More expensive and complex as Uber adds more zones
● Storage management
○ Maintaining cost of in-house image GC solution
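A rough back-of-the-envelope calculation, using the figures from the bullets above, shows why a central registry hits its NIC limit during a bursty deploy (the numbers are the slide's own; everything else here is illustrative):

```python
# Back-of-the-envelope: time for one registry host to serve a burst
# of concurrent pulls of a p50-sized image, using the figures above.

image_gb = 1.0            # p50 image size ~ 1G
nic_gbps = 25.0           # best-case NIC on a registry/cache host
concurrent_pulls = 1000   # 1000s of concurrent requests per image

total_gbits = image_gb * 8 * concurrent_pulls   # data the host must push out
seconds = total_gbits / nic_gbps

print(f"{seconds:.0f} s to serve {concurrent_pulls} pulls")  # prints: 320 s to serve 1000 pulls
```

Over five minutes just to saturate one host's NIC for a single burst, before disk IO or other traffic is considered.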
Ideas
● Drastically reduce image size
● Deploy one more layer of cache servers
● Explore Docker registry storage driver options
○ Ceph
○ HDFS
○ P2P?
■ Same blobs being downloaded at the same time
Similarities
Docker image / Docker registry
● Immutable blobs
○ Content addressable
● Image manifest
● Tag resolution and manifest
distribution are decoupled from
layer distribution
BitTorrent
● Immutable blobs
○ Identified by infohash (piece hashes)
● Torrent file
● Torrent file lookup and
distribution are decoupled from
the p2p protocol
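The key shared property is that both systems address immutable blobs by hashes of their content. A minimal sketch of content addressing as the Docker registry does it (the helper name is illustrative):

```python
import hashlib

def blob_digest(data: bytes) -> str:
    """Content address of an immutable blob, Docker-registry style:
    the sha256 of the bytes, so identical content gets one identity."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

layer = b"example layer contents"
d1 = blob_digest(layer)
d2 = blob_digest(layer)
assert d1 == d2                    # same bytes -> same address
assert d1.startswith("sha256:")    # digests are self-describing
```

Because the address is derived from the content, a blob can never change under its name, which is exactly what makes it safe to fetch from any untrusted-but-verifiable peer.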
Differences
Docker image / Docker registry
● Need to handle bursty load with
deadline (5 min default timeout)
● Client behaviors are controlled
and reliable
BitTorrent
● Prioritizes preserving complete
copies in the network
● Defends against selfish or
unreliable peers
POC
● Model each layer as a torrent
○ Each layer is divided into 4MB pieces
● Registry agent
○ Use docker registry code, keep all APIs
○ New storage driver with 3rd party P2P library
● Tracker
○ peer store
○ tag→metainfo(s) store
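Conceptually the POC tracker is just two maps: who has which layer, and which metainfo belongs to which tag. A minimal in-memory sketch (method names are illustrative, not Kraken's actual API):

```python
from collections import defaultdict

class Tracker:
    """POC tracker: a peer store plus a tag -> metainfo(s) store."""

    def __init__(self):
        self.peers = defaultdict(set)   # infohash -> set of peer addresses
        self.metainfos = {}             # tag -> list of per-layer metainfo

    def announce(self, infohash, peer_addr):
        """A peer announces it has (or wants) a layer; it gets back
        the other peers known for that layer."""
        others = self.peers[infohash] - {peer_addr}
        self.peers[infohash].add(peer_addr)
        return sorted(others)

    def put_tag(self, tag, layer_metainfos):
        self.metainfos[tag] = layer_metainfos

    def resolve(self, tag):
        """Docker pull first resolves a tag to its layers' metainfo."""
        return self.metainfos[tag]

t = Tracker()
t.put_tag("svc:v1", ["metainfo-layer-a", "metainfo-layer-b"])
assert t.announce("layer-a-hash", "10.0.0.1:8080") == []
assert t.announce("layer-a-hash", "10.0.0.2:8080") == ["10.0.0.1:8080"]
assert t.resolve("svc:v1") == ["metainfo-layer-a", "metainfo-layer-b"]
```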
POC
● Generate metainfo
(torrent file) per layer on
docker push
● Announce to tracker
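Generating metainfo on push amounts to splitting the layer into 4MB pieces and hashing each one, torrent-file style. A hedged sketch of that step (the dict layout is illustrative):

```python
import hashlib

PIECE_SIZE = 4 * 1024 * 1024  # 4MB pieces, as in the POC

def make_metainfo(layer: bytes) -> dict:
    """On docker push, generate torrent-style metainfo for one layer:
    the total length plus a hash for every 4MB piece, so peers can
    verify pieces independently as they arrive."""
    pieces = [layer[i:i + PIECE_SIZE] for i in range(0, len(layer), PIECE_SIZE)]
    return {
        "length": len(layer),
        "piece_size": PIECE_SIZE,
        "piece_hashes": [hashlib.sha256(p).hexdigest() for p in pieces],
    }

layer = bytes(10 * 1024 * 1024)       # a fake 10MB layer
mi = make_metainfo(layer)
assert len(mi["piece_hashes"]) == 3   # 4MB + 4MB + 2MB
```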
POC
● Docker pull = a series of
requests from local
docker daemon
● Resolve tag to metainfo
of layers first
POC
● Announce to tracker for
each layer, get list of
peers
● Hold connection from
local docker daemon
POC
● Peers locate each other and
the seeder through the tracker
● ???
● Download succeeds
Production Considerations
In-house library, optimized for data-center internal usage
● Peer connection
○ Central decision vs local decisions
○ Topology
■ Tree
● Rack-aware?
■ Graph
● Piece selection
○ Central decision vs local decisions
○ Selection algorithm
○ Piece size
Piece Selection
Central decision
● More likely to be optimal
● High load on tracker, won’t scale
Local decisions
● Limited information
Piece Selection
Random
● Easy to implement
Rarest first
● “Rarest First and Choke Algorithms are Enough” (Legout et al.)
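Rarest first picks, among the pieces a peer still needs, the one held by the fewest neighbors, so rare pieces spread before they can become a bottleneck. A minimal sketch of the selection rule (not Kraken's implementation):

```python
import random
from collections import Counter

def rarest_first(needed, neighbor_bitfields):
    """Pick the needed piece held by the fewest neighbors.
    Ties are broken randomly so peers don't all converge on one piece."""
    counts = Counter()
    for bitfield in neighbor_bitfields:
        for piece in bitfield:
            if piece in needed:
                counts[piece] += 1
    available = [p for p in needed if counts[p] > 0]
    if not available:
        return None  # nobody we know has anything we need
    rarity = min(counts[p] for p in available)
    return random.choice([p for p in available if counts[p] == rarity])

# piece 2 is held by only one neighbor, so it is chosen first
neighbors = [{0, 1}, {0, 1, 2}, {0, 1}]
assert rarest_first({0, 1, 2}, neighbors) == 2
```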
Piece Selection
Smaller piece size
● Faster downloads
Bigger piece size
● Less communication overhead
● Required if piece selection is decided by a central component
Peer Connection
Central decision
● Debuggability
● Easier to shutdown to avoid disasters
● Easier to apply optimizations and migrations
Local decisions
● Scalability
● Still need a few well known nodes
Peer Connection
Tree
● Speed limited by number of
branches
● Hard to handle host failures
Peer Connection
Optimal graph
● Regular graph
● <=log(m*n) ramp-up time to place
the initial pieces
● All nodes upload/download at the
max speed, if piece selection is also
optimal
● Need to manage each piece, hard to
scale
Peer Connection
Random k-regular graph
● K-connected
=> K paths to seeders
● Diameter ~log(n)
=> Close to seeders
● Every peer downloads at
> 75% of max speed with
random piece selection
● Hard to keep it k-regular
Decisions
● Peer connection
○ Central decision by tracker (mostly), random selection
■ Tracker returns 100 random completed peers, dedicated seeders, and incomplete peers
■ Each peer iterates through the 100 until it has 10 connections
● Piece selection
○ Local decision
○ Random selection
■ Evaluate rarest first later
○ 4MB piece size
■ Configurable, evaluate other choices later
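The peer-connection decision above is simple to sketch: shuffle the candidate list the tracker handed back and keep dialing until 10 connections stick (the function and its parameters are illustrative, not Kraken's actual API):

```python
import random

MAX_CONNECTIONS = 10  # per-peer connection limit from the slide above

def pick_neighbors(tracker_peers, try_connect):
    """Iterate through the ~100 peers returned by the tracker, in
    random order, until MAX_CONNECTIONS connections succeed."""
    peers = list(tracker_peers)
    random.shuffle(peers)
    connected = []
    for peer in peers:
        if len(connected) == MAX_CONNECTIONS:
            break
        if try_connect(peer):
            connected.append(peer)
    return connected

candidates = [f"10.0.0.{i}" for i in range(100)]
# assume even-numbered hosts accept the connection, odd ones refuse
chosen = pick_neighbors(candidates, lambda p: int(p.rsplit(".", 1)[1]) % 2 == 0)
assert len(chosen) == 10
assert all(int(p.rsplit(".", 1)[1]) % 2 == 0 for p in chosen)
```

Random selection plus a low connection cap is what keeps the resulting topology close to a random k-regular graph without any central component tracking edges.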
Kraken Architecture
Kraken core
● Zone local
● Only dependency is DNS
● Handle any content addressable
blobs
Kraken Architecture
Kraken core (cont’d)
● Agent
○ Implement registry interface
○ On every host
● Origin
○ Dedicated seeders
○ Pluggable storage backend
○ Self-healing hash ring
○ Ephemeral
● Tracker
○ Metainfo and peers
○ Self-healing hash ring (WIP)
○ Ephemeral
Kraken Architecture
Kraken index
● Zone local
● Resolves human readable tags
● Handles async replication to other
clusters
● k copies with staggered delay
● No consistency guarantee =>
No need for consensus protocols
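The "k copies with staggered delay" idea can be sketched as a tiny scheduler: pick k destination zones and offset their start times so the replicas don't all compete for upload bandwidth at once (a sketch of the scheduling idea only; zone names, the helper, and the 30s base delay here are assumptions):

```python
def replication_schedule(zones, k, base_delay_s=30):
    """Pick k destination zones and stagger their start times so the
    k copies don't all hit the network simultaneously."""
    targets = zones[:k]
    return [(zone, i * base_delay_s) for i, zone in enumerate(targets)]

sched = replication_schedule(["zone-a", "zone-b", "zone-c"], k=2)
assert sched == [("zone-a", 0), ("zone-b", 30)]
```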
Global Replication
● < 1min
● No data loss
Download 100MB blob onto 100 hosts in under 3 seconds
[Animation legend: Blue = origin, Grey = peer, Yellow = peer (downloading), Green = peer (completed)]
Performance in Test
Setup
● 3G image with 2 layers
● 2600 hosts (5200 downloads)
● 300Mbps speed limit
Result
● P50 10s (at speed limit)
● P99 20s
● Max 32s
Performance in Production
Blobs distributed per day in busiest zone:
● 500k 0-100MB blobs
● 400k 100MB-1G blobs
● 3k 1G+ blobs
Peak
● 20k 100MB-1G blobs within 30 sec
Optimizations
● Low connection limit, aggressive disconnect
○ Less overhead
○ Less likely to have complete graphs
● Pipelining
○ Maintain a request queue of size n for each connection
● Endgame mode
○ For the last few pieces, request from all connected neighbors
● TTI/TTL based deletion
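Endgame mode is the easiest of these to sketch: once only a few pieces remain, request each of them from every connected neighbor instead of just one, so a single slow peer can't stall the tail of the download (threshold and names here are illustrative):

```python
def endgame_requests(needed_pieces, neighbors, threshold=3):
    """Endgame mode: for the last few pieces, fan each request out to
    all connected neighbors; duplicate responses are simply discarded."""
    if len(needed_pieces) > threshold:
        return None  # normal mode: one outstanding request per piece
    return [(piece, n) for piece in sorted(needed_pieces) for n in neighbors]

reqs = endgame_requests({41, 42}, ["peerA", "peerB"])
assert reqs == [(41, "peerA"), (41, "peerB"), (42, "peerA"), (42, "peerB")]
```

The cost is a small amount of duplicated transfer at the very end, traded for a much tighter tail latency.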
Unsuccessful Optimizations
● Prefer peers on the same rack
○ Reduced download speed by half
● Reject incoming request based on number of mutual connections
○ Intended to avoid highly-connected subgraphs, but doesn't work against bipartite graphs
○ Haven’t seen issues caused by graph density problems
● Rarest first piece selection
○ All peers decided to download the same piece at the same time, which negatively impacted speed
Takeaways
● Docker images are just tar files
● P2P solutions can work within data centers
● Randomization works
● Get something working first before optimization
Future Plan
● Open source
● Tighter integration with Mesos agent
● Other use cases
● Debuggability

Kraken mesoscon 2018

  • 1. Kraken P2P Docker Registry Cody Gibb <codyg@uber.com>, Evelyn Liu <evelynl@uber.com>, Yiran Wang <yiran@uber.com>
  • 2. Agenda ● History of docker registry at Uber ● Evolution of a P2P solution ● Kraken architecture ● Performance ● Optimizations
  • 3. Docker Registry at Uber 2015 ● 400 services, hundreds of compute hosts ● Static placement ● One registry host in each zone ● Local filesystem storage ○ No deletion ● Periodically sync across zones
  • 4. Docker Registry at Uber 2017 ● 3000+ services, thousands of compute hosts, multiple zones ● Static placement → Mesos ● 3-5 registry hosts in each zone ○ Sharded by image names ● Local filesystem storage ○ Customized image gc tool ● Fronted by 3-10 nginx cache hosts ● Async replication with 30s delay
  • 5. Problems ● Bandwidth and Disk IO limit ○ Image size p50 ~ 1G ○ 10 - 25Gbps NIC limit on registry and cache machines ○ 1000s of concurrent requests for each image ○ Projected to grow >10x per year
  • 6. Registry Nginx 100% 100% ● Both registry and cache are at limit ● Worse during outages, cluster maintenance, and base image upgrade Network Utilization
  • 7. Problems ● Bandwidth and Disk IO limit ○ Image size p50 ~ 1G ○ 10 - 25Gbps NIC limit on registry and cache machines ○ 1000s of concurrent requests for each image ○ Projected to grow >10x per year ● Replication within and across zones ○ More expensive and complex as Uber add more zones ● Storage management ○ Maintaining cost of in-house image GC solution
  • 8. Ideas ● Drastically reduce image size ● Deploy one more layer of cache servers ● Explore Docker registry storage driver options ○ Ceph ○ HDFS ○ P2P? ■ Same blobs being downloaded at the same time
  • 9. Similarities Docker image / Docker registry ● Immutable blobs ○ Content addressable ● Image manifest ● Tag resolution and manifest distribution is decoupled from layer distribution BitTorrent ● Immutable blobs ○ Identified by infohash (piece hashes) ● Torrent file ● Torrent file lookup and distribution is decoupled from the p2p protocol
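The content-addressable property both systems share (slide 9) can be sketched in a few lines of Go: a blob's identity is just the hash of its bytes, so identical layers resolve to identical digests regardless of tags, and any peer can verify what it downloaded. This is an illustrative sketch, not Kraken's actual code; `blobDigest` is a hypothetical helper.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// blobDigest returns the content address of a blob, in the
// "sha256:<hex>" form Docker uses for layers and manifests.
func blobDigest(blob []byte) string {
	sum := sha256.Sum256(blob)
	return fmt.Sprintf("sha256:%x", sum)
}

func main() {
	layer := []byte("hello")
	// The same bytes always hash to the same digest, so a peer
	// can verify a downloaded blob without trusting its source.
	fmt.Println(blobDigest(layer))
}
```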
  • 10. Differences Docker image / Docker registry ● Need to handle bursty load with a deadline (5 min default timeout) ● Client behaviors are controlled and reliable BitTorrent ● Prioritizes preserving complete copies in the network ● Defends against selfish or unreliable peers
  • 11. POC ● Model each layer as a torrent ○ Each layer is divided into 4MB pieces ● Registry agent ○ Use docker registry code, keep all APIs ○ New storage driver with 3rd party P2P library ● Tracker ○ peer store ○ tag→metainfo(s) store
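The POC's "model each layer as a torrent" step (slide 11) amounts to splitting the layer into 4MB pieces and hashing each one, so peers can verify chunks independently as they arrive. A minimal sketch, with `Metainfo` and `buildMetainfo` as illustrative names rather than Kraken's real types:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const pieceSize = 4 * 1024 * 1024 // 4MB pieces, as in the POC

// Metainfo mirrors a torrent file: per-piece hashes let a peer
// verify each 4MB chunk the moment it is downloaded.
type Metainfo struct {
	PieceLength int
	PieceHashes [][32]byte
}

// buildMetainfo splits a layer blob into fixed-size pieces and
// records the hash of each; the last piece may be shorter.
func buildMetainfo(blob []byte) Metainfo {
	mi := Metainfo{PieceLength: pieceSize}
	for off := 0; off < len(blob); off += pieceSize {
		end := off + pieceSize
		if end > len(blob) {
			end = len(blob)
		}
		mi.PieceHashes = append(mi.PieceHashes, sha256.Sum256(blob[off:end]))
	}
	return mi
}

func main() {
	blob := make([]byte, 10*1024*1024) // a 10MB layer → 3 pieces
	fmt.Println(len(buildMetainfo(blob).PieceHashes))
}
```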
  • 12. POC ● Generate metainfo (torrent file) per layer on docker push ● Announce to tracker
  • 13. POC ● Docker pull = a series of requests from local docker daemon ● Resolve tag to metainfo of layers first
  • 14. POC ● Announce to tracker for each layer, get list of peers ● Hold connection from local docker daemon
  • 15. POC ● Locate each other and seeder through tracker ● ??? ● Download succeeds
  • 16. Production Considerations In house library, optimize for data center internal usage ● Peer connection ○ Central decision vs local decisions ○ Topology ■ Tree ● Rack-aware? ■ Graph ● Piece selection ○ Central decision vs local decisions ○ Selection algorithm ○ Piece size
  • 17. Piece Selection Central decision ● More likely to be optimal ● High load on tracker, won’t scale Local decisions ● Limited information
  • 18. Piece Selection Random ● Easy to implement Rarest first ● “Rarest First and Choke Algorithms are Enough” (Legout et al.)
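Random piece selection (slide 18) is easy to implement precisely because each peer needs only local information: its own bitfield and the remote peer's. A hypothetical sketch, assuming simple boolean bitfields:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickRandomPiece chooses uniformly among pieces this peer still
// needs and the remote peer advertises in its bitfield.
// Returns -1 if the remote has nothing useful.
func pickRandomPiece(have, remoteHas []bool) int {
	var candidates []int
	for i := range have {
		if !have[i] && remoteHas[i] {
			candidates = append(candidates, i)
		}
	}
	if len(candidates) == 0 {
		return -1
	}
	return candidates[rand.Intn(len(candidates))]
}

func main() {
	have := []bool{true, false, false, true}
	remote := []bool{true, true, false, true}
	// Only piece 1 is both missing locally and present remotely.
	fmt.Println(pickRandomPiece(have, remote))
}
```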
  • 19. Piece Selection Smaller piece size ● Faster downloads Bigger piece size ● Less communication overhead ● Required if piece selection is decided by central component
  • 20. Peer Connection Central decision ● Debuggability ● Easier to shut down to avoid disasters ● Easier to apply optimizations and migrations Local decisions ● Scalability ● Still need a few well known nodes
  • 21. Peer Connection Tree ● Speed limited by number of branches ● Hard to handle host failures
  • 22. Peer Connection Optimal graph ● Regular graph ● <=log(m*n) ramp-up time to place the initial pieces ● All nodes upload/download at the max speed, if piece selection is also optimal ● Need to manage each piece, hard to scale
  • 23. Peer Connection Random k-regular graph ● k-connected => k paths to seeders ● Diameter ~log(n) => close to seeders ● Every peer downloads at > 75% of max speed with random piece selection ● Hard to keep it k-regular
  • 24. Decisions ● Peer connection ○ Central decision by tracker (mostly), random selection ■ Tracker returns 100 random completed peers, dedicated seeders, and incomplete peers ■ Peer iterates through the 100 until it has 10 connections ● Piece selection ○ Local decision ○ Random selection ■ Evaluate rarest first later ○ 4MB piece size ■ Configurable, evaluate other choices later
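The connection-building decision on slide 24 (walk the tracker's 100 candidates until 10 connections are open) can be sketched as a simple loop; `dial`, `connect`, and the peer addresses here are illustrative stand-ins, not Kraken's API:

```go
package main

import "fmt"

const maxConns = 10 // per-peer connection limit from slide 24

// dial is a stand-in for opening a peer connection; in practice
// some handshakes fail and the loop simply moves on to the next
// candidate.
func dial(addr string) bool { return true }

// connect walks the tracker's candidate list (up to 100 random
// peers) in order and stops once the connection limit is reached.
func connect(candidates []string) []string {
	var conns []string
	for _, addr := range candidates {
		if len(conns) >= maxConns {
			break
		}
		if dial(addr) {
			conns = append(conns, addr)
		}
	}
	return conns
}

func main() {
	candidates := make([]string, 100)
	for i := range candidates {
		candidates[i] = fmt.Sprintf("peer-%d", i)
	}
	// With 100 candidates and a limit of 10, only 10 are dialed.
	fmt.Println(len(connect(candidates)))
}
```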
  • 25. Kraken Architecture Kraken core ● Zone local ● Only dependency is DNS ● Handle any content addressable blobs
  • 26. Kraken Architecture Kraken core (cont’d) ● Agent ○ Implement registry interface ○ On every host ● Origin ○ Dedicated seeders ○ Pluggable storage backend ○ Self-healing hash ring ○ Ephemeral ● Tracker ○ Metainfo and peers ○ Self-healing hash ring (WIP) ○ Ephemeral
  • 27. Kraken Architecture Kraken index ● Zone local ● Resolves human readable tags ● Handles async replication to other clusters ● k copies with staggered delay ● No consistency guarantee => No need for consensus protocols
  • 28. Global Replication ● < 1min ● No data loss
  • 29. Download 100MB blob onto 100 hosts under 3 seconds Blue Origin Grey Peer Yellow Peer (downloading) Green Peer (completed)
  • 30. Performance in Test Setup ● 3G image with 2 layers ● 2600 hosts (5200 downloads) ● 300Mbps speed limit Result ● P50 10s (at speed limit) ● P99 20s ● Max 32s
  • 31. Performance in Production Blobs distributed per day in busiest zone: ● 500k 0-100MB blobs ● 400k 100MB-1G blobs ● 3k 1G+ blobs Peak ● 20k 100MB-1G blobs within 30 sec
  • 32. Optimizations ● Low connection limit, aggressive disconnect ○ Less overhead ○ Less likely to have complete graphs ● Pipelining ○ Maintain a request queue of size n for each connection ● Endgame mode ○ For the last few pieces, request from all connected neighbors ● TTI/TTL based deletion
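Endgame mode (slide 32) trades duplicate traffic for latency on the final pieces: once few pieces remain, each one is requested from every connected neighbor and the first reply wins. A sketch under assumed names (`endgameThreshold` and `requestPieces` are illustrative, not Kraken's code):

```go
package main

import "fmt"

const endgameThreshold = 3 // illustrative cutoff for endgame mode

// requestPieces normally asks a single peer per piece, but in
// endgame mode (few pieces left) it fans each remaining piece
// out to every connected neighbor.
func requestPieces(remaining []int, neighbors []string) map[int][]string {
	reqs := make(map[int][]string)
	endgame := len(remaining) <= endgameThreshold
	for _, p := range remaining {
		if endgame {
			reqs[p] = append([]string(nil), neighbors...) // ask everyone
		} else {
			reqs[p] = neighbors[:1] // ask one peer
		}
	}
	return reqs
}

func main() {
	// Two pieces left, three neighbors: endgame fans out to all.
	reqs := requestPieces([]int{7, 8}, []string{"a", "b", "c"})
	fmt.Println(len(reqs[7]))
}
```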
  • 33. Unsuccessful Optimizations ● Prefer peers on the same rack ○ Reduced download speed by half ● Reject incoming request based on number of mutual connections ○ Intended to avoid highly-connected subgraphs, but doesn't work against bipartite graphs ○ Haven’t seen issues caused by graph density problems ● Rarest first piece selection ○ All peers decided to download the same piece at the same time, negatively impacted speed
  • 34. Takeaways ● Docker images are just tar files ● P2P solutions can work within data centers ● Randomization works ● Get something working first before optimization
  • 35. Future Plan ● Open source ● Tighter integration with Mesos agent ● Other use cases ● Debuggability