Facebook is one of the largest sites in the world, with datacenters and POPs around the globe and a very large fleet of machines.
In this talk we use DHCP as an example to discuss why designing stateless systems is a good idea, and to explore the fine line between adopting an open source product and taking a "Not Invented Here" approach.
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
2. Evolution of the infrastructure
DHCP Infra @ Facebook
Angelo “pallotron” Failla <pallotron@fb.com> - ClusterOps Dublin
Incontro DevOps Italia 2015, Bologna 10/04/2015
3. Who is Angelo?
• First met Internet in 1994
• Linux user since 1999
• Met Freaknet Medialab in 1999
• In Ireland since early 2008
• Joined Facebook Dublin in early 2011
• Started in the S.R.O. team
• Automated himself out of the job (thanks FBAR!)
• Joined ClusterOPS team in 2013
6. Agenda
• Cluster overview
• DHCP: how and why it’s used
• Old architecture and its limits
• How we solved those limits
• Lessons learned and other takeaways
9. DHCP: how and why?
For bare metal provisioning:
• At reboot
• Used to install OS on hosts
• Anaconda based
• iPXE
For Out Of Band management:
• To assign IPs to OOB interfaces
• Leases typically renewed once a day
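10. Anatomy of a DHCPv4 handshake
DHCP client -> DHCP relayer -> DHCP server
• DISCOVER (broadcast): client -> relayer
• DISCOVER (unicast): relayer -> server
• OFFER: server -> relayer
• OFFER: relayer -> client
• REQUEST (broadcast): client -> relayer
• REQUEST (unicast): relayer -> server
• ACK: server -> relayer
• ACK: relayer -> client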
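The relay step in the handshake can be sketched in a few lines of Python. This is a minimal illustration of the standard BOOTP relay behaviour (stamp `giaddr` with the relay's address if it is still zero, increment `hops`), not the code running on any real TOR; the helper names `build_discover` and `relay_to_server` and the addresses used are assumptions made for the example.

```python
import socket
import struct

BOOTREQUEST = 1
MAGIC_COOKIE = b"\x63\x82\x53\x63"
DHCPDISCOVER = 1


def build_discover(mac: bytes, xid: int) -> bytes:
    """Build a minimal DHCPv4 DISCOVER as a client would broadcast it."""
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        BOOTREQUEST,   # op
        1,             # htype: Ethernet
        6,             # hlen: MAC length
        0,             # hops
        xid,           # transaction id
        0,             # secs
        0x8000,        # flags: broadcast bit set
        b"\0" * 4,     # ciaddr
        b"\0" * 4,     # yiaddr
        b"\0" * 4,     # siaddr
        b"\0" * 4,     # giaddr: empty until a relay fills it in
        mac,           # chaddr (struct pads it to 16 bytes)
        b"",           # sname
        b"",           # file
    )
    # Option 53 (message type) = DISCOVER, then the end option (255).
    options = MAGIC_COOKIE + bytes([53, 1, DHCPDISCOVER]) + b"\xff"
    return header + options


def relay_to_server(packet: bytes, relay_ip: str) -> bytes:
    """Return the packet as a relay would unicast it to the DHCP server."""
    out = bytearray(packet)
    out[3] += 1                                   # hops
    if out[24:28] == b"\0\0\0\0":                 # giaddr not set yet
        out[24:28] = socket.inet_aton(relay_ip)   # tells the server where to reply
    return bytes(out)
```

The server uses `giaddr` both to pick the right subnet for the client and to know which relay to send the OFFER and ACK back to.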
11. What about DHCPv6 (RFC3315)?
It’s similar but with a few differences:
• Different option names and formats
• Doesn’t deliver routing info (done by IPv6 via RA packets)
• 255.255.255.255 -> ff02::1:2 (special multicast IP) -> needs Relayer
• DUID (“DHCP Unique IDentifier”) replaces MAC
• we use DUID-LL[T]
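For reference, the two DUID flavours mentioned above embed the link-layer address as defined in RFC 3315: DUID-LLT is type 1 followed by hardware type, a timestamp (seconds since 2000-01-01 UTC, modulo 2^32) and the address; DUID-LL is type 3 followed by hardware type and the address. The sketch below assumes Ethernet (hardware type 1); the function names are made up for the example.

```python
import struct
from datetime import datetime, timezone

DUID_LLT = 1          # link-layer address plus time
DUID_LL = 3           # link-layer address only
HW_ETHERNET = 1
DUID_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def duid_ll(mac: bytes) -> bytes:
    """Encode a DUID-LL: type, hardware type, link-layer address."""
    return struct.pack("!HH", DUID_LL, HW_ETHERNET) + mac


def duid_llt(mac: bytes, when: datetime) -> bytes:
    """Encode a DUID-LLT: type, hardware type, time, link-layer address."""
    seconds = int((when - DUID_EPOCH).total_seconds()) % 2**32
    return struct.pack("!HHI", DUID_LLT, HW_ETHERNET, seconds) + mac


def mac_from_duid(duid: bytes) -> bytes:
    """Recover the link-layer address from a DUID-LL or DUID-LLT."""
    (duid_type,) = struct.unpack("!H", duid[:2])
    if duid_type == DUID_LL:
        return duid[4:]
    if duid_type == DUID_LLT:
        return duid[8:]
    raise ValueError(f"DUID type {duid_type} carries no link-layer address")
```

Being able to extract the MAC from the DUID is what lets a server key DHCPv6 clients against the same inventory records used for DHCPv4.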
12. [Diagram: datacenter routers connect through uplinks to four cluster switches (CSW); below them, each rack has a TOR switch acting as DHCP relayer for its servers; the DHCP servers sit in racks alongside the other servers.]
14. Problem: bootstrapping a cluster
[Diagram: many TORs relay to a pair of DHCP servers (active and standby) behind a load balancer (LB). Labels: routable; static config for all DC (on each DHCP server); intra datacenter; remote cluster.]
15. Problem: configuration distribution
[Diagram: Inventory System -> periodic job -> git repo -> grocery-delivery -> Chef infrastructure -> DHCP server (/etc/dhcpd/…, then /etc/init.d/dhcpd restart)]
18. Problem: lack of instrumentation
• Lack of instrumentation, we were oblivious to things like:
• # RPS
• client latencies
• # of errors/exceptions
• flying blind
Photo by Bill Abbott
19. Goals for the new system
• Support both DHCPv4 and DHCPv6
• Stateless server
• Get rid of the F5 load balancers
• Must be easy to “containerize”
• Integrated with Facebook infrastructure
Photo by Angelo Failla
21. Enter ISC KEA
• New DHCP rewrite from ISC (Internet Systems Consortium)
• Started in 2009 (BIND10), DHCP component started in 2011
• Why a re-write?
• ISC DHCPD code is ~18 years old
• Monolithic code
• Managed open source model (closed repo, semi-closed bug tracking)
• Poor performance
• Complex code / not modular / not easy to extend
• Not built using modern software development models
22. libdhcp++
General purpose DHCP library:
• IPv4/IPv6 packet parsing/assembly
• IPv4/IPv6 options parsing/assembly
• interface detection (Linux, partial BSD/Mac OSX)
• socket management
Components shown alongside it: DHCPv4 Server, DHCPv6 Server, DNS Updates, perfdhcp, JSON Configuration
24. Life of a packet in the FB Hook library
• inbound packet
• KEA initial processing
• pkt[46]_receive: FB Infra (e.g.: logging, alerting, metrics, inventory, others), Cache
• subnet[46]_select (skipped)
• lease[46]_select (skipped)
• pkt[46]_send: FB Infra (e.g.: logging, alerting, metrics, inventory, others)
• KEA final assembly
• outbound packet
The CalloutHandle Context Object carries per-packet state between the callouts.
Note: skipped subnet/lease selection means the packet is empty when it reaches pkt[46]_send and needs to be filled in.
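The flow above can be modelled in a few lines. Real KEA hooks are C++ shared libraries, so the Python below is only a model of the control flow, not the KEA hooks API: the `Packet` class, the `INVENTORY` dictionary, the field names and the decision to drop unknown hosts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    mac: str
    fields: dict = field(default_factory=dict)


# Hypothetical stand-in for the external inventory system.
INVENTORY = {
    "00:11:22:33:44:55": {"ip": "10.0.0.5", "boot_file": "ipxe.efi"},
}


def pkt_receive(query: Packet, context: dict) -> None:
    """Receive callout: look the host up and stash the result for later."""
    context["host"] = INVENTORY.get(query.mac)


def pkt_send(response: Packet, context: dict) -> None:
    """Send callout: fill in the response that selection left empty."""
    host = context["host"]
    response.fields["yiaddr"] = host["ip"]
    response.fields["boot_file"] = host["boot_file"]


def handle(query: Packet):
    """Run one packet through the modelled hook points."""
    context = {}                       # per-packet state shared by callouts
    pkt_receive(query, context)
    if context["host"] is None:
        return None                    # assumption: unknown hosts are dropped
    # Subnet and lease selection are skipped, so the response starts empty.
    response = Packet(mac=query.mac)
    pkt_send(response, context)
    return response
```

Nothing is stored on the server between packets: every answer is derived from the external lookup, which is what makes the server stateless.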
30. No more static configuration
• Configuration for hosts is pulled dynamically from inventory
• DCOPs people are happy (no more problems during swaps)
• Makes deployment easier, only need to generate a small JSON file
• Integrated with “configerator”: our configuration infrastructure based on Python DSL.
• Version controlled, canary support, hot reload support, etc.
31. No more hardware load balancing!
• Moved to Anycast + ECMP
• Packets sent to the anycast address are delivered to the nearest server
• Same fleet-wide “anycast” IP is assigned to all DHCP servers
• Address assigned to ip6tnl0/tunl0 interfaces
• ExaBGP is used to advertise the anycast IP
• Servers become routers and part of the network infrastructure
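One common way to drive ExaBGP is a helper process that writes `announce route` and `withdraw route` commands to its standard output, which ExaBGP reads and turns into BGP updates. The sketch below shows that pattern under stated assumptions: the prefix is a documentation address, and the `check` callable (the health probe) is hypothetical and left to the caller. It is not the script used in the talk.

```python
import sys
import time

# Hypothetical anycast prefix (taken from a documentation range).
ANYCAST_PREFIX = "192.0.2.67/32"


def bgp_command(healthy: bool, announced: bool):
    """Return the ExaBGP command needed to match health, or None."""
    if healthy and not announced:
        return f"announce route {ANYCAST_PREFIX} next-hop self"
    if not healthy and announced:
        return f"withdraw route {ANYCAST_PREFIX} next-hop self"
    return None


def run(check, interval: float = 5.0) -> None:
    """Poll the health check forever, emitting commands on state changes."""
    announced = False
    while True:
        healthy = check()
        command = bgp_command(healthy, announced)
        if command is not None:
            sys.stdout.write(command + "\n")
            sys.stdout.flush()         # ExaBGP reads the pipe line by line
            announced = healthy
        time.sleep(interval)
```

When a server stops being healthy its route is withdrawn, and ECMP simply spreads traffic over the remaining servers advertising the same address.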
38. First IPv6-only cluster in Luleå, Sweden
• Found bug in BIOS/firmware (*ALL* of the machines in cluster)
• Unable to fetch PXE seed via TFTPv6 when client and server are on different VLANs
• Vendor was made aware of the problem but fix wasn’t going to be fast (multiple months)
39. The workaround
• Realized Cisco N3Ks can run Python scripts
• Wrote quick and dirty Python TFTP relayer
• Deployed it in all TORs in the cluster
• Modified KEA hook lib to override the IP of the cluster TFTP endpoint with the IP of the machine in the rack
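The relayer itself is not shown in the slides. As a rough idea of what its first step involves, the sketch below decodes and re-encodes a TFTP read request (RRQ) as laid out in RFC 1350: a 2-byte opcode, then the filename and the transfer mode, each NUL-terminated. The function names are made up for the example, and a real relayer would also have to forward the data and acknowledgement packets of each transfer.

```python
import struct

OP_RRQ = 1  # TFTP read request opcode


def parse_rrq(datagram: bytes):
    """Return (filename, mode) from a TFTP read request."""
    (opcode,) = struct.unpack("!H", datagram[:2])
    if opcode != OP_RRQ:
        raise ValueError(f"not a read request (opcode {opcode})")
    parts = datagram[2:].split(b"\0")
    if len(parts) < 3:
        raise ValueError("malformed read request")
    filename, mode = parts[0], parts[1]
    return filename.decode("ascii"), mode.decode("ascii").lower()


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Encode a read request to send on to the upstream TFTP server."""
    return (
        struct.pack("!H", OP_RRQ)
        + filename.encode("ascii") + b"\0"
        + mode.encode("ascii") + b"\0"
    )
```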
41. • Design your system to be as stateless as possible
• Try not to store state locally
• Let external endpoints / DBs keep state for you
• All of the above will make deployment easier (read containers)…
• …and operationally easy to support
• Stress your application properly, possibly with real production data, to find its breaking point
42. The NIH syndrome
• It isn’t necessarily a bad thing
• Re-use others’ technology or write your own?
• Sometimes you have to write things from scratch yourself for various reasons
• The most important thing is that you are aware of it and make a reasoned decision
Photo by Andrew Tarvin