8. • We hire responsible adults and keep rules and policies to a minimum
• Developers can change any code in production at any time
• And things don’t break (usually)
Freedom and Responsibility
9. Automate all the things!
http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html
10. • Application startup
• Configuration
• Code deployment
• System deployment
Automate all the things!
11. • Standard base image
• Tools to manage all the systems
• Reduce errors through reproducibility
Automation
12. Shared state should be stored in a shared service
Data on an instance should be replicated to other instances
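A minimal sketch of the idea on this slide, assuming a hypothetical `SharedStore` class that stands in for whatever shared service holds the state (the slides do not name one). The instances keep no session data of their own, so any instance can serve any request and losing one instance loses nothing.

```python
class SharedStore:
    """Hypothetical stand-in for a shared service (a cache or database cluster)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class Instance:
    """An application instance that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_queue(self, user_id, title):
        # Read, modify, and write back through the shared store only.
        queue = list(self.store.get(user_id, []))
        queue.append(title)
        self.store.put(user_id, queue)
        return queue


store = SharedStore()
a = Instance("a", store)
b = Instance("b", store)

a.add_to_queue("user-1", "Movie One")
del a  # instance "a" is terminated
queue_seen_by_b = b.add_to_queue("user-1", "Movie Two")
```

Instance "b" still sees the title that instance "a" added, because the state never lived on "a".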
13. “Build for three”
We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
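One reading of "build for three" is to spread every service across three zones so that losing any one zone still leaves most of the capacity. A rough sketch under that assumption; the zone names and the `spread` and `surviving_capacity` helpers are hypothetical.

```python
from collections import Counter
from itertools import cycle

ZONES = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names


def spread(instance_count, zones=ZONES):
    """Assign instances to zones round-robin so the zones stay balanced."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_capacity(placement, failed_zone):
    """Fraction of instances still running after one zone fails."""
    remaining = [zone for zone in placement if zone != failed_zone]
    return len(remaining) / len(placement)


placement = spread(9)
per_zone = Counter(placement)
capacity_after_failure = surviving_capacity(placement, "zone-b")
```

With nine instances spread evenly, a single zone failure leaves two thirds of the fleet, which is the figure capacity planning would have to be sized against.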
16. 12B outbound requests per day to API dependencies
Movie Ratings, Personalization Engine, User Info, Movie Metadata, Similar Movies, Reviews, A/B Test Engine
2B requests per day into the Netflix API
Discovery API, Streaming API
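The two figures on this slide imply an average fan-out of about six dependency calls per inbound request (12B / 2B). A sketch of that fan-out pattern follows; the stub functions are hypothetical placeholders for remote dependency services, and the thread pool is one possible way to issue the calls concurrently, not a description of the Netflix API's internals.

```python
from concurrent.futures import ThreadPoolExecutor

INBOUND_PER_DAY = 2_000_000_000     # from the slide
OUTBOUND_PER_DAY = 12_000_000_000   # from the slide
average_fan_out = OUTBOUND_PER_DAY / INBOUND_PER_DAY


# Hypothetical stubs; real dependencies would be remote service calls.
def movie_ratings(user_id):
    return {"ratings": []}


def user_info(user_id):
    return {"user": user_id}


def similar_movies(user_id):
    return {"similar": []}


DEPENDENCIES = {
    "movie_ratings": movie_ratings,
    "user_info": user_info,
    "similar_movies": similar_movies,
}


def handle_request(user_id):
    """Call every dependency concurrently and merge the results by name."""
    with ThreadPoolExecutor(max_workers=len(DEPENDENCIES)) as pool:
        futures = {name: pool.submit(fn, user_id) for name, fn in DEPENDENCIES.items()}
        return {name: future.result() for name, future in futures.items()}


response = handle_request("user-1")
```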
17. Movie Ratings, Personalization Engine, User Info, Movie Metadata, Similar Movies, Reviews, A/B Test Engine
Discovery API, Streaming API
Content Encoding, CDN Management, QOS Logging, DRM
OpenConnect Edge Locations
Browse, Play, Watch
18. • Services are built by different teams who work together to figure out what each service will provide.
• The service owner publishes an API that anyone can use.
Highly aligned, loosely coupled
19. • Easier auto-scaling
• Easier capacity planning
• Identify problematic code paths more easily
• Narrow down the effects of a change
• More efficient local caching
Advantages of a Service-Oriented Architecture
20. • Developers deploy when they want
• They also manage their own capacity and autoscaling
• And fix anything that breaks at 4am!
Freedom and Responsibility
30. • Supports all regions and zones
• Multiple accounts
• Cross region/account replication
• Internationalized, localized and GeoIP routed
• Advanced key management
• Autoscaling with 1000s of instances
• Monitoring and alerting on millions of metrics
Netflix PaaS features
Netflix OSS
38. • Chaos -- Kills random instances
• Chaos Gorilla -- Kills zones
• Chaos Kong -- Kills regions
• Latency -- Degrades network and injects faults
• Conformity -- Looks for outliers
• Circus -- Kills and launches instances to maintain zone balance
• Doctor -- Fixes unhealthy resources
• Janitor -- Cleans up unused resources
• Howler -- Yells about bad things like Amazon limit violations
• Security -- Finds security issues and expiring certificates
The simian army
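The first two entries (Chaos and Chaos Gorilla) can be illustrated with a toy fleet. This is a sketch of the idea only, not the Simian Army code; the `Fleet` class, its methods, and the instance and zone names are hypothetical.

```python
import random


class Fleet:
    """Hypothetical in-memory model of instances grouped by zone."""

    def __init__(self, instances_by_zone):
        self.instances_by_zone = {z: list(ids) for z, ids in instances_by_zone.items()}

    def all_instances(self):
        return [i for ids in self.instances_by_zone.values() for i in ids]

    def kill_random_instance(self, rng):
        """Chaos: terminate one randomly chosen instance."""
        victim = rng.choice(self.all_instances())
        for ids in self.instances_by_zone.values():
            if victim in ids:
                ids.remove(victim)
        return victim

    def kill_zone(self, zone):
        """Chaos Gorilla: terminate every instance in one zone."""
        killed = self.instances_by_zone[zone]
        self.instances_by_zone[zone] = []
        return killed


fleet = Fleet({
    "zone-a": ["i-1", "i-2"],
    "zone-b": ["i-3", "i-4"],
    "zone-c": ["i-5", "i-6"],
})
rng = random.Random(0)  # seeded so a run is repeatable
victim = fleet.kill_random_instance(rng)
killed_in_zone = fleet.kill_zone("zone-c")
remaining = fleet.all_instances()
```

A service that was built to tolerate these failures should keep answering requests from whatever is left in `remaining`.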
53. Why Bake?
Traditional (Generic AMI Instance):
• launch OS
• install packages
• install app
Netflix (App AMI Instance):
• launch OS + app
54. Getting Baked
Perforce / Git: source, libraries
Jenkins: Ant targets, Ivy, Groovy all over
Build steps: sync, resolve, build, compile, report, publish, test
Artifactory: snapshot / release, libraries / apps
app bundles
55. Base Image Baking
Yum / Apt
Linux: CentOS, Fedora, Ubuntu
RPMs: Apache, Java...
ec2 slave instances
S3 / EBS
foundation AMI
base AMI
Bakery: mount, install, snapshot
Ready for app bake
AWS
57. app AMI
Linux Base AMI (CentOS or Ubuntu)
Java
Tomcat
Optional Apache
Monitoring
Log rotation to S3
monitoring
GC and thread dump logging
Application war file, base servlet, platform, interface jars for dependent services
Healthcheck, status servlets, JMX interface, Servo autoscale
58. Linux Base AMI (CentOS or Ubuntu)
Java
Tomcat
Optional Apache
Monitoring
Log rotation to S3
monitoring
GC and thread dump logging
Application war file, base servlet, platform, interface jars for dependent services
Healthcheck, status servlets, JMX interface, Servo autoscale
app AMI
Application war file
59. Linux Base AMI (CentOS or Ubuntu)
Java
JBoss
Optional Apache
Monitoring
Log rotation to S3
monitoring
GC and thread dump logging
Application war file, base servlet, platform, interface jars for dependent services
Healthcheck, status servlets, JMX interface, Servo autoscale
app AMI
60. Linux Base AMI (CentOS or Ubuntu)
Python
Bottle
Optional Apache
Monitoring
Log rotation to S3
monitoring
logging
Application file, base server, platform, interface libs for dependent services
app AMI
89. Central Event Gateway
• Parse raw alerts, match application to owner
• Add image captures and links to related graphs for easy mobile use
• Send to the right service based on priority
• Register the event in Chronos, the timeline application
• Correlate low priority alerts and generate new high priority alerts
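Three of the gateway steps above (parse and match an owner, route by priority, correlate low priority alerts) can be sketched as follows. Everything here is an assumption made for illustration: the `app|priority|message` alert format, the owner table, the routing targets, and the threshold of three. The slide specifies none of them.

```python
from collections import Counter

# Hypothetical owner table and routing targets; the slide names neither.
OWNERS = {"api": "api-team", "player": "playback-team"}
ROUTES = {"high": "pager", "low": "email"}
CORRELATION_THRESHOLD = 3  # assumed: this many low alerts for one app escalate


def parse(raw):
    """Parse an assumed 'app|priority|message' string and attach the owner."""
    app, priority, message = raw.split("|", 2)
    return {"app": app, "priority": priority, "message": message,
            "owner": OWNERS.get(app, "unknown")}


def correlate(alerts):
    """Create one high priority alert per app that has many low priority alerts."""
    low_counts = Counter(a["app"] for a in alerts if a["priority"] == "low")
    return [
        {"app": app, "priority": "high", "owner": OWNERS.get(app, "unknown"),
         "message": f"{count} correlated low priority alerts"}
        for app, count in low_counts.items() if count >= CORRELATION_THRESHOLD
    ]


def process(raw_alerts):
    """Return (destination, alert) pairs for parsed and correlated alerts."""
    alerts = [parse(raw) for raw in raw_alerts]
    alerts += correlate(alerts)
    return [(ROUTES[a["priority"]], a) for a in alerts]


routed = process([
    "api|low|latency up",
    "api|low|error rate up",
    "api|low|queue depth up",
    "player|high|playback failures",
])
```

In this run the three low priority `api` alerts go to email individually and also produce one new high priority alert that is routed to the pager.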
91. Metrics in Production
• 796B daily metric points
• Peaks at 1.4B / min
• 50% daily metric churn
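A quick back-of-the-envelope check on these figures: 796B points per day averages to roughly 553M per minute, so the stated peak of 1.4B per minute is about 2.5 times the average. The only inputs are the two numbers from the slide.

```python
DAILY_POINTS = 796e9       # 796B metric points per day (from the slide)
PEAK_PER_MINUTE = 1.4e9    # peak of 1.4B per minute (from the slide)
MINUTES_PER_DAY = 24 * 60

average_per_minute = DAILY_POINTS / MINUTES_PER_DAY
peak_to_average = PEAK_PER_MINUTE / average_per_minute

print(f"average: {average_per_minute / 1e6:.0f}M points/min")
print(f"peak is {peak_to_average:.1f}x the average")
```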
92. What is a metric?
com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US
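The example name appears to combine a dotted base name with several `key-value` segments (`geo-US`, `devtypid-101`, and so on). A small sketch that splits it apart, assuming that any segment containing a hyphen is such a dimension; that rule is an inference from this one example, not a documented naming scheme.

```python
METRIC = ("com.netflix.eds.nccp.successful.requests.uiversion."
          "nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US")


def split_metric(name):
    """Split a dotted metric name into a base name and key-value dimensions.

    Assumption: a segment containing '-' is a 'key-value' dimension.
    """
    base, dimensions = [], {}
    for segment in name.split("."):
        if "-" in segment:
            key, value = segment.split("-", 1)
            dimensions[key] = value
        else:
            base.append(segment)
    return ".".join(base), dimensions


base_name, dimensions = split_metric(METRIC)
```

Read this way, every new combination of device type, client version, UI version, and geography creates a distinct metric name.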
93. How we built it
• Built our own big data system
• Based on S3 and EMR
• Fewer copies, lower resolution, and slower retrieval as data ages
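The age-based policy in the last bullet can be sketched as a lookup over storage tiers. All thresholds, copy counts, and resolutions below are made up for illustration; the slide gives only the direction (older data gets fewer copies, coarser resolution, and slower retrieval).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_age_days: float      # tier applies to data up to this age
    copies: int              # number of stored copies
    resolution_seconds: int  # spacing between retained points
    retrieval: str           # relative retrieval speed


# All values below are hypothetical.
TIERS = [
    Tier(max_age_days=1, copies=3, resolution_seconds=60, retrieval="fast"),
    Tier(max_age_days=30, copies=2, resolution_seconds=300, retrieval="medium"),
    Tier(max_age_days=float("inf"), copies=1, resolution_seconds=3600, retrieval="slow"),
]


def tier_for(age_days):
    """Return the first tier whose age limit covers the given data age."""
    for tier in TIERS:
        if age_days <= tier.max_age_days:
            return tier
    # Only reachable when age_days is not comparable to a number (e.g. NaN).
    raise ValueError("age_days must be a number")


recent = tier_for(0.5)
old = tier_for(400)
```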
94. Self Serve is the Key
• Developers choose what metrics to submit
• What graphs they put on their dashboards
• What to alert on
101. Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? ???
102. Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? Change control?
103. Change control, the good
• Tells you what changed
• Tells you what’s about to change
• Great for coordination when one change gates another change
104. Change control, the bad
• It’s manual
• It expresses intent, not reality
• It forces you to serialize your changes to an extent
105. Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? Chronos
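Chronos is described in the Central Event Gateway slide as the timeline application where events are registered. The sketch below shows the kind of query that could answer "What changed?" during an outage. The `Timeline` class, its methods, and the event fields are hypothetical and are not Chronos's actual API.

```python
from datetime import datetime, timedelta


class Timeline:
    """Hypothetical event timeline: systems record changes as they happen."""

    def __init__(self):
        self._events = []

    def register(self, when, source, description):
        self._events.append({"when": when, "source": source, "description": description})

    def changes_before(self, incident_time, window):
        """Events in the window leading up to the incident, newest first."""
        start = incident_time - window
        hits = [e for e in self._events if start <= e["when"] <= incident_time]
        return sorted(hits, key=lambda e: e["when"], reverse=True)


timeline = Timeline()
incident = datetime(2014, 1, 1, 4, 0)
timeline.register(incident - timedelta(hours=6), "deploy", "api v41 pushed")
timeline.register(incident - timedelta(minutes=20), "config", "timeout changed")
timeline.register(incident - timedelta(minutes=5), "deploy", "api v42 pushed")

suspects = timeline.changes_before(incident, timedelta(hours=1))
```

Because the events are recorded by the systems making the changes, a timeline like this reflects what actually happened instead of what someone intended, and nobody has to serialize their changes to keep it accurate.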
107. (Some of) Netflix is open source:
https://netflix.github.io
Just a quick reminder...