- Netflix faced two operational challenges: accelerating innovation and sustaining quality at growing scale and complexity.
- Netflix adopted an approach of operational excellence through continuous improvement of operations management, design, and function to achieve greater quality and velocity.
- Netflix practices operations engineering by applying software engineering practices to operations to achieve operational excellence through automation, modular components, tools, and services.
22. Quality vs. Velocity
[Figure: Availability vs. Rate of Change. Y-axis: availability in nines (0 to 6); x-axis: rate of change (1, 10, 100, 1000).]
Availability and the corresponding downtime per year:
99.9999%: 31.5 seconds
99.999%: 5.26 minutes
99.99%: 52.56 minutes
99.9%: 8.76 hours
99%: 3.65 days
90%: 36.5 days
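The downtime attached to each availability level is a yearly budget and can be recomputed directly. A minimal sketch, assuming a 365-day year; the function names are illustrative and do not come from the talk:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s; assumes a non-leap year


def availability_from_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def annual_downtime_seconds(nines: int) -> float:
    """Seconds of downtime per year allowed at the given number of nines."""
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


if __name__ == "__main__":
    for n in range(1, 7):
        secs = annual_downtime_seconds(n)
        print(f"{n} nines ({availability_from_nines(n):.4%}): "
              f"{secs:,.1f} s/year = {secs / 86400:.4f} days/year")
```

Each additional nine divides the downtime budget by ten, so two nines (99%) allow about 3.65 days per year and six nines allow about 31.5 seconds.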
23. The Zero Sum Game
[Figure: Availability vs. Rate of Change, with the same axes and downtime scale as slide 22.]
25. Shifting the Curve
[Figure: Availability vs. Rate of Change. Y-axis: availability in nines (0 to 6; 90% to 99.9999%); x-axis: rate of change (1, 10, 100, 1000).]
26. Operational Excellence is the continuous improvement
of the management, design, and function of operational
environments to achieve greater quality, velocity, and
competitive advantage.
30. Operations Engineering is the application of software
engineering practices and principles to achieve and sustain
operational excellence.
• automation
• modular components
• tools & services
• best practices
33. Our Artisanal Past
Data Center
● Delayed provisioning
● Hand-crafted servers
● Variations and complexity
Delivery
● Late night, manual deployments
● Repeated mistakes
● Painful delays to production fixes
44. Anomaly Detection
• DES (double exponential smoothing) on time series data
• Predict the future based on history
• Favor recent history
• Threshold-based alerts
• 6-8 minute delay
[Figure callout: Alert!]
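The combination of smoothing, forecasting, and thresholding can be sketched as follows. This is a minimal illustration of double exponential smoothing (Holt's linear method), not Netflix's implementation; the function names, the smoothing factors `alpha` and `beta`, and the threshold are all assumptions chosen for the example:

```python
from typing import List, Sequence


def des_forecasts(series: Sequence[float],
                  alpha: float = 0.5,
                  beta: float = 0.3) -> List[float]:
    """Return one-step-ahead forecasts aligned with series[2:].

    alpha weights the newest observation in the level (higher values
    favor recent history); beta does the same for the trend.
    """
    if len(series) < 3:
        raise ValueError("need at least three points")
    # Initialise level and trend from the first two observations.
    level = series[1]
    trend = series[1] - series[0]
    forecasts = []
    for x in series[2:]:
        # Predict the next point from history before seeing it.
        forecasts.append(level + trend)
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def detect_anomalies(series: Sequence[float],
                     threshold: float,
                     alpha: float = 0.5,
                     beta: float = 0.3) -> List[int]:
    """Return indices where the observation deviates from its forecast
    by more than the (hypothetical) absolute threshold."""
    forecasts = des_forecasts(series, alpha, beta)
    return [i for i, (x, f) in enumerate(zip(series[2:], forecasts), start=2)
            if abs(x - f) > threshold]


if __name__ == "__main__":
    requests_per_second = [10, 12, 14, 16, 18, 40]  # made-up sample data
    print(detect_anomalies(requests_per_second, threshold=5.0))
```

Because the model is updated with every observation, a large spike also shifts the next few forecasts, which is one reason a threshold-based alert on smoothed data reacts with some delay.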
62. Chaos Engineering is the discipline of experimenting on
a distributed system in order to build confidence in the
system's capability to withstand turbulent conditions in
production.
63. Imagine a monkey loose in your data center…
[Diagram: Edge Cluster with Clusters A, B, C, and D.]
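The idea can be illustrated with a toy experiment: remove one randomly chosen instance from a cluster and check that the cluster still meets its capacity floor. This is a hedged sketch only; `Cluster`, `unleash_monkey`, and `min_healthy` are hypothetical names and do not reflect the real Chaos Monkey code or API:

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str
    instances: List[str]
    min_healthy: int  # assumed capacity floor for the experiment

    def is_healthy(self) -> bool:
        """The cluster passes if enough instances survive."""
        return len(self.instances) >= self.min_healthy


def unleash_monkey(cluster: Cluster, rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen instance; return its id, or None
    if the cluster has no instances left."""
    if not cluster.instances:
        return None
    victim = rng.choice(cluster.instances)
    cluster.instances.remove(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the experiment is repeatable
    cluster = Cluster("cluster-a", [f"i-{n:04d}" for n in range(6)], min_healthy=4)
    victim = unleash_monkey(cluster, rng)
    print(f"terminated {victim}; still healthy: {cluster.is_healthy()}")
```

In a real system the health check would be based on service metrics rather than an instance count, and replacement capacity would be launched automatically.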
64. A State of Xen – Chaos Monkey & Cassandra
Xen Hypervisor vulnerability – 9/25/14
218 out of 2700+ Cassandra nodes rebooted
22 did not reboot successfully
Automation handled the rest
65. Fault-Injection Testing (FIT)
[Diagram: Device, Internet, ELB, Zuul (Edge), and Services A, B, and C, with FIT.]
• Simulate service failures
• Override by device or account
• % of member traffic
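The three bullets above describe a targeting decision that could look roughly like this. The sketch is an assumption-laden illustration, not FIT's actual design: `FaultScenario`, `should_inject`, `call_service`, and the hash-based bucketing are all hypothetical:

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FaultScenario:
    target_service: str
    device_ids: Set[str] = field(default_factory=set)    # explicit device overrides
    account_ids: Set[str] = field(default_factory=set)   # explicit account overrides
    traffic_percent: float = 0.0                         # 0-100, share of member traffic


class InjectedFailure(Exception):
    """Raised in place of a real call when a failure is simulated."""


def _bucket(account_id: str) -> int:
    """Map an account id to a stable bucket in 0..9999 so the same
    member is consistently inside or outside the experiment."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def should_inject(scenario: FaultScenario, service: str,
                  device_id: str, account_id: str) -> bool:
    if service != scenario.target_service:
        return False
    if device_id in scenario.device_ids or account_id in scenario.account_ids:
        return True
    return _bucket(account_id) < round(scenario.traffic_percent * 100)


def call_service(scenario: FaultScenario, service: str,
                 device_id: str, account_id: str) -> str:
    """Stand-in for a service call that fails when the scenario applies."""
    if should_inject(scenario, service, device_id, account_id):
        raise InjectedFailure(f"FIT: simulated failure of {service}")
    return f"{service} ok"


if __name__ == "__main__":
    scenario = FaultScenario("service-b", device_ids={"test-device-1"})
    print(call_service(scenario, "service-a", "test-device-1", "acct-1"))
    try:
        call_service(scenario, "service-b", "test-device-1", "acct-1")
    except InjectedFailure as exc:
        print(exc)
```

Hashing the account id rather than drawing a random number per request keeps the affected population stable for the duration of an experiment.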
82. Session | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Availability: The New Kind of Innovator’s Dilemma | Coburn Watson | Wed @ 4:15pm | Marcello 4501B
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B
@