Successful companies, while focusing on their current customers' needs, often fail to embrace disruptive technologies and business models. This phenomenon, known as the "Innovator's Dilemma," eventually leads to many companies' downfall and is especially relevant in the fast-paced world of online services. In order to protect its leading position and grow its share of the highly competitive global digital streaming market, Netflix has to continuously increase the pace of innovation by constantly refining recommendation algorithms and adding new product features, while maintaining a high level of service uptime. The Netflix streaming platform consists of hundreds of microservices that are constantly evolving, and even the smallest production change may cause a cascading failure that can bring the entire service down. We face a new kind of Innovator's Dilemma, where product changes may not only disrupt the business model but also cause production outages that deny customers service access. This talk will describe various architectural, operational and organizational changes adopted by Netflix in order to reconcile rapid innovation with service availability.
2. @coburnw
• Cloud performance and reliability @ Netflix
• Reduce time-to-detect and time-to-resolve
• Optimize usage of AWS cloud
• Steer global user traffic and support failover
• Inject chaos into production environment
• Build innovative performance analysis tooling
• Drive operational best practice adoption
3. • 67M+ subscribers
• > 50 countries
• > 3 billion hours of video streamed monthly
• Massive cloud footprint
• Homegrown CDN
• Strong Originals slate
11. Infrastructure on Demand
• No procurement process
• “all you can eat” **
• Expose IaaS via Spinnaker
• No passwords, no keys
** please don’t eat all of it
13. Decouple Services
• µservice architecture (500+ @Netflix)
• One Auto Scaling group per service
• Independent push schedules (1 day to 4 weeks)
• Communicate via API
• Independent databases (280+ Cassandra clusters)
• Minimize aggregate rate of change
• Update code which needs updating…
14. Minimize Risks to Availability
“If everything seems under control, you're not going fast enough.”
― Mario Andretti
15. Maximize Infrastructure Stability
• Run on AWS
• Purchase 3-year EC2 Reserved Instances (for failover capacity as well)
• Distribute Auto Scaling groups across 3 Availability Zones per region
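The arithmetic behind spreading a service across three Availability Zones can be sketched as follows. This is only an illustration under the assumption that the goal is to keep serving peak load after losing one zone; the slide does not state a sizing rule, and the function name and numbers are hypothetical.

```python
import math


def instances_per_zone(peak_instances: int, zones: int = 3, zones_lost: int = 1) -> int:
    """Instances to run in each zone so the surviving zones still cover peak load.

    Assumption (not from the slides): load is spread evenly and the service
    must absorb the loss of `zones_lost` zones without scaling up first.
    """
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances / surviving)


# Hypothetical service that needs 90 instances at peak.
per_zone = instances_per_zone(90)
total = per_zone * 3
```

Under these assumptions, surviving a single-zone outage means provisioning each zone for half of peak, which is roughly 50% headroom overall.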
16. Propagate Changes Safely into Production
• Rolling regional “red-black” pushes
• Build pipelines & automated canary analysis
• 30-second time-to-detect on critical metrics
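A rolling regional red-black push can be pictured with the small in-memory model below. It is a sketch, not Netflix's or Spinnaker's implementation: the `Region` class, the `is_healthy` callback, and the region names are hypothetical, and a real push would create and disable server groups rather than swap strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Region:
    name: str
    active: str                                        # version taking traffic
    standby: List[str] = field(default_factory=list)   # disabled, kept for rollback


def red_black_push(
    regions: List[Region],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
) -> List[str]:
    """Push new_version one region at a time; stop at the first unhealthy region.

    Returns the names of the regions that were switched to new_version.
    """
    switched = []
    for region in regions:
        # The new version runs alongside the old one; traffic moves only
        # if the health check (hypothetical callback) passes.
        if not is_healthy(region.name, new_version):
            # The old version never stopped serving this region, so halting
            # here leaves it and all later regions untouched.
            break
        region.standby.append(region.active)  # keep old version for fast rollback
        region.active = new_version
        switched.append(region.name)
    return switched
```

Pushing region by region limits the blast radius of a bad build to one region, and retaining the previous version on standby makes rollback a traffic switch rather than a redeploy.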
17. Automated Canary Analysis
• Rigorous quality and performance checks are part of each code push
• Canary score is the gate for push
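One way to read "canary score is the gate for push" is sketched below. The slides do not say how the score is computed, so the scoring rule here (share of metrics where the canary stays within a tolerance of the baseline), the 10% tolerance, the 95-point threshold, and the metric names are all assumptions for illustration.

```python
from typing import Dict


def canary_score(
    baseline: Dict[str, float],
    canary: Dict[str, float],
    tolerance: float = 0.10,
) -> float:
    """Percentage of metrics where the canary is within `tolerance` of baseline."""
    if not baseline:
        raise ValueError("no metrics to compare")
    passed = 0
    for name, base in baseline.items():
        allowed = abs(base) * tolerance
        if abs(canary[name] - base) <= allowed:
            passed += 1
    return 100.0 * passed / len(baseline)


def gate_push(score: float, threshold: float = 95.0) -> bool:
    """Allow the push to proceed only if the canary score meets the threshold."""
    return score >= threshold


# Hypothetical metrics from a baseline cluster and a canary cluster.
baseline = {"error_rate": 1.0, "latency_ms": 100.0, "cpu": 50.0, "rps": 1000.0}
canary = {"error_rate": 1.05, "latency_ms": 130.0, "cpu": 52.0, "rps": 990.0}
score = canary_score(baseline, canary)
proceed = gate_push(score)
```

In this example the canary's latency deviates by 30% from baseline, so one of four metrics fails and the push is blocked.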
18. Cross-Service Resiliency
• Isolate misbehaving services
• Open “circuits” and provide fallback experiences
[Figure: normal (personalized) experience vs. degraded (unpersonalized) experience]
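The "open the circuit and fall back" idea can be sketched with a minimal circuit breaker. This is not the API of Hystrix or any other library; the class, its thresholds, and the example return values are hypothetical, and the clock is injectable only to make the behavior easy to exercise.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Serve a fallback while a dependency is failing, then retry after a pause."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the reset window has passed, let a trial call through.
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            return fallback()          # isolate the misbehaving service
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            return fallback()
        self.failures = 0              # a success closes the circuit
        self.opened_at = None
        return result
```

A caller would pass the personalized request as `primary` and an unpersonalized default (for example, a list of popular titles) as `fallback`, so that a failing dependency degrades the experience instead of failing the whole page.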
19. Improve Time-To-Detect
• 30-second alerts vs. 8 minutes previously
• Utilize streaming analysis infrastructure at the edge tier
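A sliding-window check over a stream of per-second samples gives a feel for how streaming analysis shortens time-to-detect. The sketch below is an assumption-laden illustration, not the edge-tier system described in the talk: the class name, the 30-sample window, and the 5% error-ratio threshold are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowAlert:
    """Alert when the error ratio over the last `window_s` samples is too high."""

    def __init__(self, window_s: int = 30, max_error_ratio: float = 0.05) -> None:
        # One (requests, errors) sample per second; old samples fall off the left.
        self.window: Deque[Tuple[int, int]] = deque(maxlen=window_s)
        self.max_error_ratio = max_error_ratio

    def observe(self, requests: int, errors: int) -> bool:
        """Record one second of traffic and return True if an alert should fire."""
        self.window.append((requests, errors))
        total = sum(r for r, _ in self.window)
        failed = sum(e for _, e in self.window)
        return total > 0 and failed / total > self.max_error_ratio


detector = SlidingWindowAlert()

# 30 seconds of healthy traffic: 1% errors, no alert.
healthy_alerts = [detector.observe(100, 1) for _ in range(30)]

# Then an outage starts: 50% errors each second.
seconds_to_alert = 0
while True:
    seconds_to_alert += 1
    if detector.observe(100, 50):
        break
```

Because each sample is evaluated as it arrives, a sharp regression trips the alert within a few samples instead of waiting for a batch aggregation interval.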
28. …but what about efficiency?
…That’s a separate talk altogether
29. Wrapping it Up
• “To the cloud” – a journey
• Abstract complexity via platform
• Don’t be afraid to break things
• Break things intentionally and frequently
• Invest in reliability to support increased innovation
• Hire top talent
30. Related Sessions
Talk | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B