Resilience and Security @ Scale – Lessons Learned
Jason Chan - chan@netflix.com
Netflix, Inc.


 “Netflix is the world’s leading Internet television network with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series . . .”

Source: http://ir.netflix.com
Me
 Director of Engineering @ Netflix
 Responsible for:
   Cloud app, product, infrastructure, ops security
 Previously:
   Led security team @ VMware
   Earlier, primarily security consulting at @stake, iSEC Partners
Netflix in the Cloud – Why?
Availability and the Move to Streaming
“Undifferentiated Heavy Lifting”
Netflix Culture




“may well be the most important document ever to come out of the Valley.”
                    Sheryl Sandberg, Facebook COO
Scale and Usage Curve
Netflix is now ~99% in the cloud
On the way to the cloud . . . (architecture)
On the way to the cloud . . . (organization)




                              (or NoOps, depending on definitions)
Some As-Is #s
  33m+ subscribers
  10,000s of systems
  100s of engineers, apps
  ~250 test deployments/day **
  ~70 production deployments/day **




    ** Sample based on one week’s activities
Common Approaches to Resilience
Common Controls to Promote Resilience
 Architectural committees
 Change approval boards
 Centralized deployments
 Vendor-specific, component-level HA
 Standards and checklists

 Designed to standardize on design patterns, vendors, etc.
 Problems for Netflix:
    Freedom and Responsibility Culture
    Highly aligned and loosely coupled
    Innovation cycles
Common Controls to Promote Resilience
 Architectural committees
 Change approval boards
 Centralized deployments
 Vendor-specific, component-level HA
 Standards and checklists

 Designed to control and de-risk change
 Focus on artifacts, test and rollback plans
 Problems for Netflix:
    Freedom and Responsibility Culture
    Highly aligned and loosely coupled
    Innovation cycles
Common Controls to Promote Resilience
 Architectural committees
 Change approval boards
 Centralized deployments
 Vendor-specific, component-level HA
 Standards and checklists

 Separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly)
 Problems for Netflix:
    Freedom and Responsibility Culture
    Highly aligned and loosely coupled
    Innovation cycles
Common Controls to Promote Resilience
 Architectural committees
 Change approval boards
 Centralized deployments
 Vendor-specific, component-level HA
 Standards and checklists

 High reliance on vendor solutions to provide HA and resilience
 Problems for Netflix:
    Traditional data center oriented systems do not translate well to the cloud
    Heavy use of open source
Common Controls to Promote Resilience
 Architectural committees
 Change approval boards
 Centralized deployments
 Vendor-specific, component-level HA
 Standards and checklists

 Designed for repeatable execution
 Problems for Netflix:
    Not suitable for load-based scaling and heavy automation
    Reliance on humans
Approaches to Resilience @ Netflix
What does the business value?
 Customer experience
 Innovation and agility
 In other words:
    Stability and availability for customer experience
    Rapid development and change to continually improve product and outpace competition
 Not that different from anyone else
 [Image: “Remember these guys?”]
Overall Approach
 Understand and solve for relevant failure modes
 Rely on automation and tools instead of committees for
  evaluating architecture and changes
 Make deployment easy and standardized
Cloud Application Failure Modes and Effects
Failure Mode          Probability   Current Mitigation
App Failure           High          Automated fallback response
AWS Region Failure    Low           Wait for recovery
AWS Zone Failure      Medium        Continue running in 2 of 3 zones
Datacenter Failure    Medium        Continue migrating to cloud
Data Store Failure    Low           Restore from S3
S3 Failure            Low           Restore from remote archive


   Risk-based approach given likely failures
   Tackle high-probability events first
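
The "tackle high-probability events first" rule can be sketched as a simple sort over the table. This is a minimal illustration, not Netflix tooling: the rows are copied from the slide, while the numeric ranking of High/Medium/Low and the names `FailureMode` and `prioritize` are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed ordering for illustration: a lower rank is tackled sooner.
PROBABILITY_RANK = {"High": 0, "Medium": 1, "Low": 2}


@dataclass(frozen=True)
class FailureMode:
    name: str
    probability: str
    mitigation: str


# Rows copied from the failure modes table.
FAILURE_MODES = [
    FailureMode("App Failure", "High", "Automated fallback response"),
    FailureMode("AWS Region Failure", "Low", "Wait for recovery"),
    FailureMode("AWS Zone Failure", "Medium", "Continue running in 2 of 3 zones"),
    FailureMode("Datacenter Failure", "Medium", "Continue migrating to cloud"),
    FailureMode("Data Store Failure", "Low", "Restore from S3"),
    FailureMode("S3 Failure", "Low", "Restore from remote archive"),
]


def prioritize(modes):
    """Order failure modes from most to least likely (stable within a level)."""
    return sorted(modes, key=lambda mode: PROBABILITY_RANK[mode.probability])


if __name__ == "__main__":
    for mode in prioritize(FAILURE_MODES):
        print(f"{mode.probability:<7} {mode.name:<20} {mode.mitigation}")
```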
Simian Army
Goals of Simian Army




“Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.”

http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
Chaos Monkey
 “By frequently causing failures, we force our services to
  be built in a way that is more resilient.”
 Terminates cluster nodes during business hours
 Rejects “If it ain’t broke, don’t fix it”
 Goals:
    Simulate random hardware failures, human error at small scale
    Identify weaknesses
    No service impact
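
The selection logic described on this slide can be sketched as follows. This is not the Chaos Monkey implementation: the 09:00 to 17:00 weekday window, the function names, and the instance IDs are assumptions for illustration, and the sketch only chooses a victim rather than calling any cloud API.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The window is an assumed default, not a documented Netflix setting.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances, now, rng=random):
    """Choose one instance to terminate, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


if __name__ == "__main__":
    cluster = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance IDs
    print(pick_victim(cluster, datetime.now()))
```

Restricting terminations to business hours means engineers are at their desks when a weakness surfaces, which fits the stated goal of identifying weaknesses without service impact.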
Chaos Gorilla
 Chaos Monkey’s bigger brother
 Standard deployment pattern is to distribute
  load/systems/data across three data centers (AZs)
 What happens if one is lost?
 Goals:
   Simulate data center loss, hardware/service failures at larger
    scale
   Identify weaknesses, dependencies, etc.
   Minimal service impact
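
The question "what happens if one is lost?" can be asked of a capacity plan before any failure is simulated. The sketch below is an assumption-laden illustration, not Chaos Gorilla itself: the zone names, instance counts, and the `survives_zone_loss` helper are hypothetical.

```python
def survives_zone_loss(instances_by_zone, required_instances):
    """For each zone, report whether the remaining zones still meet demand.

    instances_by_zone maps a zone name to its running instance count.
    required_instances is the number needed to carry the full load.
    """
    total = sum(instances_by_zone.values())
    return {
        zone: (total - count) >= required_instances
        for zone, count in instances_by_zone.items()
    }


if __name__ == "__main__":
    # Hypothetical cluster spread evenly across three zones.
    print(survives_zone_loss({"zone-a": 4, "zone-b": 4, "zone-c": 4}, 8))
```

An even spread across three zones with one third spare capacity passes for every zone, while a lopsided spread fails for the zone that holds too much of the cluster.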
Latency Monkey
 Distributed systems have many upstream/downstream
  connections
 How fault-tolerant are systems to dependency
  failure/slowdown?
 Goals:
   Simulate latencies and error codes, see how a service responds
   Survivable services regardless of dependencies
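
A minimal sketch of the idea, assuming a dependency is called through an ordinary Python function: wrap the call so that it is delayed or fails at a chosen rate, then check that the caller still produces an answer. `with_injected_faults`, `InjectedError`, and `call_with_fallback` are hypothetical names, not Latency Monkey APIs.

```python
import random
import time


class InjectedError(Exception):
    """Raised in place of a real dependency failure."""


def with_injected_faults(func, delay_seconds=0.0, error_rate=0.0,
                         rng=random, sleep=time.sleep):
    """Wrap func so each call is delayed and fails with probability error_rate."""
    def wrapper(*args, **kwargs):
        if delay_seconds > 0:
            sleep(delay_seconds)
        if rng.random() < error_rate:
            raise InjectedError("injected dependency failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(func, fallback):
    """Return func() or, if the dependency fails, a fallback value."""
    try:
        return func()
    except InjectedError:
        return fallback


if __name__ == "__main__":
    flaky = with_injected_faults(lambda: "live data", delay_seconds=0.1, error_rate=0.5)
    print(call_with_fallback(flaky, "cached data"))
```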
Conformity Monkey
 Without architecture review, how do you ensure designs
  leverage known successful patterns?
 Conformity Monkey provides automated analysis for
  pattern adherence
 Goals:
   Evaluate deployment modes (data center distribution)
   Evaluate health checks, discoverability, versions of key libraries
   Help ensure service has best chance of successful operation
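
Automated pattern checks of this kind can be sketched as a function that takes a description of a cluster and returns the rules it breaks. Everything here is assumed for illustration: the dictionary layout, the three-zone minimum (borrowed from the three-AZ deployment pattern mentioned under Chaos Gorilla), and the library names are not taken from Conformity Monkey.

```python
def conformity_violations(cluster, min_zones=3, min_library_versions=None):
    """Return a list of human-readable rule violations for one cluster.

    cluster is a dict with optional keys:
      "zones": list of zone names the cluster runs in
      "health_check_url": path or URL of its health check
      "libraries": dict of library name -> version tuple, e.g. (1, 4, 0)
    """
    violations = []

    zones = set(cluster.get("zones", []))
    if len(zones) < min_zones:
        violations.append(
            f"runs in {len(zones)} zone(s), expected at least {min_zones}"
        )

    if not cluster.get("health_check_url"):
        violations.append("no health check configured")

    libraries = cluster.get("libraries", {})
    for name, minimum in (min_library_versions or {}).items():
        installed = libraries.get(name)
        if installed is None or installed < minimum:
            violations.append(f"library {name} is missing or older than {minimum}")

    return violations


if __name__ == "__main__":
    cluster = {"zones": ["zone-a"], "libraries": {"base-server": (1, 0, 0)}}
    for problem in conformity_violations(
        cluster, min_library_versions={"base-server": (1, 2, 0)}
    ):
        print(problem)
```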
Non-Simian Approaches
 Org model
   Engineers write, deploy, support code
 Culture
   De-centralized with as few processes and rules as possible
   Lots of local autonomy
    “If you’re not failing, you’re not trying hard enough”
   Peer pressure
 Productive and transparent incident reviews
AppSec Challenges
Lots of Good Advice
  BSIMM
  Microsoft SDL
  SAFECode
But, what works?




  Forrester Consulting, 12/10
Especially, given phenomena such as DevOps,
cloud, agile, and the unique characteristics of an
                   organization?
Deploying Code at Netflix
A common graph @ Netflix
 [Graph annotations]
    Weekend afternoon ramp-up
    Lots of watching in prime time
    Not as much in early morning
 Old way - pay and provision for peak, 24/7/365
 Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
Solution: Load-Based Autoscaling
Autoscaling
 Goals:
   # of systems matches load requirements
   Load per server is constant
    Happens without intervention (the ‘auto’ in autoscaling)
 Results:
   Clusters continuously add & remove nodes
   New nodes must mirror existing
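
The first two goals amount to simple arithmetic: pick the instance count that keeps load per server at a fixed target. The sketch below assumes load is measured in a single unit such as requests per second; the function name and the minimum of one instance are illustrative choices, not AWS Auto Scaling parameters.

```python
import math


def desired_instances(total_load, target_load_per_instance, min_instances=1):
    """Instance count that keeps per-instance load at or below the target."""
    if target_load_per_instance <= 0:
        raise ValueError("target_load_per_instance must be positive")
    needed = math.ceil(total_load / target_load_per_instance)
    return max(min_instances, needed)


if __name__ == "__main__":
    # Hypothetical hourly load samples for one app, in requests per second.
    for load in (200, 1500, 4200, 900):
        print(load, "->", desired_instances(load, target_load_per_instance=100))
```

Because the count follows load up and down, nodes join and leave the cluster continuously, which is why every new node has to be launched from the same image as the existing ones.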
Every change requires a new cluster push
(not an incremental change to existing systems)
Deploying code must be easy
           (it is)
Netflix Deployment Pipeline
 [Diagram, left to right]
    Perforce/Git: code change, config change
    -> YUM: RPM with app-specific bits
    -> Bakery: base image + RPM
    -> AMI: VM template ready to launch
    -> ASG: cluster config, running systems
Operational Impact
 No changes to running systems
 No systems mgmt infrastructure (Puppet, Chef, etc.)
 Fewer logins to prod
 No snowflakes
 Trivial “rollback”
Security Impact
 Need to think differently on:
    Vulnerability management
    Patch management
    User activity monitoring
    File integrity monitoring
    Forensic investigations
Architecture, organization, deployment
            are all different.
         What about security?
We’ve adapted too.
Some principles we’ve found useful.
Cloud Application Security: What We Emphasize
Points of Emphasis
 Integrate
 Make the right way easy
 Self-service, with exceptions
 Trust, but verify

 Two contexts:
    1. Integration with your engineering ecosystem
    2. Integration of your security controls
 Organization
 SCM, build and release
 Monitoring and alerting




Integration: Base AMI Testing
 Base AMI – VM/instance template used for all cloud systems
      Average instance age = ~24 days (one-time sample)

 The base AMI is managed like other packages, via P4, Jenkins, etc.
 We watch the SCM directory & kick off testing when it changes
 Launch an instance of the AMI, perform vuln scan and other checks

 SCAN COMPLETED ALERT
 Site name: AMI1
 Stopped by: N/A
 Total Scan Time: 4 minutes 46 seconds
 Critical Vulnerabilities: 5
 Severe Vulnerabilities: 4
 Moderate Vulnerabilities: 4
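
The "watch the SCM directory and kick off testing when it changes" step can be sketched with a content fingerprint. This is an assumption for illustration, not how the Netflix tooling is built: a real setup would more likely use a Jenkins SCM trigger, and the names `directory_fingerprint` and `should_retest` are hypothetical.

```python
import hashlib
from pathlib import Path


def directory_fingerprint(root):
    """Hash every file path and its contents under root into one digest."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def should_retest(root, last_fingerprint):
    """Return (changed, current_fingerprint) for the watched directory."""
    current = directory_fingerprint(root)
    return current != last_fingerprint, current
```

A caller would store the returned fingerprint and, whenever `changed` is true, launch an instance of the new AMI and run the vulnerability scan against it.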
Integration: Control Packaging and Installation

  From the RPM spec file of a webserver:
 Requires:   ossec cloudpassage nflx-base-harden hyperguard-enforcer



 Pulls in the following RPMs:
    HIDS agent
    Config assessment/firewall agent
    Host hardening package
    WAF
Integration: Timeline (Chronos)
 What IP addresses have been blacklisted by the WAF in
  the last few weeks?
 GET /api/v1/event?timelines=type:blacklist&start=20130125000000000

 Which security groups have changed today?
 GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
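
The two queries above differ only in event type and start time, so a small helper can build them. This sketch assumes the `start` value is a 17-digit timestamp (year, month, day, hour, minute, second, milliseconds), which matches the examples but is not confirmed by the slides; `chronos_timestamp` and `event_query_path` are hypothetical names.

```python
from datetime import datetime
from urllib.parse import urlencode


def chronos_timestamp(moment):
    """Format a datetime as YYYYMMDDHHMMSS plus milliseconds (assumed format)."""
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


def event_query_path(event_type, start):
    """Build the request path for events of one type since a start time."""
    query = urlencode(
        {"timelines": f"type:{event_type}", "start": chronos_timestamp(start)},
        safe=":",  # keep the colon in "type:blacklist" unescaped, as in the examples
    )
    return f"/api/v1/event?{query}"


if __name__ == "__main__":
    print(event_query_path("blacklist", datetime(2013, 1, 25)))
    print(event_query_path("securitygroup", datetime(2013, 2, 6)))
```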
Points of Emphasis
 Integrate
 Make the right way easy
 Self-service, with exceptions
 Trust, but verify

 Developers are lazy
Making it Easy: Cryptex
 Crypto: DDIY (“Don’t Do It Yourself”)
 Many uses of crypto in web/distributed systems:
   Encrypt/decrypt (cookies, data, etc.)
   Sign/verify (URLs, data, etc.)
 Netflix also uses heavily for device activation, DRM
  playback, etc.
Making it Easy: Cryptex
 Multi-layer crypto system (HSM basis, scale out layer)
   Easy to use
   Key management handled transparently
   Access control and auditable operations
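
To make the sign/verify use case concrete, here is what a narrow, hard-to-misuse interface might look like from a caller's point of view. This is not Cryptex and does not reflect its API: it uses HMAC-SHA256 from the Python standard library as a stand-in, and it passes the key in directly, whereas the slides say Cryptex handles key management transparently.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Return a hex HMAC-SHA256 signature for message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), signature)


if __name__ == "__main__":
    demo_key = b"demo-key-for-illustration-only"  # hypothetical; never hard-code real keys
    url = b"/watch?title=12345&expires=1360108800"
    signature = sign(demo_key, url)
    print(verify(demo_key, url, signature))
    print(verify(demo_key, url + b"&admin=1", signature))
```

The point of the "Don't Do It Yourself" rule is that application developers call two functions like these and never choose algorithms, modes, or key storage themselves.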
Making it Easy: Cloud-Based SSO
 In the AWS cloud, access to data center services is
  problematic
   Examples: AD, LDAP, DNS
 But, many cloud-based systems require authN, authZ
   Examples: Dashboards, admin UIs
 Asking developers to securely handle/accept credentials
  is also problematic
Making it Easy: Cloud-Based SSO
 Solution: Leverage OneLogin SaaS SSO (SAML) used
  by IT for enterprise apps (e.g. Workday, Google Apps)
 Uses Active Directory credentials
 Provides a single & centralized login page
    Developers don’t accept username & password directly
 Built filter for our base server to make SSO/authN trivial
Points of Emphasis
 Integrate
 Make the right way easy
 Self-service, with exceptions
 Trust, but verify

 Self-service is perhaps the most transformative cloud characteristic
 Failing to adopt this for security controls will lead to friction
Self-Service: Security Groups
 Asgard cloud orchestration tool allows developers to
  configure their own firewall rules
 Limited to same AWS account, no IP-based rules
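
The two limits on self-service rules can be sketched as a validation function. The rule shape below is hypothetical and much simpler than a real AWS security group rule, and the check is an illustration of the policy stated on the slide, not Asgard code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngressRule:
    port: int
    source_group: Optional[str] = None    # name of the security group allowed in
    source_account: Optional[str] = None  # AWS account that owns source_group
    source_cidr: Optional[str] = None     # IP range, e.g. "10.0.0.0/8"


def self_service_allowed(rule, own_account):
    """True if a developer may add this rule without a security exception."""
    if rule.source_cidr is not None:
        return False  # IP-based rules are outside self-service
    return rule.source_group is not None and rule.source_account == own_account


if __name__ == "__main__":
    rule = IngressRule(port=7001, source_group="api", source_account="111111111111")
    print(self_service_allowed(rule, own_account="111111111111"))
```

Rules that fail the check are the "exceptions" in "self-service, with exceptions": they are still possible, but go through the security team.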
Points of Emphasis
 Integrate
 Make the right way easy
 Self-service, with exceptions
 Trust, but verify

 Culture precludes traditional “command and control” approach
 Organizational desire for agile, DevOps, CI/CD blur traditional security engagement touchpoints
Trust but Verify: Security Monkey
 Cloud APIs make verification and analysis of configuration and running state simpler
 Security Monkey created as the framework for this analysis
 Includes:
    Certificate checking
    Firewall analysis
    IAM entity analysis
    Limit warnings
    Resource policy analysis
Trust but Verify: Security Monkey




 From: Security Monkey
 Date: Wed, 24 Oct 2012 17:08:18 +0000
 To: Security Alerts
 Subject: prod Changes Detected

 Table of Contents:
    Security Groups
       Changed Security Group
          <sgname> (eu-west-1 / prod)
          <#Security Group/<sgname> (eu-west-1 / prod)>
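
A change report like the one above comes down to comparing two snapshots of configuration. The sketch below assumes each snapshot maps a security group name to a set of rule strings; the snapshot format and the `diff_security_groups` name are illustrative and not taken from Security Monkey.

```python
def diff_security_groups(previous, current):
    """Compare two snapshots of {group name: set of rules}.

    Returns sorted lists of added, removed, and changed group names.
    """
    previous_names = set(previous)
    current_names = set(current)
    return {
        "added": sorted(current_names - previous_names),
        "removed": sorted(previous_names - current_names),
        "changed": sorted(
            name
            for name in previous_names & current_names
            if previous[name] != current[name]
        ),
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken from the cloud API on two consecutive runs.
    yesterday = {"web": {"tcp/80 from elb"}, "db": {"tcp/3306 from web"}}
    today = {"web": {"tcp/80 from elb", "tcp/22 from bastion"}, "db": {"tcp/3306 from web"}}
    print(diff_security_groups(yesterday, today))
```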
Trust but Verify: Exploit Monkey
  AWS Autoscaling group is unit of deployment, so
   changes signal a good time to rerun dynamic scans

 On 10/23/12 12:35 PM, Exploit Monkey wrote:

 I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002.

 I'm starting a vulnerability scan against test app from these
 private/public IPs:
 10.29.24.174
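
The trigger described in this email can be sketched as a comparison of the ASG name currently serving each app against the name seen on the previous check. The data shapes and the `detect_asg_changes` helper are assumptions for illustration; the app and ASG names are taken from the email above.

```python
def detect_asg_changes(previous, current):
    """Map app name -> (old ASG name, new ASG name) for apps whose ASG changed.

    previous and current are dicts of {app name: current ASG name}.
    Apps seen for the first time are reported with an old name of None.
    """
    return {
        app: (previous.get(app), asg)
        for app, asg in current.items()
        if previous.get(app) != asg
    }


if __name__ == "__main__":
    last_seen = {"testapp-live": "testapp-live-v001"}
    now = {"testapp-live": "testapp-live-v002"}
    for app, (old, new) in detect_asg_changes(last_seen, now).items():
        print(f"{app} changed ASG from {old} to {new}; queueing a dynamic scan")
```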
Takeaways
  Netflix runs a large, dynamic service in AWS

  Newer concepts like cloud & DevOps need an
   updated approach to resilience and security

  Specific context can help jumpstart a pragmatic
   and effective security program
Netflix References
 http://netflix.github.com
 http://techblog.netflix.com
 http://slideshare.net/netflix
Other References
 http://www.webpronews.com/netflix-outage-angers-customers-2008-08
 http://www.pcmag.com/article2/0,2817,2395372,00.asp
 http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php
 http://bsimm.com/online/
 http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884
 http://www.slideshare.net/reed2001/culture-1798664
 http://techcrunch.com/2013/01/31/read-what-facebooks-sandberg-calls-maybe-the-most-important-document-ever-to-come-out-of-the-valley/
 http://www.gauntlt.org
Questions?




             chan@netflix.com

Resilience and Security @ Scale: Lessons Learned

  • 1. Resilience and Security @ Scale – Lessons Learned Jason Chan - chan@netflix.com
  • 2. Netflix, Inc. “Netflix is the world’s leading Internet television network with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series . . .” Source: http://ir.netflix.com
  • 3. Me: Director of Engineering @ Netflix. Responsible for: cloud app, product, infrastructure, ops security. Previously: led security team @ VMware; earlier, primarily security consulting at @stake, iSEC Partners
  • 4. Netflix in the Cloud – Why?
  • 5. Availability and the Move to Streaming
  • 7. Netflix Culture “may well be the most important document ever to come out of the Valley.” Sheryl Sandberg, Facebook COO
  • 9. Netflix is now ~99% in the cloud
  • 10. On the way to the cloud . . . (architecture)
  • 11. On the way to the cloud . . . (organization) (or NoOps, depending on definitions)
  • 12. Some As-Is #s: 33m+ subscribers; 10,000s of systems; 100s of engineers, apps; ~250 test deployments/day **; ~70 production deployments/day ** (** Sample based on one week’s activities)
  • 13. Common Approaches to Resilience
  • 14. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Architectural committees: designed to standardize on design patterns, vendors, etc. Problems for Netflix: Freedom and Responsibility culture; highly aligned and loosely coupled; innovation cycles
  • 15. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Change approval boards: designed to control and de-risk change; focus on artifacts, test and rollback plans. Problems for Netflix: Freedom and Responsibility culture; highly aligned and loosely coupled; innovation cycles
  • 16. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Centralized deployments: separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly). Problems for Netflix: Freedom and Responsibility culture; highly aligned and loosely coupled; innovation cycles
  • 17. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Vendor-specific, component-level HA: high reliance on vendor solutions to provide HA and resilience. Problems for Netflix: traditional data center oriented systems do not translate well to the cloud; heavy use of open source
  • 18. Common Controls to Promote Resilience (architectural committees; change approval boards; centralized deployments; vendor-specific, component-level HA; standards and checklists). Standards and checklists: designed for repeatable execution. Problems for Netflix: not suitable for load-based scaling and heavy automation; reliance on humans
  • 20. What does the business value? Customer experience; innovation and agility. (Remember these guys?) In other words: stability and availability for customer experience; rapid development and change to continually improve product and outpace competition. Not that different from anyone else
  • 21. Overall Approach: understand and solve for relevant failure modes; rely on automation and tools instead of committees for evaluating architecture and changes; make deployment easy and standardized
  • 22. Cloud Application Failure Modes and Effects
      Failure Mode        | Probability | Current Mitigation
      App Failure         | High        | Automated fallback response
      AWS Region Failure  | Low         | Wait for recovery
      AWS Zone Failure    | Medium      | Continue running in 2 of 3 zones
      Datacenter Failure  | Medium      | Continue migrating to cloud
      Data Store Failure  | Low         | Restore from S3
      S3 Failure          | Low         | Restore from remote archive
      Risk-based approach given likely failures; tackle high-probability events first
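The "tackle high-probability events first" ordering from this slide can be sketched in a few lines of Python. The rows are taken from the slide; the numeric ranks assigned to High/Medium/Low and the function name `prioritize` are assumptions made for illustration only.

```python
# Failure modes from the slide: (failure mode, probability, current mitigation).
FAILURE_MODES = [
    ("App Failure", "High", "Automated fallback response"),
    ("AWS Region Failure", "Low", "Wait for recovery"),
    ("AWS Zone Failure", "Medium", "Continue running in 2 of 3 zones"),
    ("Datacenter Failure", "Medium", "Continue migrating to cloud"),
    ("Data Store Failure", "Low", "Restore from S3"),
    ("S3 Failure", "Low", "Restore from remote archive"),
]

# Assumed ranking (not from the slide): a smaller number means "tackle sooner".
RANK = {"High": 0, "Medium": 1, "Low": 2}


def prioritize(modes):
    """Return failure modes ordered from most to least probable.

    sorted() is stable, so modes with equal probability keep their slide order.
    """
    return sorted(modes, key=lambda mode: RANK[mode[1]])


if __name__ == "__main__":
    for name, probability, mitigation in prioritize(FAILURE_MODES):
        print(f"{probability:<6} {name:<20} {mitigation}")
```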
  • 24. Goals of Simian Army: “Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.” http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
  • 25.
  • 26. Chaos Monkey: “By frequently causing failures, we force our services to be built in a way that is more resilient.” Terminates cluster nodes during business hours. Rejects “If it ain’t broke, don’t fix it”. Goals: simulate random hardware failures, human error at small scale; identify weaknesses; no service impact
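A minimal sketch of the selection logic this slide describes: pick a random cluster node to terminate, but only during business hours. This is not Netflix's implementation. The 09:00 to 17:00 weekday window, the function names, and the node names are all assumptions, and the sketch only chooses a victim; it makes no cloud API calls.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The 9-to-17 window is an assumed definition of "business hours".
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(cluster_nodes, now, rng=random):
    """Return one node to terminate, or None outside business hours
    or when the cluster has no nodes."""
    if not cluster_nodes or not in_business_hours(now):
        return None
    return rng.choice(cluster_nodes)


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical instance names
    print(pick_victim(nodes, datetime.now()))
```

Restricting terminations to business hours means engineers are at their desks when a weakness surfaces, which fits the slide's goal of identifying weaknesses with no service impact.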
  • 27. Chaos Gorilla: Chaos Monkey’s bigger brother. Standard deployment pattern is to distribute load/systems/data across three data centers (AZs). What happens if one is lost? Goals: simulate data center loss, hardware/service failures at larger scale; identify weaknesses, dependencies, etc.; minimal service impact
  • 28. Latency Monkey: distributed systems have many upstream/downstream connections. How fault-tolerant are systems to dependency failure/slowdown? Goals: simulate latencies and error codes, see how a service responds; survivable services regardless of dependencies
  • 29. Conformity Monkey: without architecture review, how do you ensure designs leverage known successful patterns? Conformity Monkey provides automated analysis for pattern adherence. Goals: evaluate deployment modes (data center distribution); evaluate health checks, discoverability, versions of key libraries; help ensure service has best chance of successful operation
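To illustrate what automated pattern-adherence analysis might look like, here is a small sketch that checks two of the properties the slide names: data center (zone) distribution and health checks. The `Cluster` type, its fields, and the rule wording are hypothetical; the three-zone expectation follows the deployment pattern described for Chaos Gorilla.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical, simplified description of a deployed cluster."""
    name: str
    zones: set = field(default_factory=set)
    has_health_check: bool = False


def conformity_findings(cluster, required_zones=3):
    """Return a list of human-readable findings; an empty list means conformant."""
    findings = []
    if len(cluster.zones) < required_zones:
        findings.append(
            f"{cluster.name}: runs in {len(cluster.zones)} zone(s), "
            f"expected {required_zones}"
        )
    if not cluster.has_health_check:
        findings.append(f"{cluster.name}: no health check configured")
    return findings


if __name__ == "__main__":
    # Zone names are illustrative placeholders.
    cluster = Cluster("example-app", {"zone-1", "zone-2"}, has_health_check=True)
    for finding in conformity_findings(cluster):
        print(finding)
```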
  • 30. Non-Simian Approaches. Org model: engineers write, deploy, support code. Culture: de-centralized with as few processes and rules as possible; lots of local autonomy; “If you’re not failing, you’re not trying hard enough”; peer pressure; productive and transparent incident reviews
  • 32. Lots of Good Advice: BSIMM; Microsoft SDL; SAFECode
  • 33. But, what works? Forrester Consulting, 12/10
  • 34. Especially, given phenomena such as DevOps, cloud, agile, and the unique characteristics of an organization?
  • 35. Deploying Code at Netflix
  • 36. A common graph @ Netflix [graph annotations: weekend afternoon ramp-up; lots of watching in prime time; not as much in early morning]. Old way: pay and provision for peak, 24/7/365. Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
  • 38. Autoscaling. Goals: # of systems matches load requirements; load per server is constant; happens without intervention (the ‘auto’ in autoscaling). Results: clusters continuously add & remove nodes; new nodes must mirror existing
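The first two goals (system count matches load, load per server stays constant) reduce to simple arithmetic. The sketch below shows it; the function name, the `min_nodes` floor, and the example numbers are assumptions, not part of the talk.

```python
import math


def desired_nodes(total_load, target_load_per_node, min_nodes=1):
    """Number of nodes needed so that no node exceeds the target load.

    total_load and target_load_per_node use the same unit
    (for example, requests per second).
    """
    if target_load_per_node <= 0:
        raise ValueError("target_load_per_node must be positive")
    return max(min_nodes, math.ceil(total_load / target_load_per_node))


if __name__ == "__main__":
    # Hypothetical load samples across a day: quiet morning, prime-time peak.
    for load in (120, 900, 2500):
        print(load, "->", desired_nodes(load, target_load_per_node=100))
```

Because the count rises and falls with load, nodes are added and removed continuously, which is why each new node has to be an exact copy of the existing ones.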
  • 39. Every change requires a new cluster push (not an incremental change to existing systems)
  • 40. Deploying code must be easy (it is)
  • 41. Netflix Deployment Pipeline [diagram]: Perforce/Git (code change, config change) → YUM (RPM with app-specific bits) → Bakery (base image + RPM) → AMI (VM template ready to launch) → ASG (cluster config) → running systems
  • 42. Operational Impact: no changes to running systems; no systems mgmt infrastructure (Puppet, Chef, etc.); fewer logins to prod; no snowflakes; trivial “rollback”
  • 43. Security Impact. Need to think differently on: vulnerability management; patch management; user activity monitoring; file integrity monitoring; forensic investigations
  • 44. Architecture, organization, deployment are all different. What about security?
  • 45. We’ve adapted too. Some principles we’ve found useful.
  • 46. Cloud Application Security: What We Emphasize
  • 47. Points of Emphasis (integrate; make the right way easy; self-service, with exceptions; trust, but verify). Integrate, in two contexts: 1. Integration with your engineering ecosystem; 2. Integration of your security controls. Organization; SCM, build and release; monitoring and alerting
  • 48. Integration: Base AMI Testing. Base AMI: VM/instance template used for all cloud systems. Average instance age = ~24 days (one-time sample). The base AMI is managed like other packages, via P4, Jenkins, etc. We watch the SCM directory & kick off testing when it changes. Launch an instance of the AMI, perform vuln scan and other checks. Sample alert: SCAN COMPLETED ALERT; Site name: AMI1; Stopped by: N/A; Total Scan Time: 4 minutes 46 seconds; Critical Vulnerabilities: 5; Severe Vulnerabilities: 4; Moderate Vulnerabilities: 4
  • 49. Integration: Control Packaging and Installation. From the RPM spec file of a webserver: Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer. Pulls in the following RPMs: HIDS agent; config assessment/firewall agent; host hardening package; WAF
  • 50. Integration: Timeline (Chronos). What IP addresses have been blacklisted by the WAF in the last few weeks? GET /api/v1/event?timelines=type:blacklist&start=20130125000000000 Which security groups have changed today? GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
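The two queries on this slide can be generated programmatically. The sketch below builds the same request paths; it assumes the `start` value is a 17-digit yyyyMMddHHmmssSSS timestamp, which is inferred from the slide's examples and not confirmed by any Chronos documentation. The function names are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode


def chronos_timestamp(moment):
    """Format a datetime as yyyyMMddHHmmssSSS (assumed from the slide's examples)."""
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


def event_query(event_type, start):
    """Build the request path for events of one type since `start`."""
    params = {
        "timelines": f"type:{event_type}",
        "start": chronos_timestamp(start),
    }
    # safe=":" keeps the colon in "type:blacklist" unescaped, as on the slide.
    return "/api/v1/event?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(event_query("blacklist", datetime(2013, 1, 25)))
    print(event_query("securitygroup", datetime(2013, 2, 6)))
```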
  • 51. Points of Emphasis (integrate; make the right way easy; self-service, with exceptions; trust, but verify). Make the right way easy: developers are lazy
  • 52. Making it Easy: Cryptex. Crypto: DDIY (“Don’t Do It Yourself”). Many uses of crypto in web/distributed systems: encrypt/decrypt (cookies, data, etc.); sign/verify (URLs, data, etc.). Netflix also uses it heavily for device activation, DRM playback, etc.
  • 53. Making it Easy: Cryptex. Multi-layer crypto system (HSM basis, scale out layer); easy to use; key management handled transparently; access control and auditable operations
  • 54. Making it Easy: Cloud-Based SSO. In the AWS cloud, access to data center services is problematic (examples: AD, LDAP, DNS). But many cloud-based systems require authN, authZ (examples: dashboards, admin UIs). Asking developers to securely handle/accept credentials is also problematic
  • 55. Making it Easy: Cloud-Based SSO. Solution: leverage OneLogin SaaS SSO (SAML) used by IT for enterprise apps (e.g. Workday, Google Apps). Uses Active Directory credentials; provides a single & centralized login page; developers don’t accept username & password directly; built filter for our base server to make SSO/authN trivial
  • 56. Points of Emphasis (integrate; make the right way easy; self-service, with exceptions; trust, but verify). Self-service, with exceptions: self-service is perhaps the most transformative cloud characteristic; failing to adopt this for security controls will lead to friction
  • 57. Self-Service: Security Groups. Asgard cloud orchestration tool allows developers to configure their own firewall rules. Limited to same AWS account, no IP-based rules
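The two limits on self-service firewall rules (same AWS account only, no IP-based rules) can be expressed as a small validation function. This is a sketch under assumptions: the `IngressRule` type, its fields, and the account IDs are hypothetical and do not reflect Asgard's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IngressRule:
    """Hypothetical, simplified firewall rule."""
    port: int
    source_group_account: Optional[str] = None  # account owning the source group
    source_cidr: Optional[str] = None           # e.g. "10.0.0.0/8" for IP-based rules


def is_self_service_allowed(rule, own_account):
    """Allow only group-to-group rules whose source is in the same account."""
    if rule.source_cidr is not None:
        return False  # IP-based rules need a security exception
    return rule.source_group_account == own_account


if __name__ == "__main__":
    own = "111111111111"  # placeholder account ID
    print(is_self_service_allowed(IngressRule(443, source_group_account=own), own))
    print(is_self_service_allowed(IngressRule(22, source_cidr="0.0.0.0/0"), own))
```

Rules that fail the check are the "exceptions" in "self-service, with exceptions": they would go to the security team for review.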
  • 58. Points of Emphasis (integrate; make the right way easy; self-service, with exceptions; trust, but verify). Trust, but verify: culture precludes traditional “command and control” approach; organizational desire for agile, DevOps, CI/CD blur traditional security engagement touchpoints
  • 59. Trust but Verify: Security Monkey. Cloud APIs make verification and analysis of configuration and running state simpler. Security Monkey created as the framework for this analysis. Includes: certificate checking; firewall analysis; IAM entity analysis; limit warnings; resource policy analysis
  • 60. Trust but Verify: Security Monkey. Sample alert email: From: Security Monkey; Date: Wed, 24 Oct 2012 17:08:18 +0000; To: Security Alerts; Subject: prod Changes Detected. Table of Contents: Security Groups Changed: Security Group <sgname> (eu-west-1 / prod) <#Security Group/<sgname> (eu-west-1 / prod)>
  • 61. Trust but Verify: Exploit Monkey. AWS Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans. Sample notification: On 10/23/12 12:35 PM, Exploit Monkey wrote: I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002. I'm starting a vulnerability scan against test app from these private/public IPs: 10.29.24.174
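The trigger described here, a changed ASG name signalling a new deployment, amounts to comparing two snapshots of "app to current ASG" mappings. The sketch below shows that comparison only; how the snapshots are collected and how the scan is launched are left out, and the function name is hypothetical.

```python
def changed_asgs(previous, current):
    """Return {app: (old_asg, new_asg)} for apps whose current ASG name changed.

    Apps that appear in only one snapshot are ignored in this sketch.
    """
    return {
        app: (previous[app], asg)
        for app, asg in current.items()
        if app in previous and previous[app] != asg
    }


if __name__ == "__main__":
    # Names taken from the sample notification on the slide.
    before = {"testapp-live": "testapp-live-v001"}
    after = {"testapp-live": "testapp-live-v002"}
    for app, (old, new) in changed_asgs(before, after).items():
        print(f"{app} changed ASG from {old} to {new}; a rescan is due")
```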
  • 62. Takeaways: Netflix runs a large, dynamic service in AWS; newer concepts like cloud & DevOps need an updated approach to resilience and security; specific context can help jumpstart a pragmatic and effective security program
  • 63. Netflix References: http://netflix.github.com http://techblog.netflix.com http://slideshare.net/netflix
  • 64. Other References: http://www.webpronews.com/netflix-outage-angers-customers-2008-08 http://www.pcmag.com/article2/0,2817,2395372,00.asp http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php http://bsimm.com/online/ http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884 http://www.slideshare.net/reed2001/culture-1798664 http://techcrunch.com/2013/01/31/read-what-facebooks-sandberg-calls-maybe-the-most-important-document-ever-to-come-out-of-the-valley/ http://www.gauntlt.org
  • 65. Questions? chan@netflix.com