Real World Cloud Application Security
1. Real World Cloud
Application Security
Lessons Learned Running Large Scale Systems
in the Public Cloud
Jason Chan - chan@netflix.com
2. Netflix, Inc.
“With more than 27 million
streaming members in the
United States, Canada, Latin
America, the United Kingdom
and Ireland, Netflix, Inc. is the
world's leading internet
subscription service for enjoying
movies and TV programs . . .”
Source: http://ir.netflix.com
3. Me
• Cloud Security Architect @ Netflix
• Responsible for:
• Cloud app, product, infrastructure, ops
security
• Previously:
• Led security team @ VMware
• Earlier, primarily security consulting at
@stake, iSEC Partners
17. On the way to the
cloud . . .
(or NoOps,
depending on definitions)
18. Some As-Is #s
• 27m+ subscribers
• 10,000s of systems
• 100s of engineers, apps
• ~250 test deployments/day *
• ~70 production deployments/day *
* Sample based on this week’s activities
24. A common graph @ Netflix
[Graph annotations: lots of watching in prime time; not as much in early morning]
• Old way - pay and provision for peak, 24/7/365
• Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
33. Autoscaling
• Goals:
• # of systems matches load requirements
• Load per server is constant
• Happens without intervention (the ‘auto’ in autoscaling)
• Results:
• Continuously adding/removing nodes
• New nodes must mirror existing
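The first two goals can be read as a sizing rule: pick the node count so that load per server stays at a fixed target. A minimal sketch of that arithmetic follows. The function name, the request rates, and the per-node target are all made up for illustration; this is not Netflix's actual scaling policy.

```python
import math


def desired_node_count(total_load: float, target_load_per_node: float,
                       min_nodes: int = 1, max_nodes: int = 1000) -> int:
    """Return how many nodes keep per-node load at or below the target."""
    if target_load_per_node <= 0:
        raise ValueError("target_load_per_node must be positive")
    needed = math.ceil(total_load / target_load_per_node)
    # Clamp to the configured bounds (hypothetical limits).
    return max(min_nodes, min(max_nodes, needed))


# Prime time versus early morning, with invented request rates.
prime_time = desired_node_count(total_load=90_000, target_load_per_node=500)
early_morning = desired_node_count(total_load=12_000, target_load_per_node=500)
```

With these invented numbers the cluster shrinks from 180 nodes to 24 overnight, which is why nodes are continuously added and removed.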
34. Every change requires a new
cluster push
(not an incremental change to
existing systems)
42. Netflix Deployment
Pipeline
Perforce/Git -> YUM -> Bakery -> AMI -> ASG
• Perforce/Git: code change, config change
• YUM: RPM file with app-specific bits
• Bakery: base image + RPM
• AMI: VM template ready to launch
• ASG: cluster config, running systems
43. Operational Impact
• No changes to running systems
• No systems management infrastructure
• Fewer logins to prod
• No snowflakes
• Trivial “rollback”
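Because every change is a new cluster rather than an edit to running systems, a push and a rollback can both be modeled as switching which cluster is active. The sketch below assumes the -vNNN suffix seen in the testapp-live-v001 / testapp-live-v002 names quoted later in the deck. The `push` and `rollback` helpers and the "standby" idea are hypothetical illustrations, not a description of Netflix tooling.

```python
import re


def next_asg_name(current: str) -> str:
    """Increment the -vNNN suffix, e.g. testapp-live-v001 -> testapp-live-v002."""
    match = re.fullmatch(r"(.+)-v(\d{3})", current)
    if match is None:
        raise ValueError(f"unexpected ASG name: {current!r}")
    base, version = match.group(1), int(match.group(2))
    return f"{base}-v{version + 1:03d}"


def push(active: str) -> dict:
    """Model a push: a new cluster becomes active, the old one is kept aside."""
    return {"active": next_asg_name(active), "standby": active}


def rollback(state: dict) -> dict:
    """Model a rollback: swap back to the previous, untouched cluster."""
    return {"active": state["standby"], "standby": state["active"]}


state = push("testapp-live-v001")
restored = rollback(state)
```

Nothing on the old cluster was modified by the push, so rolling back is only a pointer swap.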
44. Security Impact
• Need to think differently on:
• Vulnerability management
• Patch management
• User activity monitoring
• File integrity monitoring
• Forensic investigations
48. Base AMI Security
• AMI = Amazon Machine Image
• @ Netflix, all apps are based on “Base AMI”, and new pushes pick up the latest
• Concentrating testing and improvements here provides greatest impact
• Average age of running instance: 24 days*
• 60% of instances less than 1 week old*
* Based on one-time sampling (yesterday)
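The two sampled figures are simple statistics over instance launch times. A minimal sketch of how such a sample could be computed is below. The launch times are invented, and were chosen only so the output matches the slide's 24 days and 60%; they are not real data.

```python
from datetime import datetime, timedelta


def age_stats(launch_times, now):
    """Return (average age in days, fraction of instances under one week old)."""
    if not launch_times:
        raise ValueError("need at least one instance")
    ages = [(now - t).total_seconds() / 86400 for t in launch_times]
    average_days = sum(ages) / len(ages)
    under_a_week = sum(1 for age in ages if age < 7) / len(ages)
    return average_days, under_a_week


# Invented sample: five instances launched 1, 2, 3, 30 and 84 days ago.
now = datetime(2012, 10, 25)
launches = [now - timedelta(days=d) for d in (1, 2, 3, 30, 84)]
average_days, young_fraction = age_stats(launches, now)
```

A short average age means a fix in the base AMI reaches most of the fleet quickly, simply through normal pushes.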
51. Base AMI Testing
• The base AMI is managed like other packages, via P4, Jenkins, etc.
• We watch the base AMI’s SCM directory & kick off testing when it changes
• Launch an instance of the AMI, perform vuln scan and other checks
SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
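An alert like the one above can be turned into a pass/fail signal for the build. The sketch below parses the severity counts out of the alert text. The `passes_gate` policy (zero critical findings allowed by default) is a hypothetical example, not a rule stated in the talk.

```python
import re

ALERT = """SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4"""


def parse_alert(text: str) -> dict:
    """Extract the per-severity vulnerability counts from an alert body."""
    counts = {}
    pattern = r"^(\w+) Vulnerabilities: (\d+)$"
    for severity, count in re.findall(pattern, text, re.MULTILINE):
        counts[severity.lower()] = int(count)
    return counts


def passes_gate(counts: dict, max_critical: int = 0) -> bool:
    """Hypothetical policy: reject the AMI if critical findings exceed the limit."""
    return counts.get("critical", 0) <= max_critical


counts = parse_alert(ALERT)
```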
52. Security Packaging
• All security tools use the same toolchain as
the rest of engineering (P4/Git, Jenkins, etc.)
53. • From the RPM spec file of a webserver:
Requires: ossec cloudpassage nflx-base-harden
hyperguard-enforcer
54. • Pulls in the following RPMs:
• Host hardening package
• WAF agent
• OSSEC (HIDS agent)
• CloudPassage (config assessment,
FW, etc.)
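Since the security agents arrive as ordinary package dependencies, a build check can confirm that a spec file still requires them. A minimal sketch follows, using the four package names from the slide. The check itself is an assumption for illustration, and it treats any version constraints on a Requires line as plain tokens rather than parsing them.

```python
REQUIRED_SECURITY_PACKAGES = {
    "ossec", "cloudpassage", "nflx-base-harden", "hyperguard-enforcer",
}


def parse_requires(spec_text: str) -> set:
    """Collect whitespace- or comma-separated tokens from every 'Requires:' line."""
    packages = set()
    for line in spec_text.splitlines():
        if line.strip().startswith("Requires:"):
            deps = line.split(":", 1)[1]
            packages.update(deps.replace(",", " ").split())
    return packages


def missing_security_packages(spec_text: str) -> set:
    """Return the required security packages that the spec does not pull in."""
    return REQUIRED_SECURITY_PACKAGES - parse_requires(spec_text)


good_spec = "Name: webserver\nRequires: ossec cloudpassage nflx-base-harden hyperguard-enforcer\n"
bad_spec = "Name: webserver\nRequires: ossec\n"
```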
55. Static Analysis
• Available self-service through build
environment (FindBugs, PMD)
• Jenkins (CI) plugin to display graphs and
support drill through to results
59. Central Alerting
Gateway
• A single place to generate alerts
• Python, Java libraries (or JSON POST) to easily
alert on events of interest
• Ties in to PagerDuty notification system
• Allows for stateful alerting and some
response
• A prerequisite that our tools will leverage
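The talk does not show the gateway's message format, so the following is only a sketch of what a JSON alert body might look like. Every field name and the severity vocabulary are assumptions, and the function only builds the document rather than sending it anywhere.

```python
import json
from datetime import datetime, timezone

ALLOWED_SEVERITIES = {"info", "warning", "critical"}  # hypothetical vocabulary


def build_alert(source: str, severity: str, message: str, when: datetime) -> str:
    """Serialize an alert as a JSON document ready to POST to a gateway."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    payload = {
        "source": source,
        "severity": severity,
        "message": message,
        # Expects a timezone-aware datetime; normalized to UTC.
        "timestamp": when.astimezone(timezone.utc).isoformat(),
    }
    return json.dumps(payload, sort_keys=True)


body = build_alert(
    "security-monkey", "critical", "prod Changes Detected",
    datetime(2012, 10, 24, 17, 8, 18, tzinfo=timezone.utc),
)
```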
61. Chronos
• Timeline system (API and UI) with Java/Python
libraries, or JSON POST
• Track config changes, deployments, etc.
• Security tools also leverage for tracking and
analysis
62. Chronos Security
Examples
• What IP addresses have been blacklisted by
the WAF in the last few weeks?
GET /api/v1/event?timelines=type:blacklist&start=20121012000000000
• Which security groups have changed today?
GET /api/v1/event?timelines=type:securitygroup&start=20121024000000000
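The two queries above differ only in event type and start time, so they can be generated by a small helper. This sketch assumes the 17-digit `start` value is year, month, day, hour, minute, second, and milliseconds; that reading fits the examples but is not confirmed by the slides. The helper names are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode


def chronos_timestamp(moment: datetime) -> str:
    """Format as yyyymmddHHMMSS plus three millisecond digits (assumed layout)."""
    millis = moment.microsecond // 1000
    return moment.strftime("%Y%m%d%H%M%S") + f"{millis:03d}"


def event_query(event_type: str, start: datetime) -> str:
    """Build the request path and query string for an event lookup."""
    params = {
        "timelines": f"type:{event_type}",
        "start": chronos_timestamp(start),
    }
    # Keep ':' unescaped so the output matches the form shown on the slide.
    return "/api/v1/event?" + urlencode(params, safe=":")


blacklist_query = event_query("blacklist", datetime(2012, 10, 12))
sg_query = event_query("securitygroup", datetime(2012, 10, 24))
```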
64. Cryptex
• Many uses of crypto in web/distributed
systems:
• Encrypt/decrypt (cookies, data, etc.)
• Sign/verify (URLs, data, etc.)
• Known as an area where developers should
not DIY
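To make the sign/verify use case concrete, here is a minimal sketch of signing a URL with an HMAC from the Python standard library. It relies on vetted primitives rather than hand-rolled crypto, it is not the Cryptex API, and the hard-coded key and example URL are for illustration only; key management is exactly the part a central service is meant to take over.

```python
import hashlib
import hmac


def sign(key: bytes, message: str) -> str:
    """Return a hex HMAC-SHA256 tag for the message."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(key: bytes, message: str, signature: str) -> bool:
    """Check a tag using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), signature)


demo_key = b"illustration-only-key"  # never hard-code real keys
url = "https://example.com/play?title=123"
tag = sign(demo_key, url)
```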
65. • Multi-layer crypto system (HSM basis, scale
out layer)
• Easy for developers to use
• Key management handled transparently
• Access control and auditable operations
ICipherContext cipherContext =
CryptexClientFactory.getCipherContext(KeySet.testkey);
// encryption
String cipherText = cipherContext.encrypt("NETFLIX");
// decryption
String plainText = cipherContext.decrypt(cipherText);
66. Cloud SSO
• Authenticated access to dashboards, admin
apps in the cloud is problematic
• No datacenter access, no LDAP, AD
67. Cloud SSO
• Solution - leverage OneLogin SaaS SSO
option (SAML) used by IT for enterprise
apps
• Built filter that integrates with our platform
web server to make SSO/authentication
trivial
69. Culture of ‘freedom and
responsibility’ precludes
traditional centralized,
command and control approach
70. Security Monkey
• Cloud APIs make verification and analysis of configuration & running state simpler
• Security Monkey created as the framework for this analysis
• Includes:
• Cert checking
• Firewall analysis
• IAM entity analysis
• Limit warnings
71. Security Monkey
From: Security Monkey
Date: Wed, 24 Oct 2012 17:08:18 +0000
To: Security Alerts
Subject: prod Changes Detected
Table of Contents:
Security Groups
Changed Security Group
<sgname> (eu-west-1 / prod)
<#Security Group/<sgname> (eu-west-1 / prod)>
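A change report like this one boils down to comparing two snapshots of configuration. The sketch below diffs security group rules between a previous and a current snapshot. The snapshot shape (group name mapped to a set of rule strings), the rule notation, and the group names are assumptions for illustration, not Security Monkey's actual data model.

```python
def diff_security_groups(previous: dict, current: dict) -> dict:
    """Compare two snapshots mapping group name -> set of rule strings."""
    changes = {}
    for name in sorted(previous.keys() | current.keys()):
        before = previous.get(name, set())
        after = current.get(name, set())
        if before != after:
            changes[name] = {
                "added": sorted(after - before),
                "removed": sorted(before - after),
            }
    return changes


# Invented snapshots; rule strings are a made-up notation.
yesterday = {"web": {"tcp/80 from elb"}, "db": {"tcp/3306 from web"}}
today = {"web": {"tcp/80 from elb", "tcp/22 from bastion"}, "db": {"tcp/3306 from web"}}
report = diff_security_groups(yesterday, today)
```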
72. Exploit Monkey
• Autoscaling group is unit of deployment, so
changes signal a good time to rerun
dynamic scans
On 10/23/12 12:35 PM, Exploit Monkey wrote:
I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002.
I'm starting a vulnerability scan against test app from these private/public IPs:
10.29.24.174
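The trigger described in the email is a comparison of each app's current ASG name against the last one seen. A minimal sketch of that detection step is below. The mapping shape and function name are hypothetical, and starting the scan itself is left out.

```python
def changed_asgs(previous: dict, current: dict) -> list:
    """Return (app, old_asg, new_asg) for every app whose active ASG name changed."""
    changes = []
    for app, new_name in sorted(current.items()):
        old_name = previous.get(app)
        if old_name is not None and old_name != new_name:
            changes.append((app, old_name, new_name))
    return changes


last_seen = {"testapp-live": "testapp-live-v001", "otherapp": "otherapp-v007"}
now_seen = {"testapp-live": "testapp-live-v002", "otherapp": "otherapp-v007"}
to_scan = changed_asgs(last_seen, now_seen)
```

Each tuple in `to_scan` marks a fresh deployment, which is the moment a dynamic scan is most useful.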
73. ELB Checker (gauntlt)
• AWS’ Elastic Load Balancer (ELB) provides cross-datacenter
traffic balancing, but no security controls (if your
cluster is attached to an ELB, it is available to the Internet)
• Engineers may misunderstand use cases for ELBs,
security features, and/or other measures that can
be used to protect ELB-fronted clusters
74. Solution: gauntlt Testing
1. Launch gauntlt test runner
instance, loaded with “master
list” of ELBs and expected state
2. Determine “target list” of
current ELBs to evaluate
3. Generate per-ELB listener
gauntlt attack files
4. Execute attacks
5. Alert on failures and new ELBs
6. Triage findings and update ELB
master list
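Steps 2 and 5 amount to comparing what is currently deployed against the master list. The sketch below shows only that comparison. It does not invoke gauntlt, and it stands in for attack results with a set of reachable ports per ELB, which is an assumption made for illustration, as are the ELB names.

```python
def evaluate_elbs(master: dict, observed: dict) -> dict:
    """Compare ELB name -> set of reachable ports against the expected state."""
    new_elbs = sorted(observed.keys() - master.keys())
    failures = {}
    for name in sorted(observed.keys() & master.keys()):
        unexpected = observed[name] - master[name]
        if unexpected:
            failures[name] = sorted(unexpected)
    return {"new_elbs": new_elbs, "failures": failures}


# Invented example data.
master_list = {"api-frontend": {443}, "web-frontend": {80, 443}}
target_list = {"api-frontend": {443, 8080}, "web-frontend": {80, 443}, "debug-elb": {7001}}
findings = evaluate_elbs(master_list, target_list)
```

After triage, an accepted finding would be folded back into the master list so that it stops alerting (step 6).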
76. AWS Security Groups
• Asgard cloud orchestration
tool allows developers to
configure their own firewall
rules
• Limited to same-account
groups, no IP-based rules
• Handles 95% of
requirements, JIRAs for
additional changes, and
Security Monkey to keep an
eye on things
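The self-service restriction can be expressed as a small validation rule: a rule is allowed only if its source is a security group in the same account, and anything IP-based goes through a ticket. This is a sketch under assumed field names (`cidr`, `source_group`, `source_account`) and invented account numbers; it is not Asgard's actual implementation.

```python
def self_service_allowed(rule: dict, own_account: str) -> bool:
    """Allow only rules whose source is a security group in the same account."""
    if rule.get("cidr") is not None:
        return False  # IP-based rules need a manual request instead
    return (
        rule.get("source_group") is not None
        and rule.get("source_account") == own_account
    )


same_account = {"source_group": "web", "source_account": "111111111111", "port": 8080}
cross_account = {"source_group": "web", "source_account": "222222222222", "port": 8080}
ip_based = {"cidr": "0.0.0.0/0", "port": 22}
```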
77. Takeaways
• Netflix runs a large, dynamic service in AWS
• Good guidance + specific context can help
jumpstart a pragmatic security program
• Newer concepts like cloud & DevOps need
updated approach to security
• Don’t swim upstream - integrate and
collaborate with your engineering partners