Tuesday, October 11, 2011
Background and Disclaimers
• No cloud definitions, but . . .
• Focus on IaaS
• Netflix uses Amazon Web Services
• Guidance should be generally applicable
• Works in progress, still many problems to solve . . .
Data Center
Netflix could not build data centers fast enough
Capacity requirements accelerating, unpredictable
Product launch spikes: iPhone, Wii, PS3, Xbox
Outgrowing Data Center
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
Netflix API: Growth in Requests
37x growth, January 2010 to January 2011
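To put 37x in perspective: assuming the growth compounded evenly over the twelve months from January 2010 to January 2011 (an assumption; the chart does not give monthly figures), it works out to roughly 35% more API requests every month. A minimal Python sketch of that arithmetic; `monthly_growth_rate` is a hypothetical helper name:

```python
def monthly_growth_rate(total_multiple: float, months: int) -> float:
    """Constant month-over-month growth rate that compounds to total_multiple."""
    # (1 + r) ** months == total_multiple  =>  r = total_multiple ** (1 / months) - 1
    return total_multiple ** (1 / months) - 1


if __name__ == "__main__":
    rate = monthly_growth_rate(37, 12)
    print(f"Average monthly growth: {rate:.1%}")
```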
Data Center Patterns
• Long-lived, non-elastic systems
• Push code and config to running systems
• Difficult to enforce deployment patterns
• ‘Snowflake phenomenon’: every server ends up configured slightly differently
• Difficult to sync or reproduce environments (e.g., test and prod)
Cloud Patterns
• Ephemeral nodes
• Dynamic scaling
• Hardware is abstracted
• Orchestration rather than manual steps
• Trivial to clone environments
When Moving to the Cloud, Leave Old Ways Behind . . .
A generic “forklift” migration is generally a mistake
Adapt development, deployment, and management models appropriately
Netflix Build and Deploy
http://techblog.netflix.com/2011/08/building-with-legos.html
• Perforce: SCM
• Jenkins: Continuous Integration
• Artifactory: Binary Repository
• Yum: App-Specific Packages and Configuration
• Bakery: Combine Base and App-Specific Configuration
• AMI: Customized, Cloud-Ready Image
• ASG: Dynamic Scaling
• Instance: Live System!
Every change is a new push
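“Every change is a new push” implies that code or configuration is not patched into a running instance; a change produces a newly baked image, which is then launched. The Python sketch below models only the Bakery step under that assumption. `Image`, `bake`, and the content-hash image ID are hypothetical stand-ins, not Netflix's tooling and not the real AMI ID format.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for a baked, cloud-ready image (an AMI in the pipeline)."""
    image_id: str
    packages: Tuple[str, ...]


def bake(base_packages: Iterable[str], app_packages: Iterable[str]) -> Image:
    """Combine base and app-specific packages into a new immutable image.

    The ID is derived from the package list, so changing any package
    (code or config) yields a different image instead of altering an old one.
    """
    packages = tuple(sorted(set(base_packages) | set(app_packages)))
    digest = hashlib.sha256("\n".join(packages).encode("utf-8")).hexdigest()[:12]
    return Image(image_id=f"img-{digest}", packages=packages)


if __name__ == "__main__":
    v1 = bake(["base-os", "java"], ["myapp-1.0", "myapp-config-1"])
    v2 = bake(["base-os", "java"], ["myapp-1.0", "myapp-config-2"])
    # A config-only change still produces a new image, and therefore a new push.
    print(v1.image_id != v2.image_id)
```

Because `Image` is frozen, assigning to its fields raises an error, which mirrors the rule that running systems are replaced rather than edited in place.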
Common Challenges for Security Engineers
• Lots of data from different sources, in different formats
• Too many administrative interfaces and disconnected systems
• Too few options for scalable automation
How do you . . .
• Add a user account? → CreateUser()
• Inventory systems? → DescribeInstances()
• Change a firewall config? → AuthorizeSecurityGroupIngress()
• Snapshot a drive for forensic analysis? → CreateSnapshot()
• Disable a multi-factor authentication token? → DeactivateMFADevice()
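Each of these tasks is a single call against the provider's API, so routine security work can be scripted. A minimal Python sketch, assuming the present-day boto3 SDK (which postdates this talk). The functions take an already-constructed client, e.g. `boto3.client("iam")` or `boto3.client("ec2")`, so they can also be exercised with a stub. The helper names are hypothetical; the client methods (`create_user`, `describe_instances`, `authorize_security_group_ingress`, `create_snapshot`, `deactivate_mfa_device`) are boto3's wrappers for the API actions listed above.

```python
from typing import Any, Dict, List


def add_user(iam: Any, user_name: str) -> Dict[str, Any]:
    """Add a user account: IAM CreateUser."""
    return iam.create_user(UserName=user_name)


def inventory_instance_ids(ec2: Any) -> List[str]:
    """Inventory systems: EC2 DescribeInstances, following NextToken pagination."""
    instance_ids: List[str] = []
    kwargs: Dict[str, Any] = {}
    while True:
        page = ec2.describe_instances(**kwargs)
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                instance_ids.append(instance["InstanceId"])
        token = page.get("NextToken")
        if not token:
            return instance_ids
        kwargs = {"NextToken": token}


def allow_tcp_port(ec2: Any, group_id: str, port: int, cidr: str) -> Dict[str, Any]:
    """Change a firewall config: EC2 AuthorizeSecurityGroupIngress for one TCP port."""
    return ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }],
    )


def snapshot_for_forensics(ec2: Any, volume_id: str, case_id: str) -> Dict[str, Any]:
    """Snapshot a drive for forensic analysis: EC2 CreateSnapshot."""
    return ec2.create_snapshot(VolumeId=volume_id, Description=f"forensics case {case_id}")


def disable_mfa(iam: Any, user_name: str, serial_number: str) -> Dict[str, Any]:
    """Disable a multi-factor authentication token: IAM DeactivateMFADevice."""
    return iam.deactivate_mfa_device(UserName=user_name, SerialNumber=serial_number)
```

With real credentials, usage would look like `inventory_instance_ids(boto3.client("ec2"))`; injecting the client keeps the security logic separate from credential handling.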
Security Monkey
http://techblog.netflix.com/2011/07/netflix-simian-army.html
Analyzing Traditional Firewalls
• Positioned at network chokepoints, providing optimal internetwork visibility
• Use tools like tcpdump, NetFlow, and centralized logging to gather data
• Review traffic patterns and optimize rules
AWS Firewalls (Briefly)
• The “security group” is the basic unit of firewalling
• Policy-driven and network-agnostic: configuration follows an instance
• The network diagram is irrelevant
• Chokepoints and sniffing are not possible
• Outbound connections are not filterable (!)
Security Group Analysis
• Use config and inventory to map reachability
• Leverage APIs to evaluate reachability and detect violations:
  • Security groups with no members
  • “Insecure” services (e.g., Telnet, FTP)
  • Rules that use the “any” keyword
• Visualize the config as a data flow diagram
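The three violation checks can be run directly against API output. A minimal Python sketch, assuming the response shapes of EC2 DescribeSecurityGroups and DescribeInstances (as returned by boto3's `describe_security_groups()` and `describe_instances()`), and assuming that an “any” source corresponds to the CIDR `0.0.0.0/0`. Membership here counts only instances; other resources that can use a security group are ignored. `find_violations` and the port table are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

# Services this sketch treats as "insecure": FTP and Telnet (both TCP).
INSECURE_TCP_PORTS = {21: "FTP", 23: "Telnet"}
# Assumption: an "any" source corresponds to the CIDR 0.0.0.0/0.
ANY_CIDR = "0.0.0.0/0"


def _covers_tcp_port(rule: Dict[str, Any], port: int) -> bool:
    """True if an ingress rule admits the given TCP port."""
    protocol = rule.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and ports
        return True
    if protocol not in ("tcp", "6"):
        return False
    return rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)


def find_violations(security_groups: Iterable[Dict[str, Any]],
                    instances: Iterable[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (group ID, finding) pairs for the three checks listed above."""
    in_use: Set[str] = {
        group["GroupId"]
        for instance in instances
        for group in instance.get("SecurityGroups", [])
    }
    findings: List[Tuple[str, str]] = []
    for group in security_groups:
        group_id = group["GroupId"]
        if group_id not in in_use:
            findings.append((group_id, "no members"))
        for rule in group.get("IpPermissions", []):
            for port, service in INSECURE_TCP_PORTS.items():
                if _covers_tcp_port(rule, port):
                    findings.append((group_id, f"allows {service} (tcp/{port})"))
            sources = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
            if ANY_CIDR in sources:
                findings.append((group_id, f"open to {ANY_CIDR}"))
    return findings
```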
Connectivity Analysis
• Reachability shows what “can” communicate
• But what is actually communicating?
• Take the same approach: leverage the firewall and inventory APIs and combine them with host data
• Visualize the data as a connectivity diagram
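Comparing what “can” communicate with what actually does might look like the following Python sketch. It assumes security group rules that grant TCP access to other groups (the `UserIdGroupPairs` field of DescribeSecurityGroups output), inventory records carrying `PrivateIpAddress` and `SecurityGroups` as in DescribeInstances output, and host data already reduced to hypothetical `(source IP, destination IP, destination port)` tuples, for example collected with netstat on each host. All function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

Rule = Tuple[str, str, int, int]  # (source group, destination group, from port, to port)
Connection = Tuple[str, str, int]  # (source IP, destination IP, destination port)


def allowed_rules(security_groups: Iterable[Dict[str, Any]]) -> List[Rule]:
    """Reachability: TCP rules that let one security group reach another."""
    rules: List[Rule] = []
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            if perm.get("IpProtocol") != "tcp":
                continue
            for pair in perm.get("UserIdGroupPairs", []):
                rules.append((pair["GroupId"], group["GroupId"], perm["FromPort"], perm["ToPort"]))
    return rules


def groups_by_ip(instances: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Inventory: map each private IP to the security groups of its instance."""
    return {
        inst["PrivateIpAddress"]: [g["GroupId"] for g in inst.get("SecurityGroups", [])]
        for inst in instances
        if inst.get("PrivateIpAddress")
    }


def compare(security_groups: Iterable[Dict[str, Any]],
            instances: Iterable[Dict[str, Any]],
            connections: Iterable[Connection]) -> Tuple[List[Rule], List[Rule]]:
    """Split allowed rules into those seen in host data and those never used."""
    rules = allowed_rules(security_groups)
    groups_of = groups_by_ip(instances)
    used: Set[Rule] = set()
    for src_ip, dst_ip, port in connections:
        for rule in rules:
            src_group, dst_group, low, high = rule
            if (src_group in groups_of.get(src_ip, [])
                    and dst_group in groups_of.get(dst_ip, [])
                    and low <= port <= high):
                used.add(rule)
    unused = [rule for rule in rules if rule not in used]
    return sorted(used), unused
```

Rules in the unused list are permitted but never exercised, which makes them natural candidates for review.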
Common Security Product Model
• Examples: antivirus (AV), file integrity monitoring (FIM), etc.
• “Management” station with client “nodes”
• Limited tagging or abstraction
• Strong “manager” and “managed” model
• Push and pull approaches
• Per-node licensing
“Thundering Herd”
• Mass deployments
• “Red/Black” push: concurrent clusters of 500+ nodes
• Elasticity driven by traffic spikes
• Licensing constraints
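Why licensing becomes a constraint: during a red/black push the new cluster is launched alongside the old one, which is assumed here to be kept running (disabled) until the new version is trusted, so for a while the fleet carries both, and an agent-based product sees 500+ new nodes register at once. A rough Python sketch, assuming one per-node license per running instance and a fleet that is otherwise constant during the push; the function names, step names, and numbers are illustrative only.

```python
from typing import List, Tuple

Step = Tuple[str, int]  # (push step, nodes running)


def red_black_timeline(old_size: int, new_size: int, rest_of_fleet: int) -> List[Step]:
    """Node counts at each step of one red/black push (illustrative model)."""
    return [
        ("before push", rest_of_fleet + old_size),
        ("new cluster launched", rest_of_fleet + old_size + new_size),
        ("traffic moved; old cluster disabled but kept", rest_of_fleet + old_size + new_size),
        ("old cluster terminated", rest_of_fleet + new_size),
    ]


def license_shortfall(timeline: List[Step], licenses_owned: int) -> int:
    """Extra per-node licenses needed to cover the peak of the push."""
    peak = max(nodes for _, nodes in timeline)
    return max(0, peak - licenses_owned)


if __name__ == "__main__":
    steps = red_black_timeline(old_size=500, new_size=500, rest_of_fleet=2000)
    for name, nodes in steps:
        print(f"{name}: {nodes} nodes")
    print("license shortfall:", license_shortfall(steps, licenses_owned=2600))
```

A license pool sized for steady state (2,500 nodes in this example) falls short at the peak of every push, even though the fleet returns to the same size afterward.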
Node Ephemerality and Service Abstraction
• Data related to individual nodes becomes less important
• Dealing with short-lived systems and IP and ID reuse
• Archiving events and logs while preserving data relationships
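One way to keep data relationships intact when IP addresses are reused: record which instance held each address over time, then resolve a log entry by address and timestamp together. A minimal Python sketch; `IpHistory` and its methods are hypothetical, timestamps are plain numbers (e.g., epoch seconds), and events are assumed to be recorded in chronological order per address.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple

Lease = Tuple[float, Optional[float], str]  # (start, end or None if still held, instance ID)


class IpHistory:
    """Remembers which instance held each IP address, and when (hypothetical helper)."""

    def __init__(self) -> None:
        self._leases: DefaultDict[str, List[Lease]] = defaultdict(list)

    def assign(self, ip: str, instance_id: str, start: float) -> None:
        """Record that instance_id obtained ip at time start."""
        self.release(ip, start)  # close any lease still open on this address
        self._leases[ip].append((start, None, instance_id))

    def release(self, ip: str, end: float) -> None:
        """Record that the current holder of ip gave it up at time end."""
        leases = self._leases.get(ip)
        if leases and leases[-1][1] is None:
            start, _, instance_id = leases[-1]
            leases[-1] = (start, end, instance_id)

    def owner_at(self, ip: str, when: float) -> Optional[str]:
        """Instance that held ip at time when, or None if nobody did."""
        leases = self._leases.get(ip, [])
        starts = [start for start, _, _ in leases]
        i = bisect.bisect_right(starts, when) - 1  # last lease starting at or before when
        if i < 0:
            return None
        _, end, instance_id = leases[i]
        if end is not None and when >= end:
            return None
        return instance_id
```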
Resource Usage Logging and Auditing
• Public-facing APIs make access controls more difficult and more important
• Programmable infrastructure needs robust logging and auditing capabilities
• Can metering data be repurposed for auditing?
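On the customer side, one complement to provider logging is to record every call an automation tool makes before it reaches the API. A minimal Python sketch, assuming a boto3-style client whose operations are called with keyword arguments; `AuditedClient` and the `audit` logger name are hypothetical, and a real deployment would ship these records somewhere tamper-resistant rather than to a local handler.

```python
import json
import logging
import time
from typing import Any

audit_log = logging.getLogger("audit")
audit_log.setLevel(logging.INFO)  # make sure INFO-level audit records are emitted


class AuditedClient:
    """Wrap an API client so every method call is logged before it is made."""

    def __init__(self, client: Any, principal: str) -> None:
        self._client = client
        self._principal = principal

    def __getattr__(self, name: str) -> Any:
        attr = getattr(self._client, name)
        if not callable(attr):
            return attr

        def audited(*args: Any, **kwargs: Any) -> Any:
            record = {
                "time": time.time(),
                "principal": self._principal,
                "action": name,
                "args": list(args),
                "params": kwargs,
            }
            # default=str keeps non-JSON values (dates, bytes) from breaking the log
            audit_log.info(json.dumps(record, default=str))
            return attr(*args, **kwargs)

        return audited
```

Usage would be along the lines of `iam = AuditedClient(boto3.client("iam"), principal="deploy-bot")`, after which `iam.create_user(UserName="alice")` behaves as before but leaves a JSON audit record.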
“Trusted Cloud”
• Various components for providing higher assurance and trust levels in the cloud
• Virtual TPM / hardware root of trust
• Controlled execution
• HSM in the cloud
Thanks!
Questions?
chan@netflix.com
(I’m hiring!)