Cloud Security At Netflix, October 2013
1. Cloud Security @ Netflix
October 25, 2013
Jay Zarfoss
(Cloud Security Guy @ Netflix)
2. This presentation
• What it covers:
– A discussion of what it means to fit security into the Netflix
Cloud universe
– A description of the past, present, and future Netflix
cloud security architecture
• What it (mostly) skips:
– The broader Netflix culture and architecture
– For generally cloudy topics, see Adrian Cockcroft’s
slideshare at www.slideshare.net/adrianco
– For general culture see www.slideshare.net/netflix
3. Netflix Company Profile
now via self service*
Instructions: Find your favorite BASH terminal and type the following:
> UPDATED_SIZE=`curl ir.netflix.com | perl -ne 's/ / /g; if(/\d+ million members in \d+ countries/){print "$&";}'`
> echo "Netflix is the world's leading Internet subscription service for enjoying TV and movies, with more than ${UPDATED_SIZE}"
*No whining; remember that you’ll never again need to wait for me to update this slide
like you had to wait for database access when you started at your last job.
4. Our Cloudy Culture
Agile
Unsynchronized
No waiting
Dynamic
Redundant
*aaS
Decentralized
Ephemeral
Open Source
Freedom
Self Service
NoSQL
Rapid
Decoupled
Chaotic
Resilient
*These are not terms normally associated with security or security
architectures, yet we adopt all of them for security development (with
some perspective, of course).
5. “But how can you trust the Cloud?”
• This is simply an old question rephrased for
the new generation of computing.
– How can you trust the CPU?
– How can you trust the OS?
• Security design often requires
trust of the lower layer.
– Even though they’ve all let us
down at some point before.
– And “trust” does not mean “blind faith”
6. “But we have special requirements”
• Frankly, they’re probably not that special
– You can fail pretty much any requirement with or
without using cloud methodologies
– 67% of 670 surveyed companies fail PCI compliance*
• The core AWS services (EC2, S3, ELB) meet PCI
DSS 2.0 compliance**
– It’s generally assumed that the more exotic features
(DynamoDB) will be getting compliance sooner rather
than later -- So why not offload some of that
compliance work?
*http://www.informationweek.com/security/management/67-of-companies-fail-credit-card-securit/229401946
**http://www.slideshare.net/CloudPassage/aws-slides-pci-20130124
7. The Security Conflict
• Goal: prevent us from hurting ourselves, while
not preventing us from moving quickly and
being flexible.
8. Perspective, Perspective, Perspective
• No one will worry about you getting hurt
playing paintball in a bomb disposal suit.
But then, you’ll almost certainly lose the game.
• Bomb technicians don’t wear paintball suits.
Even if they are easier to work in.
9. Further Security Caveats
Technology alone will never prevent malicious insiders
from doing damage. (Never has this sentiment been
more relevant.)
Smart professionals will use safer tools when they’re
available (so let’s give them those tools)!
10. What do good tools look like?
• Intuitive yet powerful GUIs that shield you
from stumbling over the secrets
– Integrate with single sign-on to keep out your kids
and track you down when (not if) you screw up
• Powerful APIs to do just about everything…
– Except what there’s no legitimate use case for
11. Reflections on Better APIs
The Cloud Offers Incredible APIs so developers can call
upon new hardware with a single line of code.
With great power comes great responsibility.
12. Packets from the sky
Don’t worry, it’s just rain…
• Your own trust of software running on a cloud
instance should ideally be predicated on some
cryptographically authenticated material.
– Ironically, your cloud provider wants to do the same
thing, since they don’t want you denying your bill…
• Not long ago, there was no way to do this other
than deploying these keys yourself in your own
build pipeline.
– Thus, your security was only as nimble as your build
and deployment system. Maybe ok. Probably much
slower than you want/need it to be.
13. Deploying AWS keys, the Legacy Way
“That was in the before time, in the long long ago… (alright, it was 2011)”
Presumably, your machines in the cloud are running code that actually wants
to do something against the Cloud Provider’s API. E.g. Read/write to a
database. Legacy AWS paradigm is that all of these operations need to be
authenticated by signing (HMACing) with access keys.
(Amazon’s term: “credentials”; my term: “AWS Keys”).
//fortunately, AWS provides helper objects that do most of the work
BasicAWSCredentials cred =
new BasicAWSCredentials("accessID", "secretKeyID");
AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
//ugly HMAC generating code safely tucked away in here somewhere
client.listDomains();
Sure.. But how did “accessID” and “secretKeyID” get on the machine?
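For readers who have never looked inside the "ugly HMAC generating code", here is a minimal sketch of the core idea using only the JDK. It is not the AWS signing protocol: the request string below is made up for illustration, and the real SDK canonicalizes requests in a more involved way. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative only: shows the HMAC step, not the actual AWS canonical request format.
class HmacSigningSketch {

    // Returns the Base64-encoded HMAC-SHA256 of stringToSign under secretKey.
    static String sign(String secretKey, String stringToSign) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(
                    secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] raw = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            // HmacSHA256 ships with every JDK, so this is not expected in practice.
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical request text, invented for this sketch.
        String request = "GET\nsdb.amazonaws.com\n/\nAction=ListDomains";
        System.out.println("signature = " + sign("sesame", request));
    }
}
```

The point to take away: anyone holding the secret key can produce a valid signature, which is why where the key lives matters so much in the slides that follow.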
14. 1st Attempt: Stick them in a system property
// if it makes you feel better, let’s pretend I obfuscated this
BasicAWSCredentials cred =
new BasicAWSCredentials(
System.getProperty("accessID"),
System.getProperty("secretKeyID"));
AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
client.listDomains();
• This… works… I guess… but what happens if the
key gets out?*
– Rebake hundreds of AMIs
– Redeploy thousands of Machines
• Requires all hands on deck and a big fiasco.
*Thanks to supplemental security controls, like ip-whitelisting, this may not be quite as horrible as it sounds. Still bad.
15. 2nd Try: Load Keys At Runtime (Better?)
• Fits nicely into Cloud Platform “whatever”-aaS layer.
– Security Groups can enforce who can make the request.
– And makes a pretty tidy REST call:
GET server/getAWSKey
<AWSKEY>
<accessKeyID>open</accessKeyID>
<secretKey>sesame</secretKey>
</AWSKEY>
• What happens when the subaccount associated with
the key gets accidentally deleted?
– Update the key in AWS console and then swap the key in
the key servers (technically easy; will still get your heart
pumping when you do it for real – trust me!)
– You may still have to reboot a lot of machines! But why?
16. Objects, like peaches, are sticky.
(Still delicious.)
RESTfulObj AWSKey = RESTService.get("server/getAWSKey");
BasicAWSCredentials cred =
new BasicAWSCredentials(
AWSKey.getAccessID(),
AWSKey.getSecretKey());
AmazonSimpleDBClient client = new AmazonSimpleDBClient(cred);
client.listDomains();
The mindful Object-Oriented programmer will tend to keep this object
around rather than re-creating it all the time. (Trust me).
Guess what object caches the AWS Keys.
17. Promote Safer Foods.
// provider paradigm dynamically asks for keys every time
AWSCredentialsProvider prov = new AWSCredentialsProvider(){
    public AWSCredentials getCredentials(){
        RESTfulObj AWSKey =
            RESTService.get("server/getAWSKey");
        return new BasicAWSCredentials(
            AWSKey.getAccessID(),
            AWSKey.getSecretKey());
    }
    // the interface also declares refresh(); nothing is cached here, so it is a no-op
    public void refresh(){}
};
AmazonSimpleDBClient client = new AmazonSimpleDBClient(prov);
client.listDomains();
No cached key (yay!). But… good luck chasing everyone around with a
broomstick making them write their code this way.
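To make the "sticky" problem concrete without the AWS SDK on the classpath, here is a small JDK-only sketch. Everything in it is a stand-in: `keyServer` plays the key service, `SnapshotClient` plays a client built from a credentials object, and `ProviderClient` plays a client built from a provider. None of these names come from the SDK.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class StickyKeySketch {
    // Stand-in for the key server: holds whatever key is current.
    static final AtomicReference<String> keyServer = new AtomicReference<>("key-v1");

    // Copies the key once at construction time (the "sticky" pattern).
    static class SnapshotClient {
        private final String key;
        SnapshotClient(String key) { this.key = key; }
        String keyInUse() { return key; }
    }

    // Asks the provider for the key on every call.
    static class ProviderClient {
        private final Supplier<String> provider;
        ProviderClient(Supplier<String> provider) { this.provider = provider; }
        String keyInUse() { return provider.get(); }
    }

    public static void main(String[] args) {
        SnapshotClient sticky = new SnapshotClient(keyServer.get());
        ProviderClient fresh = new ProviderClient(keyServer::get);

        keyServer.set("key-v2"); // the key is rotated on the server

        System.out.println("snapshot client uses: " + sticky.keyInUse());
        System.out.println("provider client uses: " + fresh.keyInUse());
    }
}
```

The snapshot client keeps using the old key until the process restarts, which is the "reboot a lot of machines" problem; the provider client picks up the rotation on its next call.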
18. Systematically enforce Refresh.
Or: Revoke Privileges for unsafe food altogether
• Only issue temporary keys good for a few hours
(> your longest conceivable operation)
– AWS Mechanism to do this: (AWSSecurityTokenService)
GET server/getAWSKey
<AWSKEY>
<accessKeyID>open</accessKeyID>
<secretKey>sesame</secretKey>
<expires>1352083995</expires>
</AWSKEY>
• Simple, but with powerful consequences, e.g. what if keys are
accidentally written to logs, or a backup is lost?
– Disadvantages? (I would argue materially none)
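On the client side, temporary keys imply a cache that notices expiry and refetches. The sketch below is one way that could look, assuming the server returns an expiry in epoch seconds as in the `<expires>` field above. The class names, the safety margin, and the injected clock are all assumptions made for illustration; this is not how the AWS SDK or the Netflix key server implements it.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class ExpiringKeySketch {
    // A key plus its expiry in seconds since the epoch.
    static final class TempKey {
        final String accessKeyId;
        final String secretKey;
        final long expiresEpochSec;
        TempKey(String accessKeyId, String secretKey, long expiresEpochSec) {
            this.accessKeyId = accessKeyId;
            this.secretKey = secretKey;
            this.expiresEpochSec = expiresEpochSec;
        }
    }

    // Caches a key and refetches once it is within marginSec of expiring.
    static final class RefreshingKeyCache {
        private final Supplier<TempKey> fetcher; // e.g. a call to the key server
        private final LongSupplier clock;        // current time in epoch seconds
        private final long marginSec;
        private TempKey current;

        RefreshingKeyCache(Supplier<TempKey> fetcher, LongSupplier clock, long marginSec) {
            this.fetcher = fetcher;
            this.clock = clock;
            this.marginSec = marginSec;
        }

        synchronized TempKey get() {
            long now = clock.getAsLong();
            if (current == null || now >= current.expiresEpochSec - marginSec) {
                current = fetcher.get();
            }
            return current;
        }
    }

    public static void main(String[] args) {
        LongSupplier clock = () -> System.currentTimeMillis() / 1000;
        // Hypothetical fetcher: hands out a key valid for one hour.
        RefreshingKeyCache cache = new RefreshingKeyCache(
                () -> new TempKey("open", "sesame", clock.getAsLong() + 3600),
                clock, 300);
        System.out.println("expires at " + cache.get().expiresEpochSec);
    }
}
```

Because the server stops honoring a key after its expiry, a client that forgets to refresh simply fails, which is the systematic enforcement the slide title asks for.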
19. Abracadabra at Runtime (Best)
http://aws.typepad.com/aws/aws-iam/
• June 11th 2012: Amazon introduces temporary AWS
Security Credentials via Metadata Service
– On-demand access keys via Amazon API; expire quickly
– Effectively, Amazon is hosting the key server and only
giving keys to your cloud instances.
– Predefined “roles” determine the permissions of the keys
– Wish we had had this when we first moved to the cloud.
• Still useful to have your own key server, why?
– For one, developers will chase you down with
pitchforks if they can’t run against the cloud API at
their desk. (And they’d have every right to…)
20. IAM Role configuration via Asgard
View into Asgard Launch
configuration assigning a Role
which determines the
permissions of the key an
instance will receive via IAM
paradigm.
21. New Ways to Hide All Your Keys
http://aws.typepad.com/aws/2013/04/variables-in-aws-access-control-policies.html
• April 3, 2013: Amazon introduces variables in
AWS access control policies.
– Provides an obvious place to store sensitive
nuggets your software needs to work
// one ACL to rule them all
{
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Effect": "Allow",
"Resource": ["arn:aws:s3:::mybucket/myclientsoftware.${aws:userid}.keystore"]
}
Just apply the right role to your auto-scale group and you’re done!
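A rough mental model of what the variable does: at evaluation time the `${aws:userid}` placeholder is replaced with the caller's identity, so a single policy yields a different permitted object per caller. The sketch below only mimics that substitution locally with plain string handling; the user IDs are made up, and real policy evaluation happens inside AWS with more rules than shown here.

```java
class PolicyVariableSketch {
    static final String RESOURCE_TEMPLATE =
            "arn:aws:s3:::mybucket/myclientsoftware.${aws:userid}.keystore";

    // Substitutes the caller's ID for the policy variable.
    static String expand(String resourceTemplate, String userId) {
        return resourceTemplate.replace("${aws:userid}", userId);
    }

    // True if the requested ARN is exactly the one this caller's expansion permits.
    static boolean allows(String resourceTemplate, String userId, String requestedArn) {
        return expand(resourceTemplate, userId).equals(requestedArn);
    }

    public static void main(String[] args) {
        // "alice" and "bob" are hypothetical user IDs for illustration.
        String aliceStore = expand(RESOURCE_TEMPLATE, "alice");
        System.out.println(aliceStore);
        System.out.println(allows(RESOURCE_TEMPLATE, "bob", aliceStore));
    }
}
```

Each caller can reach only the keystore named after its own identity, even though every caller shares the same policy text.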
22. Secure Bootstrapping (still) frustrating
// at least now there’s a reasonable place to put the file
-Djavax.net.ssl.keyStore=<file smartly loaded from ACL-limited store>
• Options are better today with new ACL Rules
• But…
– What if I want to hot-swap these? Wait, you mean I
have to write them to a file and restart?! Yuck!!
• Unfortunate artifact of software designed for the datacenter
where machines stay put for a long time
– One mistake in the AWS console and my keystore file
(complete with SSL private keys) is open to the world?
• If your eyelid isn’t twitching, it should be.
23. So… we still want our own tools
// whenever you find yourself writing code like this,
// I hope you’re asking yourself if the keys aren’t
// left sitting on the kitchen counter
cipherContext = factory.getCipherContext("algorithm", "keyName");
• (Most) developers don’t want to think about
where this key lives. So let’s have the library
worry about that for them.
• Some keys are more important than others
– “oh, shit” vs. “OH SHIT”
24. Custom Cloud Key Management
Don’t leave your child in the middle of a busy
intersection.
25. Netflix Key Management
• All sorts of business cases require keying
material:
– Password reset tokens
– Encrypting sensitive databases
– Authenticating Netflix Ready Devices (NRDs)
– DRM keys
• I’m not having the DRM debate here; so don’t try
– Symmetric, Asymmetric, HMAC keys, ….
• So how do you handle those keys?
– Depends. (Paintballs or Pipebombs?)
26. Cryptex Service
• Without going into too much detail, Cryptex is
our *aaS for key management with associated
client libraries in Java and Python.
– We worry about where the keys live
• So you (Mr./Ms. big data person) don’t have to
– Flexible, dynamic, auto-scaling, fast moving
• Except when it’s not supposed to be
• Future/Ongoing work
– Better integrating this into Datacenter-y software that
wants fixed static things is a constant challenge and
requires lots of new plumbing – wanna help?!
27. Variations in Key Handling
• Low: Key is provided to the edge service instance
– Virtually unlimited throughput, resistant to any
backend service outages
• Medium: Key stays on the single-purpose Netflix
key management servers; each instigating crypto
operation is a REST call (small data is better!)
– Key never lives on a customer facing server
(one nasty bug or “oops” won’t cause exposure)
• High: Keys live in specialized hardware (HSM)
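The three tiers can be summarized as a small lookup that a key-management client library might consult before deciding where a crypto operation runs. This is a hedged sketch only: the enum, the registry, and the key names are invented for illustration and say nothing about how Cryptex is actually built.

```java
import java.util.HashMap;
import java.util.Map;

class KeyTierSketch {
    enum Sensitivity { LOW, MEDIUM, HIGH }

    // Where the key material sits for each tier, per the list above.
    static String keyLocation(Sensitivity s) {
        switch (s) {
            case LOW:    return "edge service instance";
            case MEDIUM: return "key management server (one REST call per operation)";
            case HIGH:   return "hardware security module";
            default:     throw new IllegalArgumentException("unknown tier: " + s);
        }
    }

    // Only LOW keys are ever handed to a customer-facing server.
    static boolean keyOnCustomerFacingServer(Sensitivity s) {
        return s == Sensitivity.LOW;
    }

    // Hypothetical registry mapping key names to tiers.
    static final Map<String, Sensitivity> REGISTRY = new HashMap<>();

    static Sensitivity tierOf(String keyName) {
        Sensitivity s = REGISTRY.get(keyName);
        if (s == null) {
            throw new IllegalArgumentException("unregistered key: " + keyName);
        }
        return s;
    }

    public static void main(String[] args) {
        // Made-up key names; the tier assignments are examples, not Netflix's.
        REGISTRY.put("session-cookie-hmac", Sensitivity.LOW);
        REGISTRY.put("device-auth", Sensitivity.HIGH);
        for (String name : REGISTRY.keySet()) {
            System.out.println(name + " -> " + keyLocation(tierOf(name)));
        }
    }
}
```

Keeping the tier in a registry rather than in calling code means the developer asks for a key by name and the library decides whether the operation runs locally, over REST, or in the HSM.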
28. Netflix Global Crypto Ops/Sec
• Low (< 1ms latency)
– It’s a (really) big number. And highly variable.
• Medium (~ 4ms latency)
– Tens of thousands of operations/sec at daily peak
(number is shrinking as we get smarter with our
protocols which favor low sensitivity keys)
• High (~ 10ms latency)
– Over one thousand operations/sec at daily peak
29. (Fairly) Common Dialogue
Big Data Developer:
I’m working on super-cool new feature, X. And
it will use some crypto and need some keys.
Which sensitivity of key do I want?
Me:
Tell me the story of what happens when we lose
the key somehow.
30. Various Key Loss Scenarios
Low: We’d rotate the key via one button-push and customers
wouldn’t notice an impact; minimal damage control.
Medium: We’d rotate the key and the whole team would have to
work for a week straight cleaning up the mess created.
High: I don’t want to talk about it.
Let the Cloud help you along the way….Early and automated
detection, combined with fast-reaction means more keys can be
low/medium sensitivity (less resource intensive).
Design your new system to be able to use LOW keys
for the bulk of the heavy lifting!!
31. AWS CloudHSM
http://aws.typepad.com/aws/2013/03/aws-cloud-hsm-secure-key-storage-andcryptographic-operations.html
• March 26th 2013, AWS announces availability of
Safenet-manufactured CloudHSMs to general
cloud-computing public.
– Old-skool industry standard security solution…
without the need for your IT people to baby sit.
– All the right acronyms: FIPS 140-2, CC EAL 4+
– Amazon has no way to recover your keys (do please
take care not to lose them)
– Single tenant
• This is the new home for our high sensitivity keys.
33. Why are we sharing?
• In a sense, Netflix benefits when other cloud
users and cloud vendors follow common paths.
– Problems will invariably pop up, but when they
occur in industry-standard practices,
everyone shares the load of getting them fixed.
• Example of a great benefit of common practice
– TLS has become industry standard for secure
transport, but has had its lumps lately (BEAST, RC4)*
– Because it affects everyone, we’re all motivated to
look for solutions and share those costs
*http://blog.cryptographyengineering.com/2011/12/whats-deal-with-rc4.html
34. Security and Flexibility don’t always have to be
at odds with each other…
Security can fit in a fast-changing
environment where flexibility is paramount.
The trick is to leverage the same flexibility to
allow the Security to keep up.