The Netflix recipe for migrating your organization from building a datacenter-based product to a cloud-based product. First presented at the Silicon Valley Cloud Computing Meetup "Speak Cloudy to Me" on Saturday, April 30th, 2011.
1. Moving Your Organization To Public Cloud
April 30th, 2011
Adrian Cockcroft
@adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
2. With a hop, skip and jump into public cloud…
Prototype to get familiar with cloud
Convince Managers of cloud value
Get Developers comfortable with new tools
Incremental deployment strategies
6. Capacity Planning in Clouds
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases, can’t be shrunk easily
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
8. Data Center
Netflix could not build new datacenters fast enough
Capacity growth is accelerating, unpredictable
Product launch spikes - iPhone, Wii, PS3, XBox
9. Which Cloud? What Matters?
• Scalability over the full range
– Small scale – trivial sign up and low cost to learn
– Large scale – deploy 1000’s of systems per hour
• Large and Mature Feature Set
– Less work to do yourself
– Well understood and robust
• Large Developer Community
– Easy to find expert staff
– Lots of tools and open source support
10. Cloud Portability?
• Platform vendor lock-in vs. Cloud vendor lock-in
– Who do you trust for the long term?
– How likely, how much effort to switch vendors?
• Portable tools and platform issues
– Lowest common denominator portability
– Slow to add advanced features, abstraction conflicts
• Reach Around the Platform
– Access to underlying features creeps in
– You aren’t really portable in the end…
11. What About Cost?
• Explicitly a non-goal
– Don’t distract the developers, catch exceptions only
– Expect costs to decline over time as market matures
• Cloud costs are fully burdened
– Includes facilities, power, staffing, automation
– No charges for idle and obsolete systems
• Opportunity Costs
– Dramatically simpler and faster decision making
– How much is manager/executive attention span worth?
13. Leverage AWS Scale
“the biggest public cloud”
AWS investment in tooling and automation
Use AWS zones and regions for high availability, scalability and global deployment
14. Leverage AWS Feature Set
“the market leader”
EC2, S3, SDB, SQS, EBS, EMR, ELB, ASG, IAM, RDB, VPC…
15. “The cloud lets its users focus on delivering differentiating business value instead of wasting valuable resources on the undifferentiated heavy lifting that makes up most of IT infrastructure.”
Werner Vogels, Amazon CTO
17. Devops
• Developers who own their code in production
• Ops staff who can write code and tools
• How do they bootstrap into cloud?
– All key tools are open source or in the cloud
– Trivial $ investment to learn AWS, NoSQL etc.
– No excuse to not have it on your resume…
18. Implications for IT Operations
• Cloud is run by developer organization
– Our IT department is the AWS API
– We have no IT staff working on cloud
• Cloud capacity is much bigger than Datacenter
– Datacenter oriented IT staffing is flat
– We have moved a few people out of IT to write code
• Traditional IT Roles are going away
– Don’t need SA, DBA, Storage, Network admins
21. “In the datacenter, robust code is best practice. In the cloud, it’s essential.”
22. Port to Cloud Architecture
Short term investment, long term payback!
Pay down technical debt
Robust patterns
23. Transition
• The Goals
– Faster, Scalable, Available and Productive
• Anti-patterns and Cloud Architecture
– The things we wanted to change and why
• Developer Transitions and Tools
– Cloud Bring-up Strategy
24. Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from meeting our goals?
25. Old Datacenter vs. New Cloud Arch
Old Datacenter | New Cloud Arch
Central SQL Database | Distributed Key/Value NoSQL
Sticky In-Memory Session | Shared Memcached Session
Chatty Protocols | Latency Tolerant Protocols
Tangled Service Interfaces | Layered Service Interfaces
Instrumented Code | Instrumented Service Patterns
Fat Complex Objects | Lightweight Serializable Objects
Components as Jar Files | Components as Services
26. Tools and Automation
• Developer and Build Tools
– Jira, Perforce, Eclipse, Jeeves, Ivy, Artifactory
– Builds, creates .war file, .rpm, bakes AMI and launches
• Custom Netflix Application Console
– AWS Features at Enterprise Scale (hide the AWS security keys!)
– Auto Scaler Group is unit of deployment to production
• Open Source + Support
– Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS
– Soon? Twitter Rainbird http://techcrunch.com/2011/02/04/twitter-rainbird/
• Monitoring Tools
– AppDynamics – Developer focus for cloud http://appdynamics.com
– EpicNMS – flexible data collection and plots http://epicnms.com
27. Cloud Developers JFDI Boot Camp
• Concentrated Stretch Goal
– Built a rough prototype working web site in test account
– Room full of engineers sharing the pain for 1-2 days
• Hands-on in the cloud with a new code base
– Debug lots of tooling and conceptual issues very fast
– Try out architectures and patterns, throwaway, no risk
• Whiteboard and Wiki Pages – Built During Boot Camp
– What core objects already exist, how to make your own
– What components already exist or are work in progress
28. Developer Instances Collision
• Development in shared test account
• Shared data sources and most services
• Sam and Rex both want to deploy web front end
• Who wins?
Diagram labels: Sam, Rex, web in test account
29. Developer Service Stacks
• Developer specific service instances
– Configured via Java properties at runtime
– Routing implemented by REST client library
• Server Configuration
– Configure discovery service “stack” string
– Registers as <appname>-<stack>
• Client Configuration
– Route traffic on per-service basis including stack
30. Per-Service Stack Routing
Developers choose what to share
Sam: web-sam, backend-dev
Rex: web-rex, backend-dev
Mike: web-dev, backend-mike
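The stack routing on slides 29 and 30 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Netflix implementation: the `Registry` class, the `resolve` function, the IP addresses, and the rule that unlisted services use a shared "dev" stack are all hypothetical. Only the `<appname>-<stack>` naming and the instance names come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for a discovery service."""
    instances: dict = field(default_factory=dict)

    def register(self, appname, stack, address):
        # Servers register under "<appname>-<stack>", as on slide 29.
        name = f"{appname}-{stack}"
        self.instances[name] = address
        return name


def resolve(registry, service, routes, default_stack="dev"):
    """Return (registered name, address) for a service.

    `routes` is one developer's client configuration: service -> stack.
    Services not listed use `default_stack` (an assumed convention).
    """
    stack = routes.get(service, default_stack)
    name = f"{service}-{stack}"
    return name, registry.instances[name]


registry = Registry()
for app, stack, addr in [
    ("web", "sam", "10.0.0.1"),
    ("web", "rex", "10.0.0.2"),
    ("web", "dev", "10.0.0.3"),
    ("backend", "dev", "10.0.1.1"),
    ("backend", "mike", "10.0.1.2"),
]:
    registry.register(app, stack, addr)

# Sam overrides only the web tier; Mike overrides only the backend.
sam_routes = {"web": "sam"}
mike_routes = {"backend": "mike"}

print(resolve(registry, "web", sam_routes))
print(resolve(registry, "backend", sam_routes))
print(resolve(registry, "backend", mike_routes))
```

With this layout Sam and Rex no longer collide on the web front end, while both still share the one `backend-dev` instance.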
32. Shadow Traffic Redirection
• First traffic sent to cloud
– Real traffic stream to validate cloud back end
– Uncovered lots of process and tools issues
– Uncovered Service latency issues
• TV Device calls Datacenter API
– Returns Genre/movie list for a customer
– Asynchronously duplicates request to cloud
– Start with send-and-forget mode, ignore response
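The send-and-forget duplication on slide 32 could look something like the sketch below. It is an assumption-laden illustration: `handle_request`, the thread pool, and the two fake back ends are hypothetical names, and a real service would issue HTTP calls rather than call local functions. The customer is always answered from the datacenter, and the cloud copy of the request runs in the background with its response and any errors discarded.

```python
from concurrent.futures import ThreadPoolExecutor

# Small background pool used only for shadow requests (size is arbitrary).
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def handle_request(customer_id, datacenter_call, cloud_call):
    """Answer from the datacenter and shadow the same request to the cloud."""

    def shadow():
        try:
            cloud_call(customer_id)  # response is deliberately ignored
        except Exception:
            pass  # a cloud failure must never affect the customer

    _shadow_pool.submit(shadow)  # send-and-forget: the future is not awaited
    return datacenter_call(customer_id)


# Hypothetical back ends for demonstration.
shadow_log = []


def fake_datacenter(customer_id):
    return ["Action", "Comedy"]


def fake_cloud(customer_id):
    shadow_log.append(customer_id)
    raise RuntimeError("cloud back end not ready yet")


result = handle_request(42, fake_datacenter, fake_cloud)
_shadow_pool.shutdown(wait=True)  # only so this demo finishes deterministically
print(result, shadow_log)
```

Because the shadow call cannot change the response, the cloud back end can be broken or slow while it is being validated against real traffic.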
33. Shadow Redirect Instances
Datacenter Instances: Modified Datacenter Service
Cloud Instances: Modified Cloud Service
One request per visit
Data Sources: queueservice, videometadata
36. First Page
• First full page – Starz Channel Genre
– Simplest page, no sub-genres, minimal personalization
– Lots of investment in new Struts based page design
• New “merchweb” front end instance
– movies.netflix.com points to merchweb instance
• Uncovered lots of latency issues
– Used memcached to hide S3 and SimpleDB latency
– Improved from slower to faster than Datacenter
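One common way to hide slow-store latency behind a cache, as slide 36 describes for memcached in front of S3 and SimpleDB, is a read-through lookup. The sketch below is only an illustration of that pattern: `TTLCache` is a hypothetical in-process stand-in for memcached, `slow_metadata_store` stands in for the remote store, and the 60-second expiry is an arbitrary assumption.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcached (get/set with expiry)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries count as misses
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def read_through(cache, key, slow_loader, ttl_seconds=60):
    """Return the cached value, calling the slow store only on a miss."""
    value = cache.get(key)
    if value is None:
        value = slow_loader(key)
        cache.set(key, value, ttl_seconds)
    return value


calls = []


def slow_metadata_store(key):
    # Hypothetical stand-in for a high-latency S3 or SimpleDB read.
    calls.append(key)
    return {"id": key, "title": "example"}


cache = TTLCache()
first = read_through(cache, "video:1", slow_metadata_store)
second = read_through(cache, "video:1", slow_metadata_store)
print(first == second, calls)
```

Only the first lookup pays the slow-store cost; repeated reads within the expiry window are served from memory.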
37. Starz Page Cloud Instances
Front End: merchweb
Middle Tier: starz, memcached
Data Sources: queueservice, rentalhistory, videometadata
Multiple requests per visit
38. Controlled Cloud Transition
• WWW calling code chooses who goes to cloud
– Filter out corner cases, send percentage of users
• Redirect if Needed
– The URL that customers see is http://movies.netflix.com/WiContentPage?csid=1
– If problem, redirect to old Datacenter page http://www.netflix.com/WiContentPage?csid=1
• Play Button and Star Rating Action redirect
– Point URLs for actions that create/modify data back to datacenter to start with
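The selection logic on slide 38 might be sketched as below. The slides only say that the calling code filters out corner cases and sends a percentage of users to the cloud; hashing the customer ID into 100 buckets, the `choose_url` function, and the `is_corner_case` and `cloud_healthy` flags are assumptions made for illustration. The two URLs are the ones shown on the slide.

```python
import hashlib

CLOUD_URL = "http://movies.netflix.com/WiContentPage?csid=1"
DATACENTER_URL = "http://www.netflix.com/WiContentPage?csid=1"


def bucket(customer_id):
    """Map a customer to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_url(customer_id, cloud_percent, is_corner_case=False, cloud_healthy=True):
    """Pick the page URL for one customer.

    Corner cases and any detected cloud problem fall back to the old
    datacenter page; otherwise `cloud_percent` of buckets go to the cloud.
    """
    if is_corner_case or not cloud_healthy:
        return DATACENTER_URL
    if bucket(customer_id) < cloud_percent:
        return CLOUD_URL
    return DATACENTER_URL


print(choose_url("customer-123", cloud_percent=0))
print(choose_url("customer-123", cloud_percent=100))
print(choose_url("customer-123", cloud_percent=100, cloud_healthy=False))
```

Hashing keeps each customer on the same side between visits, so raising `cloud_percent` only ever moves additional customers to the cloud.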
39. Big-Bang Transition
• iPhone Launch (August/Sept 2010)
– Not enough capacity in the datacenter, cloud only
– App Store gates release, one shot, can’t back out
• SOASTA Cloud Based Load Generation
– Has to work at large scale on day one
– Stress test API and end-to-end functionality
40. WWW Page by Page
• 2010 Gradual Migration from Datacenter
– Add pages as dependent services come online
– Home page – most complex and highest traffic
• 2011 Clean up stragglers and dependencies
– Shut down entire datacenter service tiers
– Move developer focus totally to cloud
41. Hop, Skip, Jump
• Move yourself
• Move your management and colleagues
• Move your developers and devops
• Move your product
42. Takeaway
Hop, skip, jump…… splash!
Come on in, the water’s fine, just a bit cloudy.
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
43. Amazon Cloud Terminology Reference
See http://aws.amazon.com/ This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
– Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
– Reserved Instances – pre-paid to reduce cost for long term usage
– Availability Zone – datacenter with own power and cooling hosting cloud instances
– Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Store (network disk filesystem that can be mounted on an instance)
• RDB – Relational Data Base (managed MySQL master and slaves)
• SDB – Simple Data Base (hosted http based NoSQL data store)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
• IAM – Identity and Access Management (fine grain role based security keys)