Jeremy Edberg (MinOps) - How to build a solid infrastructure for a startup that is cheap and can scale

DevOps for Startups
Email: {anything}@jedberg.net
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
Jeremy Edberg
Founder and CEO
MinOps
https://minops.com 
https://sql.bot

If it won’t scale, it'll fail.

The key to scaling is finding the bottlenecks before your users do.

Why should we
learn from other
people’s mistakes?

Takeaways
• Infrastructure as Code
• Microservices/Serverless
• Queuing Theory
• Chaos Engineering
• Logs
• Incident reviews

Infrastructure as Code
• Changes are routine, small,
easy, and repeatable
• Resources are easily
managed by users and
disposable
• Enables continuous
deployment and
improvement
• Solutions can be easily
tested, measured, and then
rolled back

Infrastructure as Code Challenges
• Losing track of servers
and resources
• Configuration drift
• Snowflakes
• Fear of a fully
automated system
(lack of trust in oneself)
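Configuration drift, one of the challenges listed above, can be caught by comparing the declared state with what is actually running. This is a minimal sketch, assuming both states can be expressed as flat dictionaries; `find_drift` and every key and value below are hypothetical and do not come from the talk or from any particular tool.

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (declared_value, found_value)} for every setting that differs."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical declared state (what is checked into version control)
desired = {"instance_type": "t2.small", "nginx_version": "1.12", "open_ports": (80, 443)}
# Hypothetical observed state (after someone changed a server by hand)
actual = {"instance_type": "t2.small", "nginx_version": "1.10", "open_ports": (80, 443, 8080)}

for key, (want, have) in sorted(find_drift(desired, actual).items()):
    print(f"{key}: declared {want!r}, found {have!r}")
```

Running a check like this on a schedule is one way to spot snowflakes before they cause trouble.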

Automate all the things!
http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

Automate all the things!
• Application startup
• Configuration
• Code deployment
• System deployment

Physical Servers
Test and prod are different
Prod is in need of constant updates
Slow iteration and deployment
Polyglot unfriendly
Deploy in weeks, live for years

Virtual Machines
Prod is immutable
Rapid iteration and deployment
Multi-tenancy
Polyglot friendly
Deploy in minutes, live for weeks

Containers
Test and prod are the same
Prod is immutable
Rapid(er) iteration and deployment
High multi-tenancy
Polyglot friendly
Deploy in seconds, live for hours

Serverless (λ)
Smallest unit of compute
Super scalable
Rapid iteration
Extreme multi-tenancy
Very polyglot friendly
Easier to collaborate
Deploy independently, live for seconds

A whole lot of choices
Amazon’s Ecosystem
Hodgepodge of services

Amazon’s Serverless Ecosystem
Lambda
SNS
DynamoDB
SQS
S3
Kinesis

What is serverless anyway?
• There are still servers, you just don’t
manage them anymore
• It also means you don’t access them
anymore
• So you don’t need to (or get to)
optimize them.

Serverless computing is all about
speeding up development by allowing
rapid iteration and removing
management overhead

Choosing your unit of compute
• VMs (EC2)
Machine as the unit of scale
Abstracts the hardware
• Containers (ECS)
Application as the unit of scale
Abstracts the OS
• Serverless (Lambda)
Functions as the unit of scale
Abstracts the language runtime

How do I choose?
• VMs (EC2)
“I want to configure machines,
storage, networking, and my
OS”
• Containers (ECS)
“I want to run servers,
configure applications, and
control scaling”
• Serverless (Lambda)
“Run my code when it’s
needed”
I didn’t write
the software
myself

Advantages to a Monorepo
• No worrying about
dependencies
• Don’t have to account for
data movement
• Deployments are simple
• Coordination is easy

Success follows a standard pattern
Monolithic
Multiple services
Internal Microservices Platform

Build → Test → Release
Developer Deployment Pain: High
DevOps Deployment Pain: Medium

Build → Test → Release (five pipelines in parallel)
Developer Deployment Pain: Medium
DevOps Deployment Pain: High

Build → Test → Release (five pipelines in parallel, plus three λ functions)
Developer Deployment Pain: Low
DevOps Deployment Pain: 🔥

Advantages to a Service Oriented
Architecture
• Easier auto-scaling
• Easier capacity planning
• Identify problematic code-paths more easily
• Narrow down the effects of a change
• More efficient local caching

Highly aligned, loosely coupled
• Services are built by different
teams who work together to
figure out what each service
will provide.
• The service owner publishes
an API that anyone can use
and that returns proper response
codes

Distributed Computing and a
Distributed Workforce
• The two go hand in hand
when you have a good
distributed systems culture
• Microservices and Micro
Teams

Proper Microservices Architecture
• Service and Resource Discovery
• Network and Traffic Config
• Automated Testing
• Continuous Deployment
• Security
• Monitoring and Alerting

Mature companies spend 25% of their
engineering resources on their internal platform

Building an internal
microservices platform is hard
And when you’re done it is only “good
enough”

What do all the parts
of microservices have
in common?

Servers
Capacity planning
Right-sizing
Autoscaling
Load and performance
Patches
Tuning
Configuration
Utilization
Access control
Packages and AMIs

Serverless
Right-sizing
Autoscaling
Load and performance
Patches
Tuning
Configuration
Utilization
Access control
Packages and AMIs
Fully managed
Continuous Scaling
Function is the deployment unit
Capacity planning

Proper Microservices Architecture
• Automated Testing
• Continuous Deployment
• Security

Security
• Shorter TTL == less
chance for an attack
to take hold

Benefits of AWS Lambda
• No servers to manage
• Continuous scaling
• Never pay for idle: no cold servers
(only happy accountants)

What does Lambda do for you?
• Scales server capacity automatically
• API to trigger execution
• Ensures function is executed in parallel
and at scale
• Logging, monitoring, etc
• Easy pricing

Monitoring
• Everything is in CloudWatch or CloudWatch
Logs

Pricing
• Choose your RAM
from 128MB to
1500MB
• CPU and Network
scaled based on RAM

Cost Comparison
There are about 2.6M seconds in a month, so 3M requests is about 1.2 per second
A t2.small is $18.98 a month, already more than Lambda
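The arithmetic behind this comparison can be sketched as follows. The two prices are assumptions (commonly quoted Lambda list prices, with the free tier ignored) and the 200 ms, 512 MB workload is hypothetical; none of these figures comes from the slides, so check current pricing before relying on the result.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 in a 30-day month

# Assumed list prices, free tier excluded
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.00001667


def lambda_monthly_cost(requests: int, avg_duration_s: float, memory_mb: int) -> float:
    """Estimate a monthly bill: request charge plus compute (GB-second) charge."""
    request_cost = requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = requests * avg_duration_s * (memory_mb / 1024)
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND


requests = 3_000_000
print(f"{requests / SECONDS_PER_MONTH:.2f} requests per second")
# Hypothetical workload: 200 ms per request at 512 MB
print(f"${lambda_monthly_cost(requests, 0.2, 512):.2f} per month on Lambda")
```

The compute charge scales with both duration and memory, so a slow or memory-hungry function can change the comparison.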

Lambda lets you manage 
your code and infrastructure 
in the same place

Lambda lets your developers manage 
their code and your infrastructure 
in the same place

All the problems you have with
microservices are multiplied 10X
with serverless

Problems with Serverless
• Efficient dependency usage
• Local dev environments
• Making sure everyone has the same
dependencies
• Knowing when someone else is
deploying the same function
Testing
• You can’t test the network, but
a good application test should
obviate the need to do so.
• Testing is not really a solved problem,
but you can test locally.
• You can also send JSON to the
function and compare the
results.
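Sending JSON to a function and comparing the results can be done locally by calling the handler directly. A minimal sketch: the `handler` below is a hypothetical function written for this example, and `check` is an illustrative helper, not part of any AWS tooling.

```python
import json


def handler(event, context):
    """Hypothetical Lambda-style handler: sums the numbers in the event."""
    numbers = event.get("numbers", [])
    return {"statusCode": 200, "body": json.dumps({"total": sum(numbers)})}


def check(event_json: str, expected_json: str) -> bool:
    """Invoke the handler with a JSON event and compare its JSON body to the expectation."""
    result = handler(json.loads(event_json), context=None)
    return json.loads(result["body"]) == json.loads(expected_json)


print(check('{"numbers": [1, 2, 3]}', '{"total": 6}'))
print(check('{}', '{"total": 0}'))
```

Comparing parsed JSON rather than raw strings keeps the test independent of key order and whitespace.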

Tips and Tricks
• Limit your function size
(JVM startup time
especially)
• Remember execution is
async
• Don’t assume function
container reuse but
take advantage of it

Tips and Tricks
• Remember the 500MB in /tmp
• Use function aliases
• Use the included logger

Tips and Tricks
• Set up alarms on all
Lambda CloudWatch
metrics
• Avoid throttling by putting
SNS between Lambda and any
triggering service, such as S3
• Beware of infinite loops caused by
functions calling each other.

Avoiding Infinite Loops
• With a distributed team, this
is an easy mistake to make
• To avoid it, pass a call stack
and check for self in the
stack
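Passing a call stack and checking for self might look like the sketch below. The `call_stack` event field, the `with_call_stack` helper, and the function names are all hypothetical; the idea is only that each function appends its own name before invoking the next one and refuses to run if the name is already there.

```python
class LoopDetected(Exception):
    """Raised when a function finds its own name in the incoming call stack."""


def with_call_stack(function_name: str, event: dict) -> dict:
    """Return a copy of the event with this function appended to its call stack."""
    stack = list(event.get("call_stack", []))
    if function_name in stack:
        raise LoopDetected(" -> ".join(stack + [function_name]))
    return {**event, "call_stack": stack + [function_name]}


# Hypothetical chain of functions: resize -> notify -> resize
event = with_call_stack("resize", {"key": "photo.jpg"})
event = with_call_stack("notify", event)
try:
    with_call_stack("resize", event)
except LoopDetected as err:
    print(f"refusing to run: {err}")
```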

So where does that leave
us?
Serverless or containers?
Services or monorepo?

Monitoring and Alerting
Actionable Metrics


Choosing a metric

Self Serve is the Key
• Let developers choose what
metrics to submit
• What graphs they put on
their dashboards
• What to alert on
• They are closest to the app,
so they know best

Alert on increase of
failure, not lack of success
👍 Increase in 500s
👎 Decrease in 200s
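One reading of this rule is that a drop in 200s may only mean less traffic, while a rise in 500s always means something is failing. The sketch below assumes a window of recent status codes and a known baseline error rate; the function name, the threshold factor, and the sample data are hypothetical.

```python
def should_alert(status_codes, baseline_error_rate, factor=2.0):
    """Alert when the share of 5xx responses exceeds the baseline by the given factor."""
    if not status_codes:
        return False  # no traffic means no successes, but also no failures
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes) > baseline_error_rate * factor


quiet_night = [200] * 10               # few 200s, no failures
bad_deploy = [200] * 90 + [500] * 10   # plenty of 200s, 10% failures

print(should_alert(quiet_night, baseline_error_rate=0.01))
print(should_alert(bad_deploy, baseline_error_rate=0.01))
```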

P50, P90, P99

P50, P90, P99 (chart: the three percentiles plotted per minute from 1m to 15m, y-axis 0 to 60)
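P50, P90, and P99 are the values below which 50%, 90%, and 99% of samples fall. A minimal sketch using the nearest-rank method, which is one of several percentile definitions; the latency samples are made up for illustration.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p/100 * n
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds
latencies_ms = [12, 14, 15, 15, 16, 18, 20, 22, 95, 400]

for p in (50, 90, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

In this sample the median looks healthy while the tail does not, which is why the three are usually graphed together.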

Immutable Data
• If you can, write your software
such that everything in the
cache is immutable.

Moving data is
the single
biggest cost your
distributed
system will incur

But you need to move data for reliability,
so it’s a tradeoff
Use queues as often
as possible

Which is greater?  
Queues or Sliced Bread?

Queuing
• Queue anything you are
writing to a data store
• Monitor your queue
lengths for great insight
and scaling!

Queue Depth (chart: items in the queue, per second, over 20 seconds)

Cumulative Flow Diagram (chart: cumulative arrivals and departures, in items, over 20 seconds)
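The two charts are related: queue depth at any moment is cumulative arrivals minus cumulative departures. A small sketch with made-up per-second counts:

```python
from itertools import accumulate


def queue_depth(arrivals, departures):
    """Queue depth per interval: cumulative arrivals minus cumulative departures."""
    return [a - d for a, d in zip(accumulate(arrivals), accumulate(departures))]


# Hypothetical items arriving and departing in each second
arrivals_per_s = [3, 4, 5, 2, 1, 0]
departures_per_s = [2, 2, 3, 3, 3, 2]

print(queue_depth(arrivals_per_s, departures_per_s))
```

A depth that keeps growing means arrivals are outpacing departures, which is the signal to add consumers.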

Capacity utilization increases
queues exponentially
• Every time you reduce the excess capacity
by 1/2, you double the average queue size.
• This has a direct effect on the ratio of wait
time to work time for a single work unit
• Use this to balance cost vs. latency
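The halving rule can be illustrated with the textbook single-server (M/M/1) queue, where the average number of items in the system is utilization / (1 - utilization). Using that model is an assumption made for this sketch; the slides do not say which queueing model the claim is based on.

```python
def mean_items_in_system(utilization: float) -> float:
    """Average items in an M/M/1 system at the given utilization (0 <= utilization < 1)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)


# Each step halves the excess capacity: 50%, 25%, 12.5%, 6.25%
for utilization in (0.5, 0.75, 0.875, 0.9375):
    excess = 1 - utilization
    print(f"excess {excess:.2%}: about {mean_items_in_system(utilization):.0f} items")
```

Under this model each halving of excess capacity roughly doubles the average, so the last few percent of utilization are the most expensive in latency.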

The price of variability
• Variability increases
queue sizes linearly
• Operating at high
utilization increases
variability

The price of
variability
Fast Medium Slow

Chaos Engineering
• Simulate things
that go wrong
• Find things that
are different

Two most important
things to test
Instance Loss
Increased Latency
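Both failure modes can be simulated in a test environment by wrapping calls to a dependency. This is a sketch only: `with_chaos` and its parameters are hypothetical and are not the API of any chaos engineering tool.

```python
import random
import time


def with_chaos(func, latency_s=0.0, failure_rate=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a call so that it is delayed, and sometimes fails, like a slow or lost instance."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected instance loss")
        sleep(latency_s)  # injected latency
        return func(*args, **kwargs)
    return wrapper


# Hypothetical dependency call that normally returns immediately
flaky = with_chaos(lambda: "ok", latency_s=0.01, failure_rate=0.3)
for _ in range(5):
    try:
        print(flaky())
    except ConnectionError as err:
        print(f"caller must handle: {err}")
```

Passing `rng` and `sleep` as parameters keeps the wrapper deterministic under test.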

Incident Reviews
Ask the key questions:
• What went wrong?
• How could we have detected it
sooner?
• How could we have prevented it?
• How can we prevent this class of
problem in the future?
• How can we improve our behavior
for next time?

Takeaways
• Infrastructure as Code
• Microservices/Serverless/
Monolith
• Queuing Theory
• Chaos Engineering
• Logs Suck
• Incident reviews

Questions?
Email: {anything}@jedberg.net
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
Company: minops.com
