Keeping an application running at scale can be a daunting task. When do you need to add more capacity? Larger databases? Additional servers? These questions get harder as the complexity of your application grows. Microservice-based architectures and cloud-based dynamic infrastructures can help you keep your application highly available, even during times of extreme scaling. We will discuss some of the best practices we’ve learned working with New Relic customers on managing applications running at scale, and how technologies such as microservices and dynamic infrastructure can help you with this challenge.
As presented by Lee Atchison, Senior Director, Strategic Architecture of New Relic at Amazon Web Services Summit, Sydney on April 6, 2017.
2. Who am I?
30 years in industry
5 in New Relic
(Architect Lead, Cloud, Service Migration)
7 in Amazon Retail & AWS
(Built First AppStore, AWS Elastic Beanstalk)
Specializing in:
Cloud computing
Services & Microservices
Scalability, Availability
leeatchison@leeatchison
Senior Director Strategic Architecture
9. Keeping Your App Running…At Scale
Availability…
…is more than
you think it is.
10. Does this sound like something you’ve heard recently…
…overheard Ops conversation...
11. The conversation…
“We were wondering how changing a
setting on
our MySQL database might impact
our performance…
12. The conversation…
“We were wondering how changing a
setting on
our MySQL database might impact
our performance…
… but we were worried
that the change may
cause our production
database to fail…”
13. The “scary” overheard conversation…
“… Since we didn’t want to
bring down production,
we decided to make the
change to our backup (replica)
database instead…
… but we were worried
that the change may
cause our production
database to fail…”
14. The “scary” overheard conversation…
“… Since we didn’t want to
bring down production,
we decided to make the
change to our backup (replica, hot
standby) database instead…
… After all, it wasn’t
being used for anything
at the moment.”
15. The “scary” overheard conversation…
Until, of course, the backup was
needed…
16. The “scary” overheard conversation…
Until, of course, the backup was
needed…
This was a true story
25. Need Data at Every Level
[Diagram: Amazon EC2 instance stack. Layers: Browser / Mobile; Application & Application Microservices; Server OS; Server (Virtual); Hardware]
Typical Server / Amazon EC2 Instance
• Application & Application Microservices
• Server OS
• Hardware (virtual)
26. Low Level Monitoring
[Diagram: Amazon EC2 instance stack. Layers: Browser / Mobile; Application & Application Microservices; Server OS; Server (Virtual); Hardware. Labels: Amazon CloudWatch; AWS Console]
Amazon CloudWatch monitors:
• EC2 instance
• Virtualization
• Hardware
• [CPU / Disk / Networking]
Doesn’t know about:
• Server OS
• Memory / Filesystem
• Processes
• Configuration
• Application
- Latency
- Error rates
27. Infrastructure / Application Monitoring
[Diagram: Amazon EC2 instance stack. Layers: Browser / Mobile; Application & Application Microservices; Server OS; Server (Virtual); Hardware. Labels: New Relic Application Monitoring; New Relic Infrastructure Monitoring; Dashboards; Amazon CloudWatch; AWS Console]
Monitors (Server):
• How O.S. is performing
• Configuration Changes
• Processes
• Hardware
Monitors (Application):
• App health
• App performance
• Microservices
Doesn’t know:
• Virtualization
28. Full Stack Monitoring
[Diagram: Amazon EC2 instance stack. Layers: Browser / Mobile; Application & Application Microservices; Server OS; Server (Virtual); Hardware. Labels: New Relic Application Monitoring; New Relic Infrastructure Monitoring; Amazon CloudWatch; AWS Console; Integrations; New Relic monitors; CloudWatch monitors; Dashboards]
AWS / CloudWatch
• Visibility into virtualization
• CPU / Disk / Networking
• 14 AWS Services
APM
• CPU / Disk / Networking
• Memory / Filesystem
• Processes
- Infrastructure components
- Configuration inventory
• Application / Microservices:
- Latency
- Error rates
- App insights
35. Cloud as a “Better Data Center”
Resources are allocated to
uses, just like in a data
center
Provisioning process
is faster
Lifetime of components is
relatively long
Capacity planning is
still important and
still applies
36. Why use a “Better Data Center”?
Add new Capacity
(faster)
Improve Application Availability
(redundancy)
Compliance
38. Cloud as a “Dynamic Tool for Dynamic Apps”
Use Only the Resources
you need
Allocate / de-allocate
resources on the fly
Resource allocation is an
integral part of your
application architecture
39. Dynamic Cloud
Resources are:
• Allocated
• Consumed
• De-allocated
Application in charge:
• Application is aware of and is controlling traditional Ops resources
41. Dynamic Usage Example…
[Chart: Docker Container Age (by Minute and Hour). X-axis: container age (minutes). Y-axis tick: 1,200,000]
11% under one minute
42. Dynamic Cloud Technologies
Dynamic Cloud is about scaling and availability
EC2 Auto Scaling
Mobile / IoT Dynamic routing
Load balancing
Queues and notifications
Docker
43. Dynamic Cloud Enables Better Applications Faster
Traditional Data Center: Good
Cloud Data Center: Better
Dynamic Cloud: Best
The way you’ve done things in the past
won’t work in the future.
44. Dynamic Cloud
Things happen faster because of…
• EC2: server running application/processes
• Docker: process running a command
• Lambda: function performing a task or operation
45. Microcomputing & AWS Lambda
• Highly dynamic
• Incredibly scalable
• No infrastructure to provision
• Massively shared infrastructure
Also known as:
• Functions as a Service (FaaS)
• Compute as a Service (CaaS)
• Serverless
51. Dynamic Cloud has unique monitoring requirements…
How do I track what the dynamic cloud is
doing for me (or to me)?
52. What is a Dynamic Cloud Application?
• Application & Application Microservices
Responsible for the parts you care about
• Infrastructure
• Allocation/Provisioning
• Scaling
Let cloud manage rest
[Diagram: dynamic application stack. Layers: Browser / Mobile; multiple Application & Application Microservices blocks; Provisioning; Server OS; Server (Virtual); Hardware]
53. Monitoring Dynamic Cloud Applications
[Diagram: dynamic application stack. Layers: Browser / Mobile; multiple Application & Application Microservices blocks; Provisioning; Server OS; Server (Virtual); Hardware. Labels: CloudWatch; AWS Console]
54. [Diagram: dynamic application stack. Layers: Browser / Mobile; multiple Application & Application Microservices blocks; Provisioning; Server OS; Server (Virtual); Hardware. Labels: AWS Infrastructure; Application Performance; CloudWatch; AWS Console; New Relic Application Monitoring; New Relic Infrastructure Monitoring; Dashboards; Integrations]
55. [Diagram: dynamic application stack. Layers: Browser / Mobile; multiple Application & Application Microservices blocks; Provisioning; Server OS; Server (Virtual); Hardware. Labels: AWS Infrastructure; Application Performance; CloudWatch; AWS Console; New Relic Application Monitoring; New Relic Infrastructure Monitoring; Dashboards; Integrations; New Relic monitors; CloudWatch & AWS monitors]
56. How do you monitor this?
[Diagram: dynamic application stack. Layers: Browser / Mobile; multiple Application & Application Microservices blocks; Provisioning; Server OS; Server (Virtual); Hardware]
57. Where did it go? It was just here!!
The thing you monitored 10 minutes ago…
...doesn’t exist anymore!?
58. Monitoring the Dynamic Cloud
Monitor the Cloud Components
themselves
Monitor the lifecycle of the Cloud
Components
Very different than monitoring traditional Data Center components
61. Changing World
Dev
Now - DYNAMIC World
Ops
• We know:
• Change is inevitable
• We must:
• Embrace and drive change
• Enabling:
• Quicker growth
• More reliable growth
62. Keeping Your App Running…At Scale
Dynamic
Cloud…
...make availability
happen.
Migration…
...how do I get my app
to the cloud?
67. Enterprise IT Cloud Adoption Strategy
Experiment
Non-invasive, safe technologies
- S3
- Perhaps: CloudFront, SQS, SES
Stay away from EC2/Servers
Security: Easy as one-offs
No “Policies” implemented yet
“Just seeing what this is all about”
Progressions in Cloud
Adoption
What is this cloud thing?
69. Progressions in Cloud
Adoption
Enterprise IT Cloud Adoption Strategy
Secure the Cloud
IAM (Credentials)
VPC (Secure network)
AWS Direct Connect (just another data center)
Cloud policies begin to be formed
All parts of the company are now involved
Critical evolution point
Can we trust the cloud?
71. Progressions in Cloud
Adoption
Enterprise IT Cloud Adoption Strategy
Enable Servers, Enable SaaS
EC2
- Basic “data center migration”
- Just another server type available…
Multiple AZs/Regions
- Part of multi-datacenter resiliency strategy
Independently: SaaS usage increases
- Non-critical or internal uses first
The cloud seems to work pretty well…
77. Progressions in Cloud
Adoption
Enterprise IT Cloud Adoption Strategy
Mandate Cloud Usage
Cloud as a data center replacement
Company is now “all in” with cloud
Netflix…
Why do we need our own data centers?
78. What is the cloud?
Can we trust the cloud?
The cloud works pretty well…
Dynamic Cloud becomes a thing…
Dynamic Cloud is deeply ingrained…
Why do we need our own data centers?
Progressions in Cloud Adoption
The steps aren’t easy…
79. Experiment
Secure the Cloud
Enable Servers, Enable SaaS
Enable Value-Added Services
Enable Unique Services
Mandate Cloud Usage
Progressions in Cloud Adoption
Different Companies
Different Speed
Different Needs
86. Adoption Success Strategies
Understand
where your
culture is
Consciously plan
your acceptance
Drive your cultural
change to your
desired level
Monitor
your adoption
Understand
your needs
87. Monitor Your Adoption
Before Migration
Baseline application
(servers, databases,
caches, applications,
microservices)
Determine your steady
state
88. Monitor Your Adoption
During Migration
Incorporate cloud’s
internal monitoring
Continue
application
monitoring
Understand and solve all deviations from steady state…
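The baseline-then-compare idea on these two slides can be sketched in a few lines of Python. This is a minimal illustration, not any tool's actual method: the function names, the sample values, and the three-standard-deviation tolerance are all assumptions made for the example.

```python
import statistics


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Summarize pre-migration measurements as (mean, standard deviation)."""
    if not samples:
        raise ValueError("need at least one baseline sample")
    return statistics.mean(samples), statistics.pstdev(samples)


def deviates(value: float, baseline: tuple[float, float], tolerance: float = 3.0) -> bool:
    """True when a reading is more than `tolerance` standard deviations from the baseline mean."""
    mean, stdev = baseline
    return abs(value - mean) > tolerance * stdev


# Hypothetical response times (ms) measured before the migration.
steady_state = build_baseline([100.0, 102.0, 98.0, 101.0, 99.0])

if __name__ == "__main__":
    for reading in (101.0, 130.0):
        status = "deviation" if deviates(reading, steady_state) else "steady"
        print(f"{reading} ms: {status}")
```

Any reading flagged during the migration is a candidate for the "understand and solve" step; the tolerance would need tuning per metric.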
89. The Biggest Role Monitoring Plays In Migration
Performance Post Migration
& During Optimization
Pre-migration Feasibility & Benchmarking
90. Continue Monitoring…
Infrastructure is
now out of your
control
Some cloud
specific concerns (EC2
instance failures, instance
degradation)
Dynamic Technologies
Impact Our Applications
Understand
application
impact
Ongoing
application &
infrastructure
monitoring is
essential
Monitor Your Adoption
91. Fairfax Media Limited is a leading multi-platform media
company in Australasia, reaching 10.6 million
Australians and 2.9 million New Zealanders.
Media/Entertainment
“Because we monitored our on-premises systems with New Relic
before we migrated them to Amazon Web Services, we were
able to identify potential issues and fix them during the
migration process.”
- Cheesun Choong
Head of Product Platforms
Results
Reduced
diagnosis time
from hours to
minutes
Migrated to AWS
with confidence
Identified
underutilized servers
to save money
92. Keeping Your App Running…At Scale
Dynamic
Cloud…
...make availability
happen.
Migration…
...how do I get my app
to the cloud?
Availability…
…is more than
you think it is.
Monitor your application and infrastructure
93. Monitoring just the server
[Diagram: EC2 instance stack. Layers: Application & Application Microservices; Server OS; Server (Virtual); Hardware. Labels: CloudWatch; AWS Console]
Worked when rate of change was low…
95. Full Stack Monitoring
[Diagram: dynamic application stack. Layers: Browser / Mobile; multiple Application & Application Microservices blocks; Provisioning; Server OS; Server (Virtual); Hardware. Labels: New Relic Application Monitoring; New Relic Infrastructure Monitoring; Dashboards]
You need:
• Top to bottom monitoring…
• Full stack accountability...
• Dynamic infrastructure control...
96. Digital Fan Experience for Major League Baseball
New Relic empowers our developers
to experiment and work fast without
compromising on the quality of the
MLB fan experience.
– Sean Curtis
Senior Vice President of Engineering
99. Change is speeding up
Traditional Data Center: Good
Cloud Data Center: Better
Dynamic Cloud: Best
Dynamic Cloud enables better applications faster.
The way you’ve done things in the past
won’t work in the future.
100. Full Stack Monitoring
[Diagram: dynamic application stack. Layers: Browser / Mobile; multiple Application & Application Microservices blocks; Provisioning; Server OS; Server (Virtual); Hardware. Labels: New Relic Application Monitoring; New Relic Infrastructure Monitoring; Dashboards]
101. Thank you
Lee Atchison ∙ Senior Director Strategic Architecture
New Relic
Architecting for Scale
By: Lee Atchison
Published by: O’Reilly Media
www.architectingforscale.com
leeatchison@leeatchison
Editor's notes
Dynamic Infrastructure and the Cloud: Adventures in Keeping Your Application Running…at Scale
AWS Summit - Sydney, Australia
Lee Atchison ∙ Senior Director Strategic Architecture at New Relic, Inc.
I’d like to tell you a story. Does this story sound familiar to you?
It’s Sunday.
The day of the big game.
You’ve invited 20 of your closest friends over to watch the game on your new 300” ultra max TV.
Everyone has come, your house is full of snacks and beer. Everyone is laughing. The game is about to start.
And…
…the lights go out……the TV goes dark……the game, for you and your friends, is over.
You are obviously disappointed. What happened?
You decide to pick up the phone and call the local power company.
The representative, unsympathetically, says: “We’re sorry, but we only guarantee 95% availability of our power grid.”
They could not understand why you were complaining. After all, you had power “most of the time”.
Why is availability important?
* Because your customers expect your service to work…all the time.
* Anything less than 100% availability can be catastrophic to your business.
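To put the power company's 95% figure in perspective, here is a small sketch that converts an availability percentage into hours of downtime per year. The function name and the list of percentages are illustrative choices, and the year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return how many hours per year a service may be down at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (95.0, 99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% available: about {hours:.2f} hours down per year")
```

At 95%, the allowance works out to roughly 438 hours a year, a little over 18 days without power.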
A hope and a prayer…
Laugh at it, but more people do this than you might expect.
Keeping your application running is possible. I will discuss three points to making it happen.
{c} First, availability…is more than you think it is...
I want to tell you about an overheard Ops conversation. I want you to tell me if this sounds like something you’ve heard yourself in your Ops organizations…
We were wondering how changing a setting on our MySQL database might impact our performance…
… but we were worried that the change may cause our production database to fail…
… Since we didn’t want to bring down production, we decided to make the change to our backup (replica) database instead…
… After all, it wasn’t being used for anything at the moment.
Until…of course...the backup was needed...
Does this story sound familiar? This exact story is a true story, and unfortunately is not uncommon.
Availability issues such as I described here may seem obvious…but many are much more subtle. For example...
Imagine we are an e-commerce website. We’ve got a mobile app that can purchase items in our shop. {C} Bob uses his phone, buys something, and it takes 300ms. That’s great! {C} Sally logs in, buys something, but the database is slow. It takes much longer. She is not a happy customer.
Availability is not just whether a page responds, but how long it takes to respond.
The customer doesn’t care why a problem occurred, and they don’t care why your app is slow. If it doesn’t meet their expectations at the time they expect, nothing else matters…
But keeping your application available can be tough. It may be fuzzy. Performance may be good for some users, and bad for others. But, can you even detect this, or do you just show that, on average, your site is doing fine?
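The gap between "fine on average" and "bad for some users" is easy to see with numbers. The sketch below uses made-up response times (18 fast requests like Bob's, 2 slow ones like Sally's) and a simple nearest-rank percentile; the data and the helper name are assumptions for illustration only.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: most purchases are fast,
# but a few hit the slow database.
response_times_ms = [300.0] * 18 + [4000.0] * 2

mean_ms = statistics.mean(response_times_ms)
median_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)

if __name__ == "__main__":
    print(f"mean={mean_ms} ms, median={median_ms} ms, p95={p95_ms} ms")
```

With this data the mean is 670 ms, which looks tolerable, while the 95th percentile is 4,000 ms. Only the per-transaction view shows that one customer in ten is waiting four seconds.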
The real answer to how your application is doing is not a hope and a wish. It’s in the details. It’s in the data.
Modern application monitoring can’t be done by simply looking from the outside in. It can’t be done with averaged or sampled data. You must collect data from all areas of your application, and from all transactions. You must collect tons and tons of data.
---
In fact, you typically need to collect more monitoring data than data that is within your application. And it grows continuously, every day, every second. Everything that anyone does on your application, generates performance data.
If anybody is using your application, you must collect data about exactly how they are using it and how the infrastructure behind it works together. All of it is important.
All parts of your application, from your servers through your apps to the business outcomes they represent, {C} all generate data that you must analyze together.
So, you expect your site to be up. And when it is down, what do you do? Do you look to attach blame? No, you want to find the problem.
You want to know what happened.
To know what happened, we need data. We need data from every level of our application. Here is a typical, simple web application. It consists of an application and some services. These run on servers with an operating system, which in turn run on virtual hardware. Parts of the application may also run in our customers’ browsers or in their mobile applications.
Often people think that all they need is low-level virtual hardware monitoring. They monitor their instances using tools like CloudWatch. But CloudWatch provides a very limited view of the world. You get virtual hardware level information, but that’s about it. You don’t even get information about the operating system, memory, processes, or system configuration. And you get absolutely no information about your application.
To know how your application is really performing, you need an application performance monitoring tool. You also need to know how the rest of your infrastructure is running (the operating system, for instance). And you need to know how your remote applications, such as those running on mobile devices or in your customers’ browsers, are running.
To monitor the application, you need full stack performance monitoring.
Because if you don’t monitor the data you need at the time you need it, you’ll:
1) Waste time firefighting, 2) Encourage meaningless finger-pointing across teams, 3) Lose money, 4) Make customers unhappy, 5) Have unhappy customers tell other people…
You also need the right data. You need to know how your application is performing, to answer questions as simple as, “Am I actually open for business?”. But you also want to know how easy it is for your customers to make use of your application. What is their experience? And you need to know how your business is doing.
You need to monitor the right components…and you need to monitor the right data.
Success involves all three types of analytics. Is the software working? Is it meeting the customer’s needs? Is it meeting your business needs? All of these three things are interconnected.
Because, avoiding this is critical to every business.
Point 2: there are technologies that can help you keep your application running…technologies such as the dynamic cloud. What do I mean? Let’s take a look.
How can the cloud help? Well, it turns out that there are two fundamental ways people make use of the cloud. The first is to use the cloud as a “Better Data Center”. The second is to use the “Dynamic Nature” of the cloud to build better apps faster. I’m going to talk about each of these methods.
Let’s first look at using the cloud as a “Better Data Center”.
What do I mean by using the cloud as a “Better Data Center”? I mean:
* Resources are allocated to uses, just like in a regular data center <click>
* The provisioning process for new resources, though, is significantly faster <click>
* The lifetime of the resources you create is relatively long…usually measured in days, weeks, months, or years. <click>
* However, even with a faster provisioning process, traditional “capacity planning” is still important and still applies.
Why would we want to use the cloud simply as a “better data center”? What are the benefits to us in building applications? Since we can add new capacity faster, we can build and scale our applications more easily in the cloud. In addition to adding servers more easily and quickly, we can add entire new data centers more easily, which can improve our application availability and redundancy. Additionally, this ability to add data centers can improve our compliance, especially when it comes to things like EU Safe Harbor laws.
So, now, let’s switch to talking about using the cloud in a dynamic environment.
What do I mean by using the cloud as a “dynamic tool for dynamic applications”? I mean:
* Use only the resources you need <click>
* Allocate and deallocate resources on the fly <click>
* Resource allocation becomes an integral part of your application architecture.
In a dynamic application, resources are allocated, consumed, and deallocated on the fly. And the application is aware of and is controlling this management of resources. The application is essentially performing traditional Ops resource management tasks.
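As a rough sketch of what "the application is in charge" can look like, the snippet below has the application decide how many workers it wants based on its own queue depth. Everything here is hypothetical: the function names, the one-worker minimum, and the twenty-worker cap are assumptions, and a real application would pass the result to its cloud provider's scaling API.

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Return how many workers the application should be running right now."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queue_depth / jobs_per_worker)
    # Clamp so the application never scales to zero or past its budget.
    return max(min_workers, min(max_workers, needed))


def scaling_delta(current_workers: int, queue_depth: int, jobs_per_worker: int) -> int:
    """Positive result: allocate that many workers. Negative: de-allocate that many."""
    return desired_workers(queue_depth, jobs_per_worker) - current_workers


if __name__ == "__main__":
    print(scaling_delta(current_workers=4, queue_depth=95, jobs_per_worker=10))
    print(scaling_delta(current_workers=10, queue_depth=12, jobs_per_worker=10))
```

The point of the sketch is the direction of control: the allocation decision lives inside the application logic rather than in a capacity plan maintained by an operations team.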
New Relic did an analysis recently of how our customers are making use of Docker. The question we wanted to answer was: how long do Docker containers live? This diagram shows the answer to that question. The horizontal axis is the number of hours a Docker container has lived, and the vertical axis is the number of containers in that time bucket. As you can see, there is a long tail, with some Docker containers running for well over a year. However, there is a huge number of Docker containers that run for less than one hour. In fact, if we zoom in on just that one-hour time period…
we can see that a large number of the Docker containers we run actually run for less than one minute! Over 11% of all Docker containers we run will run for less than 60 seconds.
This is some customer’s application or service, some business logic, that starts up, runs, and shuts down all within 60 seconds. This is very rapid. These are containers that are launched only for a specific business purpose and are terminated when that purpose is completed. This is what we mean by dynamic infrastructure.
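The kind of analysis described here can be approximated with a short script. This is not New Relic's actual analysis; the lifetimes below are invented sample data and the function names are assumptions. It buckets container lifetimes by whole minutes and computes the share that lived under 60 seconds.

```python
from collections import Counter


def share_under(lifetimes_seconds: list[float], threshold_seconds: float = 60.0) -> float:
    """Fraction of containers whose lifetime was shorter than the threshold."""
    if not lifetimes_seconds:
        raise ValueError("no lifetimes given")
    short = sum(1 for s in lifetimes_seconds if s < threshold_seconds)
    return short / len(lifetimes_seconds)


def bucket_by_minute(lifetimes_seconds: list[float]) -> Counter:
    """Count containers per whole minute of age (bucket 0 means under one minute)."""
    return Counter(int(s // 60) for s in lifetimes_seconds)


# Hypothetical container lifetimes in seconds.
lifetimes = [12, 45, 59, 61, 300, 3600, 86400, 90, 20, 7200]

if __name__ == "__main__":
    print(f"{share_under(lifetimes):.0%} of containers lived under one minute")
    print(bucket_by_minute(lifetimes).most_common(3))
```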
And there are lots of different cloud technologies that can be used in this dynamic manner…from queues to routing to auto scaled EC2 instances. Many resources in the cloud can be used in this dynamic fashion.
The dynamic cloud allows you to build better applications, faster. The way you’ve done things in the past won’t work in the future.
Change happens faster in the cloud. This is because of dynamic servers, dynamic infrastructure, and, more recently, {c} the cloud is even more dynamic due to technologies such as AWS Lambda.
What is Lambda? Lambda is one of many technologies that implement what’s called “Functions as a Service” or “Compute as a Service”. You might also know it as “Serverless”, but that is a less accurate description. Lambda allows you to create microcomputing environments: highly dynamic and incredibly scalable functions that can be executed without provisioning any infrastructure whatsoever. They scale automatically on a massively shared infrastructure.
In a nutshell, AWS Lambda simply takes an event from some AWS resource. This is called the “trigger”. This event can be something like an object being updated in an S3 bucket…or a database update in DynamoDB, or a call to an API Gateway. Some sort of event within the AWS ecosystem.
Lambda takes that event and creates an instance of a Lambda function, on the fly, that can process that event.
The processing is usually a very simple action...something like updating another object in S3, or responding to the API Gateway request...whatever action the Lambda function was designed to execute in response to that trigger.
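A minimal sketch of such a function in Python, assuming an S3 "object updated" trigger: the handler receives the event, reads the bucket name and object key from each record, and returns a summary. The handler name and the return shape are illustrative choices, and the actual processing step is left as a comment.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Entry point invoked once per trigger event (the name `handler` is an assumption)."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # The real work (for example, updating another object) would go here.
        processed.append(f"{bucket}/{key}")
    return {"processed": processed}


if __name__ == "__main__":
    # Hypothetical event, trimmed to the fields this sketch reads.
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "example-bucket"},
                    "object": {"key": "reports/my+file.txt"}}}
        ]
    }
    print(handler(sample_event, None))
```

Because the function holds no state between invocations, the platform is free to run as many copies as there are concurrent events.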
Any number of triggers can occur as fast as possible, and multiple instances of the Lambda function will automatically be created to handle all of the concurrent events, instantly scaling the function to as many instances as necessary to handle all events as quickly as possible. This automatic scaling is designed to be transparent to everyone, including the customer who created the function. This is the definition of near-infinite scaling.
Building dynamic infrastructures in the cloud allows you to {c} scale your applications better. {c} It also allows you to make changes to your application faster and easier. {c} Both of these ultimately result in higher availability…
But only if you know what your application is actually doing…
This brings up an interesting concern. In a dynamic cloud, you have dynamic resources. Resources that are coming and going fast. Instances are starting and stopping. Containers are coming and going. And functions are executing and terminating.
If resources are coming and going so fast, how can you monitor them? How do you monitor a dynamic application in a dynamic cloud?
Here is an example of a dynamic application. It looks much like the static application. It might have more services and microservices composing the application, which is typical of a more modern application.
We still have AWS CloudWatch monitoring the low level cloud infrastructure.
And we still have traditional application performance monitoring that monitors the static nature of the application components.
Overall, this provides **almost** top to bottom monitoring of the entire application.
But what about this piece? How do you monitor the provisioning process itself? Given that resources are coming and going regularly, how do you monitor that?
How do you monitor components that are there one moment, but less than 60 seconds later, they are gone?
<click>
Remember the docker information…
It turns out that monitoring a dynamic application in a dynamic cloud is very different than monitoring traditional data center components.
You must of course still monitor each of the cloud components themselves…each of the services and resources and components that make up your application.
{c}
But you also must monitor the lifecycle of the cloud components. This is because it matters not only **that** a resource was used, it matters **when** that resource was used. Because just looking at the resources running right now is inadequate when trying to diagnose a problem from even a few minutes ago. The resources that were in use when the problem occurred are **not** the same resources in use now.
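One way to picture lifecycle monitoring is as a record of when each resource started and stopped, which can then be queried for any past moment. The sketch below is a hypothetical data model, not a description of any monitoring product; the class, field, and function names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceLifetime:
    resource_id: str
    started_at: float                   # seconds since some reference time
    stopped_at: Optional[float] = None  # None means the resource is still running


def alive_at(lifetimes: list[ResourceLifetime], timestamp: float) -> list[str]:
    """Return the IDs of resources that were running at the given moment."""
    return [
        r.resource_id
        for r in lifetimes
        if r.started_at <= timestamp and (r.stopped_at is None or timestamp < r.stopped_at)
    ]


# Hypothetical lifecycle records for three containers.
history = [
    ResourceLifetime("container-1", started_at=0, stopped_at=50),
    ResourceLifetime("container-2", started_at=30, stopped_at=90),
    ResourceLifetime("container-3", started_at=600),
]

if __name__ == "__main__":
    print("running when the problem occurred (t=40):", alive_at(history, 40))
    print("running now (t=700):", alive_at(history, 700))
```

In this example the two containers that were running when the problem occurred are both gone by the time anyone looks, which is exactly why the start and stop times have to be recorded.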
So, in the old world, your operations team was comfortable. They knew the resources they controlled, they created them, they managed them. All was simple and manageable.
But in this new world, resources are created and destroyed dynamically. The world of the operations team can no longer be as simple as tracking resources on a spreadsheet. The resources they are responsible for are dynamic and transient. Their world has gotten a lot more complicated.
This change is inevitable. The change is needed because our customers are expecting more and more from our applications. The change is needed because our customers are expecting better and more reliable performance from our applications. The change is inevitable because to meet the needs of our customers, our organizations must grow quickly and build applications that are more reliable than ever before.
The cloud helps achieve this, and this is more and more the reason why moving to the cloud is so important for us.
The third point is getting to the cloud. Migrating to the cloud is easy, right?
How do we move to the cloud? Often, we start our migration to the cloud with lofty expectations. But we find out that moving to the cloud isn’t necessarily as easy as we would like it to be. Problems occur. The cloud doesn’t meet the expectations that have been promised to us. {c} There is pressure to declare “victory” before we are ready. {c} Promised performance gains are not occurring. Costs run out of control. {c} And schedules just don’t matter anymore. {c} How can we meet our promises to our stakeholders if we can’t get the cloud to do what we want it to do? Most companies moving to the cloud struggle with this. Some struggle more than others. Some fail to overcome the struggle.
But moving to the cloud does not have to be scary or dangerous. It can be done safely, but you must be willing to learn as you go. Learn and adapt the cloud to meet your company’s needs, and learn and adapt your expectations to the reality of what the cloud can offer.
Let’s take a look at how most enterprises figure out how to migrate to the cloud. There are six *typical* steps that most companies take to move to the cloud.
They don’t all use all the steps. Some stop part way up the path.
Some skip steps.
But this is typical…
Let’s look at each of these in turn.
Let’s start with “Experiment”.
This is the first, tentative step into the cloud. It involves using safe technologies. Technologies that we can use in simple and subtle ways in parts of our applications that may be less critical.
There are no cloud policies created. We just build one off implementations to see how the cloud can fit into our needs.
Most companies have at least started on this step.
After you’ve done some basic “feet wetting” in the cloud, security typically becomes a concern.
Critical evolution point in the company’s culture
…all disciplines in the company are involved (Legal, Finance, Security)
…companies that can’t get past this point, can’t be successful in the cloud
Once policies are in place and the cloud can be trusted…you start using other features the cloud has to offer.
Three choices:
...1) Put some workloads in the cloud, some in your own data center
...2) Resiliency - additional data center(s)
...3) Move applications to the cloud, out of existing data centers
Independently: SaaS usage increases (internal apps first)
Now the cloud is important to you, so you start to see what else the cloud can do for you.
“Managed Services”
Now, we start looking at cloud native services…services only available in the cloud.
Point of commitment…now dependent on the cloud
So now we are committed to the cloud…now comes the last step. Mandated use.
Mandate use of the cloud
>>>Typically wanting to get out of the data center business
Netflix, etc
The steps aren’t easy…
But ultimately, these are the steps involved.
Different companies go thru these steps at different speeds.
Different companies find the right “stopping point” that matches their needs
While these are the steps our *company* may go thru, our applications go thru a similar learning process as we build new applications and migrate existing ones…
How can a given application take advantage of the cloud?
This adoption may happen faster or slower for different types of applications.
Let’s take a look at these as two different axes on a chart.
Corporate adoption process on the left, application adoption process on the bottom
Another way to look at this: based on application types and requirements...
So we can see we are more likely to use the “newer” technologies, such as Lambda, in new applications. But we are much less willing to use these technologies in our more business critical applications.
There exists a sweet spot…
>Corporate adoption is strong, but not “mandated”
>Application adoption is strong, but not “committed”
*This is the destination for a lot of companies and applications
Very near some of the common, core AWS services
So, that’s all great data. I know I need to move to the cloud to keep my company moving forward. But what about the nuts and bolts? What should *I* do to be successful in moving to the cloud?
How can I make sure a cloud migration is successful?
Understand where your culture is
Risk tolerance, Cloud commitment, Expertise
Understand your needs
Redundancy? Cost? New Opportunity?
Consciously plan your acceptance
What level are you?
What level do you need to be?
Drive your culture to where you feel you need to be
Monitor your adoption
Before migration
Baseline application
Servers
Databases
Caches
Applications
Microservices
Determine your steady state
Important before you migrate!
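As a minimal sketch of what "determine your steady state" could look like in practice, the snippet below summarizes a set of pre-migration measurements into a baseline. The `Baseline` type, the `compute_baseline` helper, and the sample response times are all hypothetical and only illustrate the idea; in a real migration these numbers would come from your monitoring tool rather than a hard-coded list.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    """Summary of a metric's steady state before migration (hypothetical type)."""
    mean: float
    stdev: float
    p95: float


def compute_baseline(samples):
    """Summarize pre-migration samples of one metric into a baseline."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a baseline")
    ordered = sorted(samples)
    # Nearest-rank 95th percentile: the smallest value that covers 95% of samples.
    rank = math.ceil(0.95 * len(ordered))
    return Baseline(
        mean=statistics.fmean(ordered),
        stdev=statistics.stdev(ordered),
        p95=ordered[rank - 1],
    )


# Made-up response times (ms) for one application, collected before migration.
pre_migration_ms = [120, 118, 125, 130, 122, 119, 121, 127, 124, 123]
baseline = compute_baseline(pre_migration_ms)
print(baseline)
```

The same summary would be repeated for each layer listed above (servers, databases, caches, applications, microservices), so that there is a recorded "normal" to compare against later.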
During migration
Incorporate Cloud’s internal monitoring
…provides cloud specific infrastructure monitoring
…AWS CloudWatch
Continue application monitoring
*Here, looking for performance deviations from steady state
Track down & explain all deviations before moving on
Understand all deviations from norm
Solve problematic deviations/problems
Deviations in performance before and after migration give us a clue to migration-related issues
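The comparison against steady state described above can be sketched as a simple relative-change check. This is an illustration under stated assumptions, not a prescribed method: the `find_deviations` helper, the metric names, the sample values, and the 20% tolerance are all hypothetical.

```python
def find_deviations(baseline, current, tolerance=0.20):
    """Return metrics whose relative change from baseline exceeds the tolerance.

    baseline and current map metric name -> value. The result maps metric
    name -> fractional change (0.6 means 60% higher than baseline).
    """
    deviations = {}
    for name, base_value in baseline.items():
        if name not in current or base_value == 0:
            # Skip metrics we cannot compare; a real check might report these too.
            continue
        change = (current[name] - base_value) / base_value
        if abs(change) > tolerance:
            deviations[name] = change
    return deviations


# Made-up numbers: steady state before migration vs. measurements after it.
steady_state = {"response_ms": 120.0, "error_rate_pct": 0.5, "db_query_ms": 15.0}
after_migration = {"response_ms": 126.0, "error_rate_pct": 0.5, "db_query_ms": 24.0}

deviations = find_deviations(steady_state, after_migration)
for metric, change in deviations.items():
    print(f"{metric}: {change:+.0%} vs. steady state")
```

In this example only the database query time is flagged, which is the kind of deviation to track down and explain before moving on to the next migration step.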
Continue monitoring post migration
Should understand: The infrastructure is now out of your control…you need to keep an eye on it
Cloud infrastructure changes can impact your application…you need to keep an eye on it
There are some cloud specific concerns:
EC2 instance failures
Greater part of your availability plans
Often impacts other AWS systems as well
Instance degradation (more common than you’d think)
Ongoing application & infrastructure monitoring is essential
APM, Insights, Browser, Synthetics
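Instance degradation is harder to spot than an outright failure, because the instance is still up. One possible way to surface it, sketched here under assumptions, is to compare each instance against the rest of its fleet. The `find_degraded` helper, the instance IDs, the latency values, and the 1.5x factor are all hypothetical.

```python
import statistics


def find_degraded(latency_by_instance, factor=1.5):
    """Return instance IDs whose latency exceeds factor x the fleet median."""
    fleet_median = statistics.median(latency_by_instance.values())
    return sorted(
        instance_id
        for instance_id, latency in latency_by_instance.items()
        if latency > factor * fleet_median
    )


# Made-up average response times (ms) per instance in one fleet.
fleet = {"i-aaa": 101.0, "i-bbb": 98.0, "i-ccc": 240.0, "i-ddd": 105.0}
degraded = find_degraded(fleet)
print(degraded)
```

Comparing against the fleet median rather than a fixed threshold means the check keeps working as overall load rises and falls, which suits an environment where instances come and go.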
So, that’s the third point in keeping your application running at scale…successful cloud migration.
Together, these three points can keep your application highly available and running at scale.
And underlying all three is monitoring your application and your infrastructure.
It used to be, long ago, that all it took to make sure an application was running was to look at the server. Did the amount of CPU or memory utilization change recently? If it did, there might be a problem. Everything was static, everything was smooth. Everything was constant. A change indicated a problem.
But in this new world, resources are created and destroyed dynamically. The world of the operations team can no longer be as simple as tracking resources on a spreadsheet. The resources they are responsible for are dynamic and transient. Their world has gotten a lot more complicated.
In order to monitor your dynamic applications in the dynamic cloud, you must monitor all aspects of your application, top to bottom, using a full stack monitoring solution, a solution such as New Relic.
Dynamic applications require dynamic scaling and use of dynamic technologies.
(how many streams during each day?)
Our customers won’t stand by waiting for us to solve availability problems.
And panic is not the solution. Nor is blame.
The dynamic cloud has caused significant change to our world. Our world has sped up, and the rate of change in application development has increased. The cloud alone has sped things up, and the dynamic cloud has sped things up even more. The way you’ve done things in the past just won’t work in the future.
This is good…but it is also scary.