- The webinar will last 60 minutes with Q&A at the end. Questions should be asked via the chat panel and participants should keep their lines muted. The webinar will be recorded.
- John Gray from Datadog, Thomas Robinson from AWS, and Patrick Hannah from CloudHesive will present on monitoring tools and strategies across cloud infrastructure and the AWS Managed Service Provider program.
- Next-generation managed service providers need comprehensive monitoring across customers' infrastructure to quickly resolve issues, improve efficiency, and provide value. Tools like Datadog allow for unified monitoring across platforms and environments.
2. • The duration of this webinar is 60 minutes
• Q+A will take place at the end
• Ask questions via the "chat" panel
• Please keep yourself on mute
• This webinar will be recorded
Webinar Logistics
4. Built for modern infrastructure
Your infrastructure has changed, you need a different way to manage your stack
#MonitorAllTheThings
You need a single pane of glass for Operations and Development Teams
Made for org-wide adoption
You need monitoring to be easy, flexible, scalable - so that the entire department will use it
Why Datadog
5. Cloud-ready
Your infrastructure will change, you will need a different way to manage your new stack
Bridge dev and ops
You’ve always wanted a single pane of glass for Ops + Dev Teams; with the cloud, you’ll need it
Streamlining dev and deployment cycles
You need monitoring to be easy, flexible, scalable – so that the entire department will use it
Why Datadog
6. Infrastructure-wide visibility
Your customers’ servers, Your customers’ clouds, Your customers’ metrics, Your customers’ apps, Your Team. Together in one place.
Create custom KPIs
and composite metrics
Compare and correlate
metrics from multiple IT
components
Track events from
the systems in your
environment
7. 1. Quickly resolve your customers’ critical
issues and meet SLAs
2. Serve more customers efficiently with
monitoring automation
3. Start providing value to your customers in
minutes
Why Datadog for your managed services business
Provide deep insight into your customers’ next generation cloud-based infrastructures
1. Technical and sales onboarding training and
resources
2. Co-marketing activities including demand
generation, content creation, email
templates
3. Dedicated Partner Success Team focused
on partner success and growth
What we provide Program benefits
8. AWS MANAGED SERVICE PROVIDER (MSP)
PARTNER PROGRAM
THOMAS ROBINSON
SOLUTION ARCHITECT, MSP PROGRAM TECHNICAL LEAD
9. AWS PARTNER NETWORK
THE APN HAS ADDED 10,000+ OVER THE PAST 12
MONTHS
100% YoY
AWS Consulting
Partners
130% YoY
AWS Managed
Service Partners
On AWS
Marketplace
Growth Use APN partner
solutions & services
90%+
Fortune 100
60%
Partners
Headquartered
Outside U.S.
370M
EC2 Hours Per
Month
10. AWS PARTNER NETWORK
CONSULTING PARTNERS
Professional services firms that help customers of all sizes design, architect, migrate, or build new applications on AWS
TECHNOLOGY PARTNERS
Commercial software and Internet services companies that provide software solutions that are either hosted on, or integrated with AWS
12. A SHIFT HAS OCCURRED
New Approaches
New Ways to
Add Value
Customer
Engagement
DevOps &
Automation
Dynamic &
Agile
13. WHAT IS A NEXT GEN MSP?
Plan & Design
Build & Migrate
Run & Operate
Optimize
“I need help migrating, running, & optimizing my AWS workloads.”
14. AWS MANAGED SERVICE PROVIDER PROGRAM
The AWS MSP program provides qualified APN Consulting Partners who are skilled at cloud infrastructure and application migration, and deliver value to customers by offering proactive monitoring, automation, and management of their customer’s environment with business, marketing and enablement benefits.
AWS MSP PROGRAM
15. WHY BECOME AN AWS MSP PARTNER?
AWS MSP PROGRAM
• Gain access to a wide range of MSP-specific business, technical, and marketing benefits
• Position your firm as a next gen MSP and be promoted as an AWS Managed Service Partner on the AWS website
• Access to exclusive MSP Marketing Campaigns and Content
• Consultative 3rd Party Validation Audit
• Faster pace of growth (130% year-over-year compared to 111% for non-MSP APN Consulting Partner)
16. Service Desk & Customer Support
SLAs & Reporting
Security Management
DevOps & Automation
Billing & Cost Management
Process & Cost Optimization
Business Health & Management
Customer Obsession
Solution Design Capabilities
Infrastructure & Application Migration Capabilities
AWS MSP PROGRAM
17. WHAT IT MEANS TO BE A
NEXT-GENERATION MSP
PATRICK HANNAH
VP OF ENGINEERING, CLOUDHESIVE
18. • Who am I?
• What is my background?
• What do I hope to get out of this presentation?
• How am I using AWS?
• What do I love about AWS?
Who am I?
19. Professional Services
• Assessment (Current environment, datacenter or cloud
footprint)
• Strategy (Getting to the future state)
• Migration (Environment-to-cloud, Datacenter-to-cloud)
• Implementation (Point solutions)
• Support (Break/fix and ongoing enhancement)
DevOps Services
• Assessment
• Strategy
• Implementation (Point solutions)
• Management (Supporting infrastructure, solutions or
ongoing enhancement)
• Support (Break/fix and ongoing enhancement)
Who is CloudHesive
Managed Security Services (SecOps)
• Encryption as a Service (EaaS) – encryption at rest and in
flight
• End Point Security as a Service
• Threat Management
• SOC 2 Type II Validated
Next Generation Managed Services
• Leveraging our Professional, DevOps and Managed
Security Services
• Single payer billing
• Intelligent operations and automation
• AWS Audited
20. Problem Statement:
I need to be able to (monitor|get) information about my
“things”.
What’s important?
What are my things?
• Platforms
• Environments
• Systems
• Servers
• Services
• Applications
• Literal Things
What characteristics of my things do I care about?
• Is it up/down?
• Have I crossed some sort of arbitrary threshold?
• Is there an interesting event or lack thereof?
• Is there a certain quantity of either?
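The four characteristics above map fairly directly onto four kinds of check. The following is a minimal sketch of that mapping, not how any particular tool implements it; every function name, default, and time window here is a hypothetical chosen for illustration.

```python
from typing import Sequence


def is_down(seconds_since_heartbeat: float, max_silence_s: float = 120.0) -> bool:
    """Up/down: treat a thing as down once its heartbeat has been silent too long."""
    return seconds_since_heartbeat > max_silence_s


def crossed_threshold(value: float, threshold: float) -> bool:
    """Threshold: has a metric gone above an (arbitrary) fixed limit?"""
    return value > threshold


def events_in_window(event_times: Sequence[float], now: float, window_s: float) -> int:
    """Count events whose timestamps fall inside the trailing window."""
    return sum(1 for t in event_times if now - window_s <= t <= now)


def event_missing(event_times: Sequence[float], now: float, window_s: float) -> bool:
    """Lack of an event: nothing of interest arrived inside the window."""
    return events_in_window(event_times, now, window_s) == 0


def too_many_events(
    event_times: Sequence[float], now: float, window_s: float, max_count: int
) -> bool:
    """Quantity: more events than expected arrived inside the window."""
    return events_in_window(event_times, now, window_s) > max_count
```

For example, a nightly backup that logs nothing for 24 hours would trip `event_missing`, while a burst of failed logins would trip `too_many_events`.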
21. Different sources of data:
• AWS, CloudWatch
• AWS, CloudTrail
• AWS, Config
• Linux proc
• Linux syslog
• Windows WMI
• Windows Event Log
• Application Logs
• Third party tool logs (APM, Security, etc.)
How does that translate on AWS?
Different methods of alerting:
• E-Mail
• SMS
• Voice
• Push
Different methods of collecting:
• Native APIs
• Agents
22. • No trending
• No single pane of glass
• Redundant work
• Lost data
What’s the outcome of this approach?
23. Problem Statement:
I need CONTEXT about the alerts I get from my “things”.
What’s really important?
Why?
Things can carry different SLAs, depending on:
• Type of environment
• Where it sits in the lifecycle
• What it does (mission critical, back office)
• Type of customer (industry)
• Does it heal itself? (autoscaling, recovery, etc.)
• Context
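One way to picture how context changes what an alert means is a small rule that turns a few of the factors above into a severity. This is only an illustrative sketch: the field names, the three-level severity scale, and the rules themselves are assumptions, not CloudHesive's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertContext:
    environment: str         # e.g. "production" or "development" (hypothetical labels)
    mission_critical: bool   # what the thing does
    self_healing: bool       # autoscaling or automatic recovery is in place
    in_working_hours: bool   # is anyone expected to be using it right now?


def severity(ctx: AlertContext) -> int:
    """Return 1 (escalate), 2 (work queue), or 3 (informational)."""
    if ctx.environment != "production":
        # A non-production outage is never a Sev1 in this sketch.
        return 2 if ctx.in_working_hours else 3
    if ctx.self_healing:
        # The platform is expected to recover on its own.
        return 3
    return 1 if ctx.mission_critical else 2
```

The same raw alert therefore lands in different places: a development outage overnight is informational, while a mission-critical production outage without self-healing is a Severity 1.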
24. Datadog is central to our event
monitoring platform
How does CloudHesive solve it?
How does it work?
• Data from the sources described on previous slides +
more are sent to Datadog
• It performs the initial triage via a series of pre-configured
monitors
• Non-Severity 1 events go to a work queue (Jira)
• Severity 1 go to an escalation queue (OpsGenie)
• All events persisted to long term storage (SumoLogic)
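The routing described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: plain Python lists stand in for the Jira work queue, the OpsGenie escalation queue, and SumoLogic long-term storage, and the `severity` key on each event is a hypothetical field. No real vendor API is called.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def route(
    event: Event,
    work_queue: List[Event],        # stand-in for the work queue (Jira)
    escalation_queue: List[Event],  # stand-in for the escalation queue (OpsGenie)
    archive: List[Event],           # stand-in for long-term storage (SumoLogic)
) -> str:
    """Persist every event, then send it to exactly one queue by severity."""
    archive.append(event)  # all events are persisted, whatever their severity
    if event.get("severity") == 1:
        escalation_queue.append(event)
        return "escalation"
    work_queue.append(event)
    return "work"
```

Persisting before routing means that even events nobody acts on remain available for later analysis.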
25. With outlier detection we are able to
find underperforming members of
clusters, autoscaled groups, etc., and
act appropriately.
Outlier Detection!
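To make the idea concrete, here is a minimal sketch of one common way to flag an outlier within a group: compare each member's distance from the group median against the median absolute deviation (MAD). This illustrates the general technique only and is not Datadog's implementation; the host names, the metric, and the `tolerance` default are assumptions.

```python
from statistics import median
from typing import Dict, List


def find_outliers(values: Dict[str, float], tolerance: float = 3.0) -> List[str]:
    """Return members whose metric sits more than `tolerance` MADs from the median."""
    if not values:
        return []
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # No spread to compare against; this sketch declines to flag anything.
        return []
    return sorted(name for name, v in values.items() if abs(v - med) > tolerance * mad)


# Hypothetical request latencies (ms) for five members of an autoscaled group.
latencies = {"web-1": 101.0, "web-2": 99.0, "web-3": 100.0, "web-4": 102.0, "web-5": 250.0}
slow_members = find_outliers(latencies)
```

Because the comparison is against the rest of the group rather than a fixed threshold, it keeps working when the whole group's baseline shifts.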
26. We covered real time but what about looking backwards?
• Root cause analysis (eg. on this date/time the application
underperformed – why?)
• Change planning (eg. expecting a 10x increase in traffic,
will our autoscaling strategy work?)
What else can we do?
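The change-planning question above comes down to simple arithmetic on historical data. The sketch below assumes you already know your current peak request rate and how much load a single instance sustains from past metrics; all names and numbers are hypothetical.

```python
import math


def required_instances(
    current_peak_rps: float, traffic_multiplier: float, rps_per_instance: float
) -> int:
    """Instances needed to serve the projected peak, rounded up."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(current_peak_rps * traffic_multiplier / rps_per_instance)


def autoscaling_can_absorb(required: int, group_max_size: int) -> bool:
    """Will the autoscaling group's configured maximum cover the projected need?"""
    return required <= group_max_size


# Hypothetical: 200 req/s peak today, 10x expected, 150 req/s per instance.
needed = required_instances(200.0, 10.0, 150.0)
fits = autoscaling_can_absorb(needed, group_max_size=12)
```

With these assumed numbers the projected need exceeds the group maximum, so the limit would have to be raised before the traffic arrives.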
27. Problem Statement:
Now that I know what I want to monitor, how do I select the
right tools?
Integrations and the AWS Ecosystem
Implemented by default
• AWS Integration/Agent Installation/Agent Configuration
Integrates
• Over 100 integrations
What does it do best?
• Time series data, Key/Value pairs
Scales (Operationally and Technically)
• Ever run your own monitoring platform? The last thing you
want is your platform to be impacted by the same event
impacting your monitored infrastructure
28. • Insight across customers
• New customers get a default suite of
integrations and monitors
• Support customer DevOps initiatives
• Stronger Next Generation MSP
• Security? Security!
What powers do we gain?
29. Next Steps
Questions about monitoring or
the Datadog Partner Program?
John Gray
partners@datadoghq.com
Questions around the AWS
MSP Partner program?
Thomas Robinson
Aws-msp@amazon.com
Questions around being a
Datadog partner?
Patrick Hannah
Patrick.hannah@cloudhesive.com
Who are you?
Patrick Hannah, CloudHesive (where I’m a co-founder and the VP of Engineering)
What’s your background?
Architecture, Security, DevOps on AWS for 6 years, prior to that Contact Center Architecture and Operations for over 8 years.
What do you hope to get out of the presentation?
I want to help folks get the same out of AWS as I have.
I’d also like to see how others are using AWS – as with just about anything in technology there are multiple ways to do something right (or wrong).
How are you using cloud services?
In every aspect of my life, from Alexa-powered Echos to my day job.
Why did you pick the cloud services that you are using?
AWS is at the forefront of Cloud; their service catalog can support most traditional on-premise software use cases (infrastructure) but they also offer more abstracted services for software built on the cloud that negate the need to manage server infrastructure – on premise or on cloud.
CloudHesive offers end-to-end solutions to migrate and securely operate our customers’ mission critical applications on the Public Cloud.
We were founded in 2014 with the purpose of enabling our customers’ use of the Public Cloud, specifically AWS.
Our offerings span four distinct categories: Professional Services, DevOps, Managed Security Services and Next Generation Managed Services.
What’s important?
That’s a somewhat vague question.
In this case, I’m referring to monitoring.
It’s something for which you will get varying answers depending on who you ask.
To me, monitoring solves the general challenge faced by developers, operations, business, etc. around the need for visibility into the full stack of their infrastructure.
This spans a number of different components, can be performed in a number of different ways, and is often met with strong opinions.
The visibility provided by AWS into the infrastructure and the instrumentation provided by development platform specific libraries exponentially grows the data points generated by and associated with the application.
Coupled with the various streams of alerting (noise), you may find your team spending more time managing alerts than doing their jobs.
Traditional approaches to monitoring fail to address the challenge of correlating data across multiple services. You lose the ability to trend and you need to review multiple systems to come to a conclusion, resulting in redundant work and the loss of important monitoring data.
So monitoring is important, but what’s really important?
More than getting data thrown at you, you need context to understand the importance of that data and what action to take.
Outage of a development environment during non-development hours is not a Sev1.
Crossing a CPU threshold in an auto scaled environment may not be important.
Am I looking at data from a real time perspective? Or historical?
DataDog is the tool used to collect initial events from our various systems.
CloudHesive has been using DataDog for over two years to collect, sort through, process and categorize the data received from these systems and make decisions on what action to take.
Coupled with a rich set of integrations (like the ones listed on this slide + more) it is an excellent platform for Next Generation MSPs to leverage to solve their need to corral the ever growing sets of operational data.
More challenges exist to solve, specifically around dealing with fixed thresholds, fixed counts and alerting based on pattern (or lack of pattern) matching.
So what else can we do?
Leveraging the outlier detection capabilities in DataDog, we have the ability to look at how pools of resources are behaving and identify underperforming (or simply non-working) resources.
In the past, this has provided us insight into poorly performing hardware (think first generation instances, overprovisioned instances and bare metal issues (neither on AWS)).
It’s also helped us identify application issues around garbage collection, configuration, etc.
The focus of the presentation to this point has been about the real time collection, processing and alerting of data in DataDog.
Just as important, though, we need to be able to look back to identify events that may have been overlooked, perform root cause analysis or perform capacity planning.
As mentioned before, monitoring tools have been around for some time now and the AWS Ecosystem is filled with them.
How do I pick the right one for my use case? When we selected DataDog, we reviewed about 8 monitoring platforms, from open source to commercial to SaaS.
We ultimately decided it is not our core business to run a monitoring platform (which I did in a previous life and still have nightmares about).
My initial philosophy on monitoring was agentless, but after running into numerous SNMP bugs, I warmed up to the idea of agents (pre DataDog).
With that said, we narrowed down the list and ultimately selected DataDog for its ease of implementation (pretty much there by default) and its strong suite of integrations.
DataDog recently introduced an APM and is capable of handling events (such as Windows Event Log), which makes it seem like a single tool to do all jobs.
In our case, we went with a strategy where DataDog was the prime collector, processor and forwarder for threshold based events (time series data, etc.) and went through similar processes finding Log management, APM and Escalation platforms.
We were pleasantly surprised to see how well these integrated, specifically New Relic, SumoLogic, OpsGenie and Slack.
Even better, if an integration didn’t exist, we wrote it. As a matter of fact DataDog is the key engine in some of the autoscaling solutions we have implemented and we have gone so far as to recommend its use in IoT devices.
To conclude my presentation, in the final slide I will talk about the powers we gained from implementing DataDog.
We have unparalleled insight across our customers. This lets us identify platform wide outages as well as make recommendations to customers on which configurations work best for their particular use case.
New customers get the collective insight we have by way of a default suite of integrations and monitors.
It also allows us to support our customers’ DevOps initiatives in a number of ways, but I cannot stress enough how well it works with AutoScaling.
All-in-all it makes us a stronger Next Generation MSP, and we continue to improve our operations with it, and vice versa.
Last but not least, while we focused on operational monitoring, DataDog is not only designed with security in mind, but helps support our managed security services as well.
Operational metrics are good early indicators of Security Breaches, and we have successfully identified issues in the past based on them (not to mention security tools can pass their data into DataDog and vice versa).