This document discusses the value of DevOps and monitoring tools in improving collaboration between development and operations teams and justifying investments in automation. It notes that traditionally, dev teams focused on features while ops focused on incidents, but both were measured by vague business metrics like revenue and uptime. New tools can help baseline current problems, measure progress over time, and demonstrate business impact to obtain support for further investments. The document advocates for monitoring the full customer experience rather than individual system components.
7. Typical Dev Day
1. Look at the overnight integration tests
2. Buy chocolates for the team if you broke the build
3. Scramble to fix the build
4. Pick the top priority item from your backlog
5. Start coding
6. Get dragged into troubleshooting prod. incidents
7. Hastily check in new code as you ran out of time
13. Typical Ops Day
1. Open 30 new tickets
2. Make 200 phone calls
3. Attend executive P1 status update meeting
4. Argue about what a P1 and a P2 really are
5. Reprioritise P2 tickets to P1
6. Reprioritise P3 tickets to P2
7. Close tickets as ‘Cannot reproduce’ or ‘Duplicate’
21. 2am Friday - #FFS
"We have had an alert that the load on one of your staging servers is critical."
22. How much time do false alarms waste?
Role      Hours Per Week   Cost Per Week   Cost Per Year
Ops       20               £400            £20,800
L2        10               £200            £10,400
L3        15               £300            £15,600
Hosting   6                £120            £6,240
Network   6                £120            £6,240
CMS       10               £200            £10,400
Total     67               £1,340          £69,680
Conservative estimates assuming £20/hour
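The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch, not the presenter's own calculation: the hours per role and the £20/hour rate come from the slide, while the 52-week year is an assumption inferred from the yearly figures (for example £400 x 52 = £20,800). The function name `false_alarm_cost` is hypothetical.

```python
HOURLY_RATE_GBP = 20   # rate stated on the slide
WEEKS_PER_YEAR = 52    # assumption: implied by 400 * 52 = 20,800

# Hours per week lost to false alarms, per role, as listed on the slide.
hours_per_week = {
    "Ops": 20, "L2": 10, "L3": 15,
    "Hosting": 6, "Network": 6, "CMS": 10,
}

def false_alarm_cost(hours, rate=HOURLY_RATE_GBP, weeks=WEEKS_PER_YEAR):
    """Return per-role (hours, weekly cost, yearly cost) and column totals."""
    rows = {role: (h, h * rate, h * rate * weeks) for role, h in hours.items()}
    totals = tuple(sum(column) for column in zip(*rows.values()))
    return rows, totals

rows, totals = false_alarm_cost(hours_per_week)
for role, (h, weekly, yearly) in rows.items():
    print(f"{role:<8} {h:>3} h  £{weekly:>6,}  £{yearly:>7,}")
print(f"{'Total':<8} {totals[0]:>3} h  £{totals[1]:>6,}  £{totals[2]:>7,}")
```

Changing the rate or the hours for your own teams gives a quick first baseline of what false alarms cost.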
24. Typical Day
1. Open 30 new tickets
2. Make 300 phone calls
3. Attend executive P1 status update meeting
4. Argue about what a P1 and a P2 really are
5. Reprioritise P2 tickets to P1
6. Reprioritise P3 tickets to P2
7. Close tickets as ‘Cannot reproduce’ or ‘Duplicate’
1. Look at the overnight integration tests
2. Buy chocolates for the team if you broke the build
3. Scramble to fix the build
4. Pick the top priority item from your backlog
5. Start coding
6. Get dragged into troubleshooting prod. incidents
7. Hastily check in new code as you ran out of time
25. Things that would help
1. Automation
2. Collaboration
3. Better Tooling
4. Business Metrics
26. Things that could justify them
1. Baseline the starting point
2. Measure progress
3. Calculate Business Impact
4. Promote success not problems
5. Demonstrate value
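44. Traditional monitoring approach is limited
[Diagram: a monitoring stack, from top to bottom: End User Experience, Business Transaction, Application, then the infrastructure components (Server, OS, DB, MQ, Web, JVM).]
Existing approach: silo’d domain visibility, with four separate readings of 99.9%.
Expanded approach: business transaction.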
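One way to read the four 99.9% figures: each silo looks healthy on its own, yet a customer transaction that needs all of them sees something worse. The sketch below is an illustration under loud assumptions that are not in the slides: the four components sit in series on the transaction path and fail independently. The function name `end_to_end_availability` is hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365

# The four per-silo readings from the slide.
component_availability = [0.999, 0.999, 0.999, 0.999]

def end_to_end_availability(components):
    """Availability of a path that needs every component to be up.

    Assumes independent failures and a strictly serial dependency chain.
    """
    return math.prod(components)

e2e = end_to_end_availability(component_availability)
silo_downtime_h = (1 - component_availability[0]) * HOURS_PER_YEAR
e2e_downtime_h = (1 - e2e) * HOURS_PER_YEAR

print(f"Per silo:   {component_availability[0]:.3%}, about {silo_downtime_h:.1f} h down per year")
print(f"End to end: {e2e:.3%}, about {e2e_downtime_h:.1f} h down per year")
```

Under these assumptions the end-to-end figure is closer to 99.6%, which is the gap that silo dashboards cannot show.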
45. How many of you use performance management tools?
47. Identify early [!]
Troubleshoot fast [!]
Resolve quickly [!]
Quantify impact [x]
59. What data could we collect?
Attribute                Person 1   Person 2
Heart Rate               150        150
Blood Pressure           180/90     180/90
Eye Color                Blue       Brown
Blood Type               O+         O-
White Blood Cell Count   3.5        3.8
Hair Color               Brown      Blue
Height                   180cm      175cm
Shoe size                11         10
Weight                   180kg      94kg
Current activity         sitting    skating
60. IS PERSON 2 PERFORMING WELL?
[Chart: Time against Distance over 10,000 metres. Marked time: 12min 44sec, against a baseline record time of 12min 58sec.]
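A measurement only says something once it is set against a baseline. As a minimal sketch using the two times on the slide (the helper names `to_seconds` and `compare_to_baseline` are hypothetical, and treating 12min 44sec as the measured run is an assumption):

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair into total seconds."""
    return minutes * 60 + seconds

def compare_to_baseline(measured_s, baseline_s):
    """Return (absolute delta in seconds, relative delta in percent).

    Negative values mean the measured run was faster than the baseline.
    """
    delta = measured_s - baseline_s
    return delta, 100.0 * delta / baseline_s

baseline = to_seconds(12, 58)   # record time from the slide
measured = to_seconds(12, 44)   # assumed to be the measured run

delta_s, delta_pct = compare_to_baseline(measured, baseline)
print(f"{delta_s:+d} s ({delta_pct:+.1f}%) relative to baseline")
```

The same comparison works for response times or error rates once a baseline has been recorded.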
67. Understand the impact of slow performance
[Screenshot from a US e-commerce AppDynamics customer. Panels: Application Revenue, Application Response Time, Application Errors. Annotated values: revenue of $64,499 per min and $11,987 per min; response times of 100 ms and 10.1 s.]
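The two revenue figures can be turned into a cost of slowness. This is a rough sketch, assuming that $64,499 per min corresponds to the 100 ms response time and $11,987 per min to the 10.1 s response time; the slide text alone does not confirm that pairing. The 30-minute incident length and the function name `revenue_impact` are hypothetical.

```python
def revenue_impact(normal_per_min, degraded_per_min, incident_minutes):
    """Return (revenue lost per minute, revenue lost over the incident)."""
    lost_per_min = normal_per_min - degraded_per_min
    return lost_per_min, lost_per_min * incident_minutes

NORMAL_REVENUE = 64_499     # $ per min, assumed to match 100 ms responses
DEGRADED_REVENUE = 11_987   # $ per min, assumed to match 10.1 s responses
INCIDENT_MINUTES = 30       # hypothetical incident length

lost_per_min, lost_total = revenue_impact(
    NORMAL_REVENUE, DEGRADED_REVENUE, INCIDENT_MINUTES
)
print(f"Lost per minute: ${lost_per_min:,}")
print(f"Lost over {INCIDENT_MINUTES} min: ${lost_total:,}")
```

A figure like this is easier to take to an executive status meeting than a load average.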
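68. Understand the benefit of an application release
[Screenshot. Panels: Application Revenue, Application Response Time, with markers for code release 1, code release 2 and code release 3. Annotated values: revenue of $44,499 per min and $58,237 per min; response times of 1.9 s and 3.1 s.]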
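The benefit of a release can be expressed as a relative change in each metric. The sketch below assumes, in line with the slide title, that revenue rose from $44,499 to $58,237 per min while response time fell from 3.1 s to 1.9 s; the slide text does not say which release produced which value. The function name `pct_change` is hypothetical.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before

# Assumed before/after ordering; see the note above.
revenue_change = pct_change(44_499, 58_237)   # $ per min
response_change = pct_change(3.1, 1.9)        # seconds

print(f"Revenue per minute: {revenue_change:+.1f}%")
print(f"Response time:      {response_change:+.1f}%")
```

Reporting both numbers together ties a technical improvement to its business effect.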