The document discusses building a business case for middleware monitoring by estimating financial return on investment. It provides an example of how to calculate costs of incidents from outage data and use it to show a monitoring solution would pay for itself within 5.6 months. It also shares lessons from implementing monitoring at Citibank, including starting with a small proof of concept, quantifying results, and getting support from stakeholders.
2. Obtaining funding can be challenging
Isn’t monitoring free with TIBCO Admin tools?
Don’t we already have a standard?
I’ve never built a business case before.
It’s not in the budget this year.
3. The Benefits of Monitoring
Fewer high priority incidents
Faster MTTR
Fewer people on conference calls
Better use of IT resources
+ Benefits to the Brand
7. Severity 1
Annual Incident Volume
MTTR (hours)
Severity 2
Annual Incident Volume
MTTR (hours)
Discovery: Gathering Performance Data
*Usually available from your trouble ticket system admin
8. Average hourly cost of outages
Loss of revenue
SLA penalties
Discovery: Estimating the Cost Of an Outage
*Usually provided by the owner of the business service
18. Visibility in context
Single version of truth
Rapid Isolation & Remediation
Root Cause vs Symptoms
Interruption Avoidance
Expert Systems
What we Needed
19. Start with Proof of Concept
Select low hanging fruit
Make it small but meaningful
Quantify results & map next steps
Involve and socialize it with your main “detractors”
Don’t just be a “fool with a new tool”. Show progress.
Sell the Idea
20. Example of a POC process
Let’s monitor the Login Process for Online Banking.
21. Citi Demo of RTView Enterprise Monitor
Let’s see the results and what is possible…
22. Once the POC is done…
Define end-to-end monitoring requirements
Estimate effort to build dashboards
Build dashboards into PRD requirements
Define requirements for both business and technology
Establish an agile cadence for deployment of updates
NEXT STEPS
23. Increasing your chances of success
Tie to a measurable strategic objective
Be conservative in ROI inputs
Manage the politics
If possible, start small
Payback of less than one year
24. Let’s take some questions
• Submit your questions in Q&A panel
• We will send you a link to the recording after this event
Ask your questions here…
25. www.SL.com
End to end monitoring and analytics for middleware-powered applications
RTView Enterprise Monitor ®
Editor's notes
Thank you David and thank you to everyone for your time today.
When we worked with the Citi team last year, we had a good opportunity to work on building a business case in parallel with the technical evaluation of our product RTView Enterprise Monitor. I’m going to walk through how we have seen successful business cases pulled together and then turn it over to Alejandro who will share some of his experiences around business case development and then provide everyone with a demo of the monitoring system.
How many of you have noticed that it’s getting harder and harder to get IT purchases approved lately? [pause for effect - or ask for a show of hands for a little humor] Well, you’re not alone. And it’s not just monitoring purchases. Today we are talking specifically about monitoring, but a lot of what you’ll learn is applicable to many IT purchases.
Obtaining funding for monitoring is often challenging. We often hear comments such as:
But the right consolidated monitoring tool will provide a lot of benefits:
Support teams need tools designed specifically for them, not a more general, standardized, operations-focused systems monitoring or APM tool.
These first two categories are where we usually focus our business case justification: reducing Severity 1 and Severity 2 incidents and faster Mean-time-to-repair. The rest of them tend to be “soft benefits”. While they represent real value, they are tougher to quantify. So we usually leave them out of the financial analysis.
We all know that war room conference calls can be a huge time sink. One large bank told us they sometimes have up to 30 people on these conference calls, and the calls last for hours. One bank we know of told us about a conference call that lasted more than 25 hours last year. If monitoring reduces the number of these calls, that is a big productivity gain, but because you are not really saving the payroll costs, it is not a quantifiable cost savings.
Better use of IT resources is a big one as are Benefits to the brand. What’s the value to the brand and how can you quantify it?
Well, my daughter and a family friend were caught up . . .
. . . in the big airline outage that happened over the weekend. They got stranded in Oakland when the Southwest.com website and a lot of supporting applications, such as airport scanners, were down. And it went on for days. Our friend eventually had to drive down to southern California and didn’t get in until 3 in the morning. It is hard to quantify the brand cost of this event to Southwest across the customer base, but the brand probably took a hit.
So there are a lot of benefits to monitoring. Yet middleware teams still struggle to get IT purchases approved. Why is that?
A recent study published in the Harvard Business Review explains that one of the trends in IT purchasing is that decisions are made less by a single decision maker and increasingly by teams. The average technology purchase decision is now made by a team of 5 or 6 people. The more people are involved, the more difficult it is to get the purchase approved and only 31% of IT purchases are actually approved.
These 5-6 people on the approval team aren’t necessarily in IT, and they don’t necessarily “feel your pain” in the same way that you do, like spending hours in a war room trying to find the source of a problem and digging through log files. These people (CIOs, CTOs, purchasing, legal, and LOB) need a strong business case to see the value of what you are proposing.
So you better have a strong business case!
A number of things go into building an effective business case that we’ll cover in this session. I’m going to start with the financial model.
The first thing you need to do is some homework and data gathering so you can build the model.
It’s critical to involve the business users at this stage so that they are bought in.
You need to quantify your incident volume and MTTR, or mean time to resolution, for both high-priority Sev 1 and Sev 2 incidents. Generally, a Sev 1 means a mission-critical application or service is down. A Sev 2 means an application or service is impaired.
You might be able to access this data directly from your trouble ticketing system such as ServiceNow or Remedy, so speak with your trouble ticket system admin.
However you get it, you want to gain consensus agreement with the business on these numbers.
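As a rough illustration of this data-gathering step, here is a minimal sketch that derives incident volume and MTTR per severity from an exported ticket list. The record layout, the timestamps, and the `summarize` helper are all assumptions made up for the example; a real ServiceNow or Remedy export will look different.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical ticket export: (severity, opened, resolved). Not real data.
TICKETS = [
    ("Sev 1", "2016-01-10 08:00", "2016-01-10 11:00"),
    ("Sev 1", "2016-03-02 14:30", "2016-03-02 15:30"),
    ("Sev 2", "2016-02-15 09:00", "2016-02-15 13:00"),
    ("Sev 2", "2016-05-20 22:00", "2016-05-21 00:00"),
    ("Sev 2", "2016-07-04 10:00", "2016-07-04 13:00"),
]

TIME_FORMAT = "%Y-%m-%d %H:%M"


def summarize(tickets):
    """Return {severity: (incident_count, mean_hours_to_resolve)}."""
    durations = defaultdict(list)
    for severity, opened, resolved in tickets:
        delta = (datetime.strptime(resolved, TIME_FORMAT)
                 - datetime.strptime(opened, TIME_FORMAT))
        durations[severity].append(delta.total_seconds() / 3600)
    return {sev: (len(hours), sum(hours) / len(hours))
            for sev, hours in durations.items()}


SUMMARY = summarize(TICKETS)
for sev, (count, mttr) in sorted(SUMMARY.items()):
    print(f"{sev}: {count} incidents, MTTR {mttr:.1f} h")
```

The two numbers per severity are the inputs the spreadsheet asks for, and they are the ones to review with the business before using them.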
The other piece is getting an estimate of the cost of an outage. This is usually the loss of revenue for revenue-generating services. But it can also be other quantifiable costs such as Service Level Agreement violations.
This information is usually provided by the owner of the business service.
You want to use very conservative numbers and you want to gain consensus agreement with the business on these numbers.
There are a number of published studies on this which can also be helpful – we have listed some in one of the spreadsheet worksheets. Ranges generally fall between $100,000 and $300,000 per hour.
And a study by IDC in 2014 reports that the average cost of a critical application failure for Fortune 1000 companies is $500,000 to $1 million per hour.
But it can be much, much higher. Gartner/Dataquest published a study documenting up to $2.5M for credit card authorization services and $6.5M for brokerage operations.
At its simplest, we’re looking to quantify lost revenue due to system outages and impairment.
We’re using an Excel spreadsheet developed over the past year and a half working with several customers. We are happy to make it available to our clients and to provide any assistance with the process.
Now for this quick walkthrough we are using generic input data. This is not Citi’s actual incident and cost information.
The spreadsheet contains tabbed worksheets for entering your total costs, the incident information you have gathered, benefits and summary tabs.
The output of all this will be an ROI calculation and payback period based on your input values within the linked spreadsheet.
For Costs, we want to look at total costs so we include not only licensing and annual software support but also costs for any consulting, training, hardware, and labor as well. We want these costs to be as accurate and conservative as possible.
Here is where you enter your incident statistics and hourly outage costs. Different business cases might use different or additional inputs (such as Service Level Agreement penalties added to the hourly costs) but we find these to be the most common and relevant.
There are two areas of benefits identified, one for tangible and one for intangible. Since our business case is based just on tangible benefits, we focus here. The worksheet references the costs and incident inputs you added earlier, and after you add your assumptions, you get your cost savings below.
For outage avoidance, a typical range of improvement runs 25% - 50% but we are taking a very conservative estimate of 20%.
For decreased mean time to resolution (MTTR) we are taking a similar approach and starting with 20%
Explain some of the numbers.
Show that with 10 Sev 1 incidents predicted for the first year and no change in monitoring, a 20% reduction means 2 incidents avoided. Incidents avoided x incident duration x hourly cost of outage = the savings (circle $273,000).
circle the total of $1.8M
We do this again for Sev 2 incidents.
Adding up both Sev 1 and Sev 2 over the 3-year period, we get total expected annual benefits of over $2M.
circle the total of annual benefits
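The arithmetic behind these benefit lines can be sketched in a few lines of Python. This is only an illustration: the 10 incidents and the 20% rates come from the walkthrough, but the incident duration and hourly outage cost below are placeholder values, so the result does not reproduce the figures circled on the slide. Applying the MTTR improvement only to the incidents that still occur is also an assumption about how the worksheet combines the two benefits.

```python
def annual_savings(incidents, mttr_hours, hourly_cost,
                   avoidance_rate=0.20, mttr_reduction=0.20):
    """Tangible annual savings for one severity level.

    Returns (avoidance_savings, mttr_savings):
      avoidance_savings = incidents avoided x duration x hourly cost
      mttr_savings      = remaining incidents x hours saved x hourly cost
    """
    avoided = incidents * avoidance_rate
    avoidance_savings = avoided * mttr_hours * hourly_cost
    remaining = incidents - avoided  # assumption: MTTR gain applies to these only
    mttr_savings = remaining * mttr_hours * mttr_reduction * hourly_cost
    return avoidance_savings, mttr_savings


# Placeholder inputs: 10 Sev 1 incidents, 2-hour MTTR, $100,000 per outage hour.
SEV1_AVOIDANCE, SEV1_MTTR = annual_savings(10, 2.0, 100_000)
print(f"Outage avoidance: ${SEV1_AVOIDANCE:,.0f}")
print(f"Faster MTTR:      ${SEV1_MTTR:,.0f}")
```

Running the same function with the Sev 2 inputs and adding the results gives the tangible benefits total for the year.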
Back in the summary section, we can now see our inputs roll up to a solid 5.6-month payback period.
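For reference, a simple payback period is total cost divided by the monthly benefit. The sketch below assumes that simple definition and uses made-up cost and benefit figures, so it is not the spreadsheet's exact formula and does not reproduce the 5.6-month result.

```python
def payback_months(total_cost, annual_benefit):
    """Months needed for cumulative benefits to cover the total cost."""
    if annual_benefit <= 0:
        raise ValueError("annual_benefit must be positive")
    return total_cost / (annual_benefit / 12)


# Placeholder inputs: $500,000 total cost, $1,200,000 annual tangible benefit.
print(f"Payback: {payback_months(500_000, 1_200_000):.1f} months")
```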
Payback period generally seems to be the most widely accepted metric. The model also supports other summary metrics, including Net Present Value and Project Internal Rate of Return, which are also good objective measures.
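For anyone who has not worked with these two metrics, here is a minimal, self-contained sketch of both. The cash flows and the 10% discount rate are placeholders, and the bisection search is just one simple way to find the IRR; the spreadsheet presumably relies on Excel's built-in functions instead.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs today (year 0)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def irr(cash_flows, low=-0.99, high=10.0, tol=1e-7):
    """Internal rate of return found by bisection.

    Assumes NPV changes sign exactly once between `low` and `high`.
    """
    f_low = npv(low, cash_flows)
    if f_low * npv(high, cash_flows) > 0:
        raise ValueError("no sign change in the search interval")
    while high - low > tol:
        mid = (low + high) / 2
        f_mid = npv(mid, cash_flows)
        if f_low * f_mid <= 0:
            high = mid
        else:
            low, f_low = mid, f_mid
    return (low + high) / 2


# Placeholder cash flows: up-front cost, then three years of benefits.
FLOWS = [-500_000, 1_200_000, 1_200_000, 1_200_000]
print(f"NPV at 10%: ${npv(0.10, FLOWS):,.0f}")
print(f"IRR: {irr(FLOWS):.1%}")
```

A positive NPV at the company's discount rate and an IRR above its hurdle rate both say the same thing as a short payback period, in terms finance teams already use.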
The spreadsheet approach is also good for doing what-if scenarios. If your finance guy challenges a number, great. You can change it and now you’ve got your finance guy on board.
At this point I would like to introduce Alejandro Ayestaran who will explain a little more about the project.
We started working with Alejandro and his team approximately one year ago. He has been a great partner and has worked closely with us on product feedback and some enhancements. We appreciate him agreeing to share his experience at the NOW conference and here today.
Alejandro is a Regional Director at Citibank North America and is currently the head of the Latin America Systems Integration development and shared services area, reporting to the CIO of Latam. He is currently leading the implementation strategy for a 2-speed architecture transformation in Mexico. This includes a complete modernization of the technology stack along with OmniChannel, Microservices, lightweight API-based development, and cloud-based integration with continuous delivery.
He has more than 15 years of experience working with SOA, EAI, BPM and Integration technologies such as TIBCO, MQ Series, and UML Modeling Tools.
Alejandro!
Lack of visibility into TIBCO middleware and infrastructure (current and historical metrics) – At the right level.
Inability to share information with other teams on TIBCO integration and workflow processes
Requirements to reduce MTTR (reduce cost and increase customer satisfaction) – Incidents take too long to be resolved.
No end-to-end root-cause analysis tools – Distributed accountability and diverse support teams
Reactive and not proactive – Business is the one telling us that “things are broken”
Current Dashboards designed to be high-level only and not intended to display details of activities and services needed for troubleshooting
Large dependency on Development/L3 Support teams to actively solve most of the identified issues, creating SPOFs on individuals or small teams.
Large number of deployments
Complicated responsibilities
Long times to pull info from a distributed environment
What do we need?
Increased visibility – Real-time alerts and dashboards, showing what is happening now, in the context of business transactions and not necessarily IT components.
Different Support Teams see the same information at the same time, provided in a consistent way so that no time is wasted discussing what is the “truth” in order to work effectively together.
Reduce MTTR – Rapid isolation of error conditions allowing for fast and accurate remediation.
Root Cause Analysis – RTView/EM helps separate the cause from the symptom which enables a reduction in repeated failures.
Interruption avoidance - Real time monitoring and alerting allow Citi to avoid error conditions that may cause service interruption – resulting in improved customer experience and reduced cost of ownership.
Training your experts – Dashboards will have the business context of what an end-to-end transaction flow looks like and all the layers traversed. That minimizes dependencies on L3 & Dev teams.
Okay, I get it. How do I sell the idea?
Lack of effective monitoring is a problem, but nobody really wants to invest in fixing it, simply because it is everywhere and touches so many teams that it has no real owner.
The users need to see it to believe it. An endeavor like this requires a very graphical and compelling Proof of Concept.
Select low-hanging fruit. What is the major headache for my business line? Where are we leaving money on the table? What is killing our support team?
It needs to be small but also meaningful: an end-to-end sample, but complete and adding real value.
Quantify the results and calculate, “next steps & efforts” as part of the ask.
Involve and socialize it with your main “detractors”; include their feedback and how to keep them “happy”.
Don’t be just a “fool with a new tool”. Show progress and connect technology with real and tangible business value.
Once it’s done, how to move forward?
Once the POC portion is done, you need to start having estimates at several levels/areas:
Complete set of components from EM that you will use. BW, EMS, Files, MQ, others?
Estimate the level of effort for building dashboards completely separately from Software & Infrastructure. In fact, a factory model would be needed to build and update dashboards fast enough.
Align monitoring dashboards with functional releases for business applications. This means, for anything new: “you are not going to PRD unless the corresponding monitoring dashboards are also built”. A sizable catchup would be needed for existing processes.
Add functions that help bridge the gaps between business and technology. Answer simple questions like “How many transactions in the last 2 hours?”, “What is the current TPS?”, and “Which transaction is taking longest?”
Establish an agile cadence for deployment of updates if you need to support “digital speed”. Weekly deployment of changes is the very minimum.
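To make the "simple questions" above concrete, here is a small sketch of the three calculations over a list of transaction events. Everything in it is hypothetical: the event tuples, the function names, and the 10-second TPS window are illustrative stand-ins in plain Python, not the RTView Enterprise Monitor API or Citi's data.

```python
from collections import defaultdict

# Hypothetical transaction log: (epoch_seconds, transaction_name, duration_ms)
EVENTS = [
    (1000, "login", 120),
    (5000, "login", 180),
    (9000, "transfer", 900),
    (9991, "login", 150),
    (9995, "transfer", 700),
    (9998, "balance", 60),
]
NOW = 10_000


def count_since(events, now, window_s):
    """How many transactions completed in the last `window_s` seconds?"""
    return sum(1 for ts, _, _ in events if now - window_s < ts <= now)


def current_tps(events, now, window_s=10):
    """Transactions per second averaged over a short trailing window."""
    return count_since(events, now, window_s) / window_s


def slowest_transaction(events):
    """Name of the transaction type with the highest mean duration."""
    durations = defaultdict(list)
    for _, name, ms in events:
        durations[name].append(ms)
    return max(durations, key=lambda n: sum(durations[n]) / len(durations[n]))


print("Last 2 hours:", count_since(EVENTS, NOW, 2 * 3600))
print("Current TPS:", current_tps(EVENTS, NOW))
print("Slowest:", slowest_transaction(EVENTS))
```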
Welcome and thanks for your time.
Introductions
One hour