This document discusses how organizations can use big data and operational analytics to transform IT operations. It outlines how a data-driven approach that combines machine data and wire data can provide real-time visibility across networks, applications, databases, and other systems. This approach overcomes the limitations of siloed monitoring, where each team relies on its own tools. The document also covers key considerations for implementing IT big data solutions, such as data gravity, improving the signal-to-noise ratio, and understanding when data needs to be accessed in real time. It provides an example of how healthcare company McKesson used network traffic analysis to improve Citrix application performance and reduce IT costs.
3. Agenda
• The next-generation IT Big Data approach
• Moving toward real-time observational
data
• Key considerations for IT Big Data
• IT Big Data use cases
• Q&A
4. A Tool-Centric Approach = IT Silos
Network
Administrators
Virtualization
Team
Database
Administrators
VDI
Administrators
Application
Owners
Business
Analysts
Storage
Administrators
Security
Operations
5. A Tool-Centric Approach = IT Silos
Network
Administrators
Virtualization
Team
Database
Administrators
VDI
Administrators
Application
Owners
Business
Analysts
Storage
Administrators
Security
Operations
Big Data
for IT
7. Tapping New Sources of Visibility
Driven by
Big Data
Technology
Machine Data
Wire Data
8. Wire Data
• All communication
on the network from
packets to payload
• 1000 x bigger
than machine data
• Definitive
source of truth
• Data you
already have
9. Wire Data: Real-Time Observational
Analysis
A small sample of what wire data contains…
All L2-L7
communication
on the network
From Unstructured Packets
To Structured Wire Data
Extracting real-time
insight from all
communication and
data streams
Business Data
Product ID
Customer ID
Shopping Cart ID
Cart Items
Cart Values
Discounts
Order ID
Abandoned?
Application Data
POST Content
AJAX Data
Section
Sub-Section
Page Title
Session Cookie
Proxied IP Address
Error Message
Availability Data
HTTP status codes
Application errors
Connection resets
Heartbeats
SSL certificate validity
Synthetic pingers
SNMP traps
Authentication errors
Capacity Data
Throughput
Transactions
Dropped packets
Application stalls
Application slowdowns
Geolocation/
IP mapping
Storage Access
(reads/writes)
SSL Offload
Security Data
Command and Control
Shadow IT
(SaaS, cloud)
Network traversal
Unauthorized outbound
connections & protocols
Storage/DB access
Blacklisted traffic
Brute force attacks
Surreptitious tunneling
Performance Metrics
Caching Behavior
Compression Behavior
Base HTML Load Time
Round Trip Time
Client Request Time
Server Reply Time
Server Send Time
Total Time Taken
10. Self Reporting + Observation =
Insight
• Self-reported data
(machine data)
– “What are your symptoms?”
– “When did this start?”
– “Does this hurt?”
• Observational data
(wire data)
– MRI
– Blood tests
– Heart rate, pupil dilation,
appearance, etc.
11. IT Operations Analytics Survey
ExtraHop and TechValidate partnered to survey 88 respondents from
65 organizations that use the ExtraHop platform.
• 65% of respondents are combining data sources for ITOA now, or plan to do so
within one year
• 54% of respondents are currently integrating wire data and machine data in
some manner
• 67% of respondents saw ITOA capabilities as important for IT security
12. Key Considerations for IT Big Data
Moving data around can
be expensive
Data Gravity
Pull out more of the
signal, filter out more of
the noise
Signal-to-Noise
Understand when real-
time access to data is
important
Motion of Data
14. Signal-to-Noise Ratio
Signal
• Garbage in; garbage out
• Examples of data
sources with poor quality
– Threat detection
– Verbose logging
• Time is required to
separate signal from
noise
15. Motion of Data
Data at Rest
(Batch processing)
Example: MapReduce
in Hadoop
Data in Motion
(Stream processing)
Example: Apache Spark,
ExtraHop
[Diagram. Data at rest: DB, DB, DB, data mart, query, report, user. Data in motion: source, source, source, Batch 1, Batch 2, user.]
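The contrast between data at rest and data in motion can be illustrated with a minimal sketch. It is not taken from Hadoop, Spark, or ExtraHop; the function names and the sample response times are hypothetical. The batch version needs the whole data set before it can answer, while the streaming version updates its answer as each event arrives.

```python
from typing import Iterable, Iterator, List


def batch_mean(stored: List[float]) -> float:
    """Data at rest: the full data set is stored first, then queried once."""
    return sum(stored) / len(stored)


def streaming_mean(events: Iterable[float]) -> Iterator[float]:
    """Data in motion: yield an updated mean after every incoming event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# Hypothetical response times in milliseconds.
samples = [120.0, 80.0, 100.0]

print(batch_mean(samples))            # one answer, after all data has landed
print(list(streaming_mean(samples)))  # a fresh answer after each event
```

Both approaches reach the same final value; the difference is when an answer first becomes available.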
16. McKesson Managed Services
BACKGROUND
• Hosted healthcare applications for hospitals
• 7x24x365 mission critical operations
• Rapidly growing customer base
• Stringent and costly performance-based SLAs
CHALLENGE
• Complex: Hospitals’ and McKesson’s IT environments
• Equip IT generalists; lessen reliance on specialists
• High coordination costs, slow troubleshooting processes
• Operational costs increased while user satisfaction decreased
SOLUTION
• Citrix application launch times dropped 75% (40 to 12 sec)
• Staff optimization: from 2.6 to 1 engineer for every 4 hospitals - $260,000 savings in first year
• Reduced MSFT SQL licenses - $200,000 savings annually
• Understand the impact of application updates
“ExtraHop enables us to solve incredibly complex problems in a matter of hours. Extrapolated across our business, we’re saving at least $400,000 annually in terms of time spent troubleshooting.”
- Scott Checkoway, Director of Application Hosting
17. Citrix Environments Are Complex!
• Is there latency between the user and web server?
• Slow Active Directory server?
• Network issues in the Citrix cluster?
• Contention in the SAN?
18. See across
Citrix, web,
database,
storage, LDAP,
DNS, etc.
Visibility on the Wire
Correlate activity across all tiers with wire data
Monitor SLAs in
real time.
Drill into critical
KPIs (launch,
load times, etc.)
per user.
19. Visibility Into Citrix Application
Delivery
McKesson improved Citrix
application launch times by
75% with ExtraHop.
McKesson avoided more than
$260,000 in staffing costs in
its first year with ExtraHop.
20. Understand the Impact of Application
Updates
• Improved user
experience
• Fewer surprises for IT
Ops
• Faster feedback for
app teams
BENEFIT
Drill down to see
how SQL queries
are performing.
Compare performance
across versions and
across time periods.
21. Identify Active/Inactive Databases
Saved $200,000
annually in reduced
database license costs.
BENEFIT
See all database
transactions.
Show all activity by
every database and
degree of usage.
22. Operations Analytics: Real-Time Patient
Tracking
Observe admittance, discharge, and
transfers (ADTs) in real time.
Who and how many are being
admitted right now? Do we need to
adjust staff?
Track admissions by location and
gender.
Why are so many males being
admitted in Kent? Is it an epidemic?
• Optimize processes and staffing
for improved patient quality.
• Identify potential epidemics.
BENEFIT
This slide explains why troubleshooting the Citrix delivery of the applications was so difficult. Even Citrix consultants could not solve the problem.
In our Remote Hosting environment we are delivering McKesson applications to a number of customers.
We have 3500 VMs running on a large number of physical machines.
The Citrix environment alone is almost 1/3 of the infrastructure (XenApp servers, ZDC, Web Interface, NetScalers), all running on a common network and a few large storage arrays.
Each tier of the infrastructure (App, Network, and Storage) had its own tools and reporting to assist with issue resolution, so when a problem occurred you were left asking questions like:
Is there latency between the user and web server?
Slow Active Directory server?
Network issues in the Citrix cluster?
Contention in the SAN?
Although Doug’s team was very skeptical, ExtraHop proved its value by uncovering the root cause of the slowness.
Customers were very unhappy with the performance of the Citrix application delivery – suffering from 40 second application launches.
Internally we spent 2 months working on the network, Citrix server infrastructure, and storage tiers.
Mostly we were looking at logs from all of the systems and at individual system/vendor tools.
Using ExtraHop in a small POC, we were able to detect the root cause (a bad AD design) in 1 week and improve the customer experience.
The ability to see the full Citrix environment in a single tool allowed us to see the potential bottleneck within hours of the initial install and configuration.
It allowed us to collect and correlate multiple data streams in real time, which revealed a root cause that no individual system log had shown.
As a result of operationalizing the ExtraHop platform, the McKesson Managed Services team was able to realize some significant cost savings.
Customer experience improved significantly, and customer satisfaction scores improved in the months following the resolution.
The number of FTEs required to support the Citrix infrastructure dropped because the number of customer problem tickets decreased.
One of the problems we faced in hosting McKesson applications was that not every customer was running the same release of the applications. We are not a true SaaS environment, so variability exists in the applications and, in a few cases, in the versions of the underlying infrastructure.
This variability leads to some problems in supporting the application mix, especially performance-related problems.
We used ExtraHop to track the response time for any database request and the number of times the DB request ran.
This allowed us to generate a list of our “top 10” long running queries by application version.
This information was then sent to our development team to have them understand a few items:
Total system wait – if a DB request runs 10,000 times a day and you are able to improve it by 5 seconds, you improve the total system wait time by about 13.9 hours a day (10,000 × 5 s = 50,000 s).
Conversely, if you focus the effort on a DB request that executes 10 times a day but improve performance by 30 seconds, you only improve the total system wait time by 5 minutes.
It allowed McKesson to change where we wanted development to prioritize their limited time for performance optimization.
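The prioritization arithmetic above can be sketched in a few lines. This is only an illustration: the query names and the `QueryStat` type are hypothetical, and the two entries use the figures from the notes (10,000 runs saving 5 seconds each, versus 10 runs saving 30 seconds each).

```python
from typing import List, NamedTuple


class QueryStat(NamedTuple):
    name: str                     # hypothetical label for a DB request
    runs_per_day: int             # how often the request executes per day
    seconds_saved_per_run: float  # expected improvement per execution


def daily_seconds_saved(stat: QueryStat) -> float:
    """Total system wait removed per day = runs per day x seconds saved per run."""
    return stat.runs_per_day * stat.seconds_saved_per_run


def rank_by_total_saving(stats: List[QueryStat]) -> List[QueryStat]:
    """Order candidate optimizations by total daily wait time removed."""
    return sorted(stats, key=daily_seconds_saved, reverse=True)


stats = [
    QueryStat("rare_report", 10, 30.0),
    QueryStat("frequent_lookup", 10_000, 5.0),
]

ranked = rank_by_total_saving(stats)
for stat in ranked:
    saved = daily_seconds_saved(stat)
    print(f"{stat.name}: {saved:,.0f} s/day ({saved / 3600:.2f} hours)")
```

Ranking by total daily saving, not by per-run saving, is what pushes the frequently executed request to the top of the list.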
Once we had the long-running database requests categorized by application, we could start a crosswalk with the application development teams to determine whether the same problems existed between application releases or whether something new had been introduced.
Doing this in our QA environment during load testing allowed us to catch some major problems prior to releasing the upgrade to the early-upgrade customers.
As I stated earlier, our infrastructure is several thousand VMs and hundreds of physical servers.
As we were growing, people would stand up systems for some testing or QA purpose but never retire them officially.
Upgrades/migrations would occur and the source system would be left up just in case we needed to pull something over after go-live…
This created a large number of orphan systems that we were paying to license. In the database world this was especially costly.
Management wanted to retire the systems but was fearful of shutting something down inadvertently.
We used ExtraHop to monitor the data flow on the network in and out of the systems for a set duration.
It allowed us to show management that the systems were truly not being used and could be scheduled for decommission.
The result was a $200K annual reduction in the database licensing costs in our infrastructure.
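The decommissioning check described above can be sketched as follows. This is a simplified illustration, not ExtraHop's API: it assumes the observed traffic has already been exported as hypothetical `(server, transaction_count)` pairs for the monitoring window, and the server names are invented.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def find_inactive(
    known_servers: Set[str],
    observed: Iterable[Tuple[str, int]],
) -> List[str]:
    """Return known servers with no observed transactions in the window.

    `observed` is an assumed export format: one (server, transaction_count)
    pair per reporting interval.
    """
    totals: Counter = Counter()
    for server, count in observed:
        totals[server] += count
    # Servers that never appear in the traffic count as zero.
    return sorted(s for s in known_servers if totals[s] == 0)


# Hypothetical inventory and observations.
inventory = {"db-a", "db-b", "db-c"}
window = [("db-a", 120), ("db-c", 0), ("db-a", 30)]

print(find_inactive(inventory, window))
```

The length of the monitoring window matters: a system used only for month-end processing would look idle in a one-week sample.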
Looking to the future, I see real promise in using wire data for real-time patient analytics.
In the healthcare space, systems communicate across the wire using a number of standards (HL7, X12, ...).
By pulling the data off the wire we gain a few big advantages.
It can be done in real time. I do not have to wait for the data to be received and posted into a database before I can start looking at it
I don’t add load to a system to generate a report or process a DB request. I can leave those systems to do their job of serving up end-user needs
I don’t need a team of resources (one for every application).
Some of the potential uses would be real-time analysis of patient needs (such as increased ER admits) and shifting staff in the hospital to assist with a spike in activity.
Another would be trending diagnosis codes to identify potential disease patterns.
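As a rough sketch of the admissions-tracking idea, the snippet below counts admits by location and sex from HL7 v2 messages. It assumes the messages have already been reassembled from the wire into text, uses a naive pipe split with no handling of escapes or repetitions, and the sample messages are invented. The field positions used (MSH-9 message type, PID-8 sex, PV1-3 assigned patient location) follow the HL7 v2 layout.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def parse_admit(message: str) -> Optional[Tuple[str, str]]:
    """Return (location, sex) for an ADT^A01 admit message, else None."""
    segments: Dict[str, List[str]] = {}
    for line in message.strip().splitlines():  # HL7 segments end with \r
        fields = line.split("|")
        segments.setdefault(fields[0], fields)
    msh, pid, pv1 = segments.get("MSH"), segments.get("PID"), segments.get("PV1")
    if not (msh and pid and pv1):
        return None
    # MSH-1 is the separator itself, so MSH-9 sits at index 8 after the split.
    if len(msh) <= 8 or not msh[8].startswith("ADT^A01"):
        return None
    sex = pid[8] if len(pid) > 8 else ""
    location = pv1[3].split("^")[0] if len(pv1) > 3 else ""
    return location, sex


def sample(event: str, sex: str) -> str:
    """Build an invented ADT message for the demonstration."""
    return "\r".join([
        r"MSH|^~\&|ADT1|KENT|LAB|KENT|202401010830||ADT^" + event + "|MSG1|P|2.3",
        "PID|1||12345||DOE^PAT||19800101|" + sex,
        "PV1|1|E|ER^101^1",
    ])


messages = [sample("A01", "M"), sample("A01", "M"), sample("A03", "F")]
admits = Counter(r for r in map(parse_admit, messages) if r is not None)
print(admits)
```

A production parser would use a dedicated HL7 library and read the delimiters from MSH-1 and MSH-2 instead of hard-coding them.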