The document discusses using machine learning for anomaly detection in Splunk. It describes how machine learning can automatically learn baseline behavior, overcome limitations of human analysis, and detect anomalous behavior. Specifically, it explains how machine learning builds probability distribution functions from historical data to determine which events are unexpected or outliers. This allows anomalies to be identified without predefined thresholds.
7. Step 2) stats command, sort by count
Question: How to show as a function of time, not just overall?
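The sketch below is a rough Python analogue of what Step 2 computes. The deck does not show the exact search, so it assumes a hypothetical list of parsed events that each carry a `status` field. It counts events per value and sorts the counts in descending order. Like the search it mirrors, the result is one overall total per value and has no notion of time.

```python
from collections import Counter

# Hypothetical parsed events; in Splunk these would be the search results.
events = [
    {"_time": 0, "status": "200"},
    {"_time": 30, "status": "404"},
    {"_time": 3700, "status": "200"},
    {"_time": 3900, "status": "503"},
    {"_time": 7300, "status": "200"},
]

def count_by(events, field):
    """Count events per value of `field`, sorted by count (descending).

    Ties keep the order in which values were first seen.
    """
    counts = Counter(e[field] for e in events if field in e)
    return counts.most_common()

print(count_by(events, "status"))
```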
8. Step 3) add bucketing for breakdown by time
Question: What is an anomalous count per bucket? 100? 1000? 10,000? Maybe we should try to use some more stats?
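Here is a minimal Python sketch of the time bucketing in Step 3. In SPL this job belongs to the `bin` command, which is also available as `bucket`. The sketch assumes epoch-second timestamps and a hypothetical 1-hour span. Filling empty buckets with zero is a choice made here, so that quiet hours stay visible in the series instead of silently disappearing.

```python
from collections import Counter

BUCKET_SECONDS = 3600  # assumed 1-hour buckets; choose a span that suits the data

def bucket_counts(timestamps, span=BUCKET_SECONDS):
    """Floor each epoch timestamp to the start of its bucket and count per bucket.

    Empty buckets between the first and last event are filled with 0.
    """
    counts = Counter((t // span) * span for t in timestamps)
    if not counts:
        return []
    start, end = min(counts), max(counts)
    return [(b, counts.get(b, 0)) for b in range(start, end + span, span)]

# Hypothetical event times (seconds since the epoch).
times = [0, 30, 3700, 3900, 11000]
print(bucket_counts(times))
```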
9. Step 4) add some “basic” statistical analysis: avg +/- 2 standard deviations (stdev)
Question: How to show the individual “outliers” (and not lose the concept of time)?
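The "basic" analysis in Step 4 can be sketched in Python using hypothetical per-bucket counts. It computes the average and the sample standard deviation, then places bounds at avg +/- 2 stdev. The multiplier `k` is a parameter introduced here for illustration.

```python
import statistics

def two_sigma_bounds(counts, k=2.0):
    """Return (avg, stdev, lower, upper) using avg +/- k * sample standard deviation."""
    avg = statistics.mean(counts)
    sd = statistics.stdev(counts)  # sample stdev; needs at least two values
    return avg, sd, avg - k * sd, avg + k * sd

# Hypothetical per-bucket counts.
counts = [12, 15, 11, 14, 13, 40, 12]
avg, sd, lower, upper = two_sigma_bounds(counts)
print(f"avg={avg:.1f} stdev={sd:.1f} bounds=({lower:.1f}, {upper:.1f})")
```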
10. Step 5) use eventstats to repair the time problem and add a “where” clause to only show those outside of +/-2 stdev
Question: Are these 161 results accurate?
(I hope you didn’t build an alert and get 161 of them!)
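Step 5 can be mirrored in Python as sketched below, with hypothetical hourly rows. SPL's `eventstats` computes aggregates and adds them to every result row without collapsing the rows, so each bucket keeps its own time. The `where` clause then keeps only the rows outside the bounds. The list comprehension in the sketch plays the role of `where`.

```python
import statistics

def flag_outliers(rows, k=2.0):
    """eventstats-style: compute avg/stdev over all rows and attach them to every
    row (each row keeps its own _time), then keep only rows outside avg +/- k*stdev."""
    counts = [r["count"] for r in rows]
    avg = statistics.mean(counts)
    sd = statistics.stdev(counts)
    enriched = [dict(r, avg=avg, stdev=sd) for r in rows]
    return [r for r in enriched
            if r["count"] > r["avg"] + k * r["stdev"]
            or r["count"] < r["avg"] - k * r["stdev"]]

# Hypothetical hourly buckets.
rows = [{"_time": i * 3600, "count": c}
        for i, c in enumerate([12, 15, 11, 14, 13, 40, 12])]
for r in flag_outliers(rows):
    print(r["_time"], r["count"])
```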
11. Problem: Statistical modeling is INCORRECT for this data
– A lower bound of avg - 2 stdev = -75 events makes no sense, because event counts can’t be negative
– How much confidence do you have in avg + 2 stdev?
Result:
• Wrong model = false positives/negatives
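The worked illustration below uses hypothetical numbers, not the deck's data. It shows how sparse, right-skewed counts push avg - 2 stdev below zero. Once that happens, the "too few events" side of the test can never fire. At the same time, the single large value inflates the stdev, which also makes the upper bound hard to trust.

```python
import statistics

# Hypothetical sparse, right-skewed counts per bucket (e.g. status=503 errors).
counts = [0, 0, 1, 0, 0, 2, 0, 0, 0, 35]

avg = statistics.mean(counts)
sd = statistics.stdev(counts)
lower, upper = avg - 2 * sd, avg + 2 * sd

# Counts are never negative, so a negative lower bound can never be crossed:
# a drop to zero events is invisible to this test.
print(f"lower={lower:.1f} upper={upper:.1f}")
print("lower bound is meaningless:", lower < 0)
```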
12. The Problem: +/-2 stdev assumes the data is Gaussian (Bell Curve)
Clearly, this data is better fit by a Poisson curve
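Here is a minimal sketch of the Poisson alternative, using hypothetical counts from a period assumed to be normal. It estimates λ as the sample mean, which is the maximum-likelihood estimate for a Poisson. The "normal range" comes from Poisson quantiles instead of avg +/- 2 stdev, so by construction it is a range of non-negative integers. This sketch only illustrates the modeling point. It is not the Bayesian method the deck later attributes to Prelert.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_quantile(q, lam):
    """Smallest integer k >= 0 with P(X <= k) >= q (q must be < 1)."""
    k = 0
    cdf = poisson_pmf(0, lam)
    while cdf < q:
        k += 1
        cdf += poisson_pmf(k, lam)
    return k

def poisson_upper_tail(k, lam):
    """P(X >= k): how surprising it is to see k or more events in a bucket."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

# Hypothetical per-bucket counts from a period assumed to be "normal".
counts = [2, 4, 3, 1, 5, 3, 2, 4, 3, 3]
lam = sum(counts) / len(counts)  # maximum-likelihood estimate of the Poisson rate

low, high = poisson_quantile(0.025, lam), poisson_quantile(0.975, lam)
print(f"lambda={lam}, roughly central 95% range: {low}..{high}")
print(f"P(X >= 12) = {poisson_upper_tail(12, lam):.2e}")
```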
13. Examples of Non-Gaussian Data
• status=503
• Memory Utilization
• CPU load
• status=404
• Revenue Transactions
14. One More Problem…
• Even if the demonstrated technique were accurate:
– You would still need to persist what you’ve learned “so far” so that you don’t have to keep re-inspecting historical data as new data comes in
– This requires you to manually write/read information to/from a summary index
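The deck's version of this persistence writes results to a summary index. The sketch below swaps in a different, plainly named technique: Welford's online algorithm for mean and variance. With it, only three numbers (n, mean, m2) need to be saved between runs, and new buckets are folded in without re-reading history. The class name and the JSON state format are hypothetical stand-ins for whatever store you actually use.

```python
import json
import math

class RunningStats:
    """Welford's online algorithm: update mean/variance one value at a time,
    so only (n, mean, m2) must be persisted instead of the raw history."""

    def __init__(self, n=0, mean=0.0, m2=0.0):
        self.n, self.mean, self.m2 = n, mean, m2

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def stdev(self):
        # Sample standard deviation; undefined (nan) for fewer than two values.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else float("nan")

    def to_json(self):
        return json.dumps({"n": self.n, "mean": self.mean, "m2": self.m2})

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))

# First run: learn from the history seen so far, then persist the state.
stats = RunningStats()
for c in [12, 15, 11, 14]:
    stats.update(c)
saved = stats.to_json()

# Later run: restore the state and fold in only the new buckets.
stats = RunningStats.from_json(saved)
for c in [13, 12]:
    stats.update(c)
print(stats.n, round(stats.mean, 2), round(stats.stdev, 2))
```

Persisting a running mean and stdev solves the re-inspection problem only. It does nothing about the wrong-model problem from the previous slides.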
15. Why Machine Learning?
• Overcome limitations of human analysis
• Auto-learn baseline behavior using proper modeling
• Detect anomalous behavior
16. First, an Analogy
• How could I accurately predict how much postal mail you are likely to get delivered to your home tomorrow?
17. I Would…
• Watch your mail delivery for a while
– 1 day?
– 1 week?
– 1 month?
– 1 year?
• Use my observations to create a…
22. Using Machine Learning to Build a Probability Distribution Function (PDF)
• The PDF must be built specifically for each “instance”
• The PDF should be constructed automatically, merely by watching the data
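One simple way to picture a PDF that is "built for each instance" and "constructed merely by watching the data" is a separate empirical histogram per instance, learned from (instance, value) pairs with no thresholds configured. The sketch below does exactly that. The class, host names, and values are hypothetical. A raw histogram assigns probability zero to values it has never seen, so practical systems smooth it or use a fitted distribution.

```python
from collections import Counter, defaultdict

class PerInstanceHistogram:
    """Learn a separate empirical distribution for each instance (e.g. per host)
    just by observing (instance, value) pairs."""

    def __init__(self):
        self.hist = defaultdict(Counter)

    def observe(self, instance, value):
        self.hist[instance][value] += 1

    def probability(self, instance, value):
        """Fraction of past observations for `instance` equal to `value`
        (0.0 if the instance has no observations yet)."""
        h = self.hist.get(instance)  # .get does not create an empty entry
        if not h:
            return 0.0
        return h[value] / sum(h.values())

model = PerInstanceHistogram()
# Hypothetical daily error counts for two hosts with very different baselines.
for v in [0, 1, 0, 0, 1]:
    model.observe("web-01", v)
for v in [20, 22, 19, 21, 20]:
    model.observe("db-01", v)

print(model.probability("web-01", 0))
print(model.probability("db-01", 0))
```

A count of zero is routine for `web-01` but has never been seen for `db-01`. That difference is why one shared distribution would mislead.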
26. Finding “what’s unexpected”…
Your job is often looking for unexpected change in your environment, either proactively through monitoring or reactively through diagnostics/troubleshooting
27. Using the PDF to Find What is Unexpected
[Figure: PDF with pieces of mail per day on the x-axis and % likelihood (probability) on the y-axis, annotated with the questions “zero pieces of mail?” and “fifteen pieces of mail?”]
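The sketch below reads the mail figure as a Poisson PDF. It assumes a hypothetical average of 5 pieces of mail per day, because the deck gives no rate. Under that assumption, both questions on the figure reduce to tail probabilities: how likely is a day this far from the mean, on this side of it? The -log10 score is one simple way to rank outcomes by "unlikelihood". It is an illustration, not Prelert's scoring.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def tail_probability(k, lam):
    """Probability of a count at least as far from the mean as k, on k's side:
    P(X <= k) when k is at or below the mean, P(X >= k) when above it."""
    if k <= lam:
        return sum(poisson_pmf(i, lam) for i in range(k + 1))
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

def unlikelihood(k, lam):
    """-log10 of the tail probability: 2 is about a 1-in-100 event, 3 about 1-in-1000."""
    return -math.log10(tail_probability(k, lam))

LAM = 5.0  # hypothetical: an average of 5 pieces of mail per day
for pieces in (0, 5, 15):
    p = tail_probability(pieces, LAM)
    print(f"{pieces:2d} pieces: tail p={p:.5f}, score={unlikelihood(pieces, LAM):.2f}")
```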
28. Relate back to data in Splunk
• # Pieces of mail = # events of a certain type
– number of failed logins
– number of errors of different types
– number of events with certain status codes
– etc.
• Or, performance metrics
– response time
– utilization %
30. • Prelert Anomaly Detective
– Automatically and correctly models data via self-learning
– Applies sophisticated Bayesian techniques
– Persists “ongoing” analysis to allow real-time alerting
– Makes it easy to use
3 significant alerts, not 161!
31. • Results are:
– Accurate outliers
– Automatically clustered and scored by their probabilistic “unlikelihood”
– Relevant in time, easy to turn into alerts
– Clickable for drill-down
32. • Drill-downs:
– Automatically construct useful search syntax and time selection
– Show anomalies in the context of the original data
– Serve as a possible jumping-off point for subsequent manual mining
33. Automated Anomaly Detection
• Less time searching & troubleshooting
• Proactive, trustworthy alerts without thresholds
• Auto-discovers the previously unknown
35. Use Case: Correlating Anomalies Across Data Types
• Data sources:
– App logs
– Network performance
– SQL Server metrics
• Prelert identifies network discards that cause the app to disconnect from the DB
36. Use Case: Servers making unusual TCP connections
• Data source: Netstat
• Prelert finds a rare FTP connection from a server that doesn’t normally use FTP
37. Use Case: Revenue Transactions
• Data source: Custom logs
• Prelert identifies an unusual $0.60 transaction, traced to a bug in currency conversion
38. Use Case: Clients pervasively visiting rare URLs
• Data source: BlueCoat proxy
• Prelert identifies users abusing Internet privileges (gambling sites, porn sites)
39. Use Case: Monitoring Performance w/o Thresholds
• Response time of an online bank website
• Prelert alerts on spikes without the need to create a single threshold
40. Use Case: Unusual outbound traffic rates
• Data source: BlueCoat proxy
• Prelert identifies a client attempting to exploit an outside IIS webserver