Practical Applications of Machine Learning in Cybersecurity
McAfee Confidential
April 25, 2019
Celeste Fralick, Ph.D., CQA
Senior Principal Engineer, Chief Data Scientist
Office of the CTO, McAfee
The Analytics Hype-line
Loosely based on https://en.wikipedia.org/wiki/Timeline_of_machine_learning
• 1940: Predictive Analytics Emerge
• 1956: AI Proposed by John McCarthy
• 1958: Neural Networks Emerge by Frank Rosenblatt
• 1969: Neural Networks Dismissed
• 2001: Data Scientists Emerge
• 2005: Big Data Emerges
• 2011: Watson Makes AI Interesting Again
• 2015: Neural Networks Acceptable
• 2016: Machine Learning Solves Everything
• 2018: AI = All Analytics
Not to scale
Demystifying Analytic Terms
• Structured data: Data that resides in a fixed field within a record or file, including relational databases and spreadsheets
• Unstructured data: Data that is not organized in a pre-defined manner, including text-heavy docs & social media
• Semi-structured data: Data that does not conform strictly to relational databases, but contains tags/markers to enable hierarchy
• Reinforcement Learning: Learning that maximizes rewards based on exploration and exploitation of known environments (e.g., a baby learning to walk)
Why do we care about these terms? It helps to select models & features!
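To make the three data terms concrete, here is a minimal Python sketch. The log row, the JSON event, and the free-text report are all hypothetical examples invented for illustration, not McAfee data formats.

```python
import csv
import io
import json
import re

# Structured: fixed fields in a fixed order (hypothetical connection log).
structured = "src_ip,dst_port,bytes\n10.0.0.5,443,1200\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tags/markers give hierarchy, but fields may vary per record.
semi_structured = '{"host": "pc-17", "alerts": [{"type": "macro", "score": 0.8}]}'
event = json.loads(semi_structured)

# Unstructured: free text; structure must be extracted before modeling.
unstructured = "User reported a strange invoice.docm attachment from an unknown sender."
tokens = re.findall(r"[a-z0-9.]+", unstructured.lower())


def summarize():
    """Return one value pulled from each kind of data."""
    port = int(rows[0]["dst_port"])           # direct field lookup
    score = event["alerts"][0]["score"]       # walk the tagged hierarchy
    has_macro_doc = "invoice.docm" in tokens  # search extracted tokens
    return port, score, has_macro_doc
```

The structured row supports a direct column lookup, the semi-structured event needs a walk through its hierarchy, and the unstructured text needs an extraction step first, which is one reason the data type influences the choice of features and models.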
Demystifying Analytic Terms: What’s a “Feature”?
Example: features that could affect a cup of coffee
• Type of machine
• Age of machine
• Cleanliness of machine
• Temperature of water
• Type of water
• Brand of coffee
• Origin of coffee
• Grind of coffee
• Type of roast
• Organic coffee
• Mug or cup
A Feature is an individual measurable property or characteristic that enables the desired output.
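As a rough illustration of how such features reach a model, the sketch below encodes one observation as a numeric feature vector. The field names, the roast categories, and the values are assumptions made up for this example.

```python
# Hypothetical observation: one brew described by a few of the features above.
brew = {"machine_age_years": 3, "water_temp_c": 93.0, "roast": "dark", "organic": True}

ROASTS = ["light", "medium", "dark"]  # assumed category list for one-hot encoding


def to_feature_vector(obs):
    """Turn one observation into a flat list of numbers a model can consume."""
    one_hot_roast = [1.0 if obs["roast"] == r else 0.0 for r in ROASTS]
    return [
        float(obs["machine_age_years"]),   # numeric feature, used as-is
        obs["water_temp_c"],               # numeric feature, used as-is
        *one_hot_roast,                    # categorical feature, one-hot encoded
        1.0 if obs["organic"] else 0.0,    # boolean feature as 0/1
    ]


print(to_feature_vector(brew))
```

Each position in the vector is one measurable property; a model only ever sees these numbers, so a feature that is left out cannot influence the output.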
Pyramid of Complexity and Intelligence in Analytics
(layers listed from top to bottom; Complexity & Intelligence axis)
• AI: Reason, logic, value judgments
• Deep Learning: Complex, layered
• Machine Learning: Trains & learns, patterns
• Statistics: Models, summary stats
• Architecture and Data Management: Data lineage, compute capability
McAfee products shown alongside the pyramid:
• McAfee Investigator
• McAfee ATD
• Real Protect
• Mobile Security
The McAfee Analytic Ecosystem: ML/DL/AI Applications
[Figure: ecosystem diagram with components Cloud, McAfee Threat Research, On Premises, Security Operation Center, and Gateway, annotated with ML, DL, and AI labels]
Via telemetry, threat analyses, and industry feeds, McAfee integrates expert analytics throughout the security ecosystem
Risks in Analytic Development
• Poor intelligence leads to bad business decisions
• Unhappy customers, reduced ROI & ROA
• Lack of growth and cash generation
• Increased False Positives and False Negatives
Examples of Specific Risks in Analytic Development
RISK: DESCRIPTION
• Bias: Statistical, human, ethics, intent
• Adversarial Machine Learning: Evading or poisoning of training or test sets
• Lack of Explainability (XAI): How are decisions made? Liability?
• Citizen Data Scientists: Data + one model ≠ data science
• Poor Scientific Protocol: Repeatable analytic development process
• Data Decay: How long will the model last in the field? Implications of changes, periodic training?
Why are there so many “citizen” data scientists?
• “Sexy” title (HBR), LOTS of data
• Demand for immediate business intelligence & action
• Too many areas to learn
• Too few data scientists
• Ill-defined job role
• “Easy to learn” mentality without underlying statistical fundamentals
Credits: CIO Journal (2014) and B. Marr (2016)
What a Data Scientist Needs to Know
• Statistics
• Math
• SW/HW
• Domain
• Data Mgmt & Arch
• System Engineering
• Analytics
Analytic Life Cycle (Waterfall)
Phases: Planning, Exploration, Development, Production
Steps:
• Define Usage Model & Problem Framing
• Define Requirements
• State of Art Assessment
• Analytic Risk Assessment
• Analytic Plan & Peer Review
• Discover, develop & iterate analytics
• Verification & Validation
• Analytic Report & Peer Review
• Post Production Release Analytic Review(s)
• Analytic Discontinuance
Identify, Quantify, Mitigate, and Learn Analytic Risks
(also, use these questions to check your Data Scientist!)
Life cycle step: Analytic Risk Assessment (Exploration)
• Does the Training sample represent the larger and final population? How do you know?
• Is the sample balanced? If not, why not?
• What is your expected compute footprint?
• What 3-5 models will be attempted? What error rates will be compared?
• How well will the proposed models explain the expected output? (Explainability)
• How vulnerable are the algorithms to AML?
• How often will the algorithm learn?
• How will model drift be detected in the field?
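Two of the questions above (sample balance and drift detection) lend themselves to a quick numeric check. The sketch below is one possible approach, not a method prescribed by this deck: it reports class proportions and computes a Population Stability Index (PSI) between a training-time sample and a field sample. The function names, the bin count, and the `eps` floor are illustrative assumptions.

```python
import math
from collections import Counter


def class_balance(labels):
    """Fraction of the sample belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`.

    Bins are equal-width over the range of `expected`; values outside that
    range are clamped into the first or last bin. Empty bins are floored at
    `eps` so the logarithm stays defined.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Usage with made-up data: a 1:3 sample, then a feature whose field
# distribution has shifted upward compared with training.
balance = class_balance(["malware"] + ["clean"] * 3)
train_scores = [i / 100 for i in range(100)]
field_scores = [0.5 + i / 200 for i in range(100)]
drift = psi(train_scores, field_scores)
```

A PSI of zero means the two samples fall into the bins in identical proportions; larger values indicate more shift. Any alert threshold is a judgment call that should be set and justified in the risk assessment itself.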
Analytic Life Cycle (Agile)
Steps:
• Define Usage Model & Problem Framing
• Define Requirements
• State of Art Assessment
• Analytic Risk Assessment
• Analytic Plan & Peer Review
• Discover, develop & iterate analytics
• Validation & Verification
• Analytic Report & Peer Review
• Post Production Release Analytic Review(s)
• Analytic Discontinuance
Validation: Have you done the RIGHT analytic?
• Trace back to customer use case and contract
• e.g., causal relationships, flow charts, visuals, graphs
Verification: Have you done the analytic RIGHT?
• Verify the mathematics and model fit
• e.g., ROC, RMSE, R², confidence limits
ROC image: https://commons.wikimedia.org/wiki/File%3ARoccurves.png
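The verification metrics named above can be computed in a few lines. The following is a minimal pure-Python sketch using the standard textbook definitions; the function names and the sample values are chosen for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up detector scores: three of the four positive/negative pairs
# are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. RMSE and R² apply to regression-style outputs, and ROC applies to classifiers, so which metric to report depends on the analytic being verified.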
Summary
• Understand & mitigate the hype
• Risks are inherent in Analytics
• Utilize an Analytic Development Protocol
• Perform an Analytic Risk Assessment
• Validate & Verify!