Fundamentals of Machine Learning: Perspectives from a Data Scientist (ISC West 2018)

Dr. Sven Krasser, Chief Scientist, CrowdStrike, Inc.

Unsupervised Learning
Clustering
[Figure: clusters labeled 1, 2, and 3]

Supervised Learning
Classification

1988 Anthropometric Survey of Army Personnel

Source: http://mreed.umtri.umich.edu/mreed/downloads.html#anthro
• Over 4000 soldiers surveyed
• Over 100 types of measurements
• Reported by gender

FIRST LOOK
[Figure axes: Height [mm], Density]
• Difference in distribution
• Significant overlap

SECOND DIMENSION
[Figure axes: Height [mm], Weight [10^-1 kg]]
• Correlation
• Overlap

FEATURE SELECTION
[Figure axes: “Buttock Circumference” [mm], Weight [10^-1 kg]]
• Correlation
• Gender-specific slope
• Reduced overlap
• Selection of features matters
• How to make a prediction?

k-NEAREST NEIGHBOR
[Figure axis: Weight [10^-1 kg]; legend: m, f]
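
A minimal sketch of how a k-nearest-neighbor prediction could work on two features like the ones plotted above. The training values are made up for illustration and are not taken from the survey; `TRAIN` and `knn_predict` are hypothetical names. In practice the features would usually be scaled first so that neither one dominates the distance.

```python
import math
from collections import Counter

# Hypothetical samples: (circumference [mm], weight [10^-1 kg]) -> label.
# These numbers are invented for illustration, not survey data.
TRAIN = [
    ((950.0, 600.0), "f"),
    ((980.0, 620.0), "f"),
    ((1000.0, 650.0), "f"),
    ((1050.0, 720.0), "f"),
    ((950.0, 720.0), "m"),
    ((980.0, 760.0), "m"),
    ((1000.0, 800.0), "m"),
    ((1050.0, 880.0), "m"),
]


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict(TRAIN, (990.0, 780.0)))
print(knn_predict(TRAIN, (1000.0, 640.0)))
```

The choice of k is a tuning knob: a small k follows the training data closely, a larger k smooths the decision.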

SUPPORT VECTOR MACHINE
[Figure axis: Weight [10^-1 kg]]
• Overfitting
• Classifier does not generalize
• Let’s take a closer look…

LET’S CLASSIFY
[Figure axis: Weight [10^-1 kg]]
• Classifier generalizes
• Note some misclassifications
• Let’s assume we want to detect males (blue)
– I.e., “blue” is our positive class

LET’S CLASSIFY
[Figure axis: Weight [10^-1 kg]]
• Get more “blue” right (true positives)
• Get more “red” wrong (false positives)

RECEIVER OPERATING CHARACTERISTIC CURVE
[Figure axes: False Positive Rate, True Positive Rate]
Detect more by accepting more false positives
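
A rough sketch of where the points on such a curve come from: sweep a decision threshold over classifier scores and record the false positive rate and true positive rate at each step. The scores and labels below are invented for illustration, and `roc_points` is a hypothetical helper, not part of any library.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs.

    One pair per distinct threshold, from strictest to most lenient,
    preceded by the (0, 0) point where nothing is flagged.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Invented example: higher score means "more likely positive class".
SCORES = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
LABELS = [1, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(SCORES, LABELS):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves up and to the right along the curve: more positives are caught, and more negatives are flagged along with them.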

MORE DIMENSIONS
• Some 160 dimensions
• Projected back to 2-dimensional screen
• Perfect separation

area codes:
400 401 402 403 404 405 406 407 408 409 410 411 412 413 414
  0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
  0   0   0   0   0   0   0   1   0   0   0   0   0   0   0
  0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
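
The table above encodes each area code as its own column, with a single 1 per row (commonly called one-hot encoding). A small sketch that reproduces those six rows; the area codes 403, 404, 401, 411, 407, 404 are read off the positions of the 1s in the table, and `one_hot` is a hypothetical helper name.

```python
# Columns 400..414, as in the table header.
AREA_CODES = list(range(400, 415))


def one_hot(code, categories=AREA_CODES):
    """Return a list with a 1 in the column matching code and 0 elsewhere."""
    return [1 if code == c else 0 for c in categories]


rows = [one_hot(code) for code in (403, 404, 401, 411, 407, 404)]
for row in rows:
    print(" ".join(str(v) for v in row))
```

One categorical field with 15 possible values becomes 15 dimensions, which is how the dimension count grows so quickly.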

Mission Accomplished
We just add more dimensions… right?

If not for the…
Curse of Dimensionality

Source: https://commons.wikimedia.org/w/index.php?curid=2257082

Dimensionality and Sparseness
[Figure axes: Height [mm], Weight [10^-1 kg]]
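
One way to put numbers on the sparseness, as a back-of-the-envelope sketch: split every feature into a fixed number of bins and count how many cells the resulting grid has. The bin count of 10 is an arbitrary assumption, the sample count of 4000 is a round figure based on the "over 4000 soldiers" mentioned earlier, and `max_occupied_fraction` is a hypothetical helper.

```python
def max_occupied_fraction(n_samples, bins_per_axis, n_dims):
    """Upper bound on the fraction of grid cells that can contain a sample.

    The grid has bins_per_axis ** n_dims cells, and each sample can occupy
    at most one of them.
    """
    cells = bins_per_axis ** n_dims
    return min(1.0, n_samples / cells)


for dims in (1, 2, 3, 4, 6, 10):
    fraction = max_occupied_fraction(4000, 10, dims)
    print(f"{dims:2d} dimensions: at most {fraction:.6%} of cells occupied")
```

With the same amount of data, each added dimension leaves a larger share of the feature space without any observation at all.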

InfoSec Applications
File Analysis

Engineered Features for Executable Files
• 32/64 BIT EXECUTABLE
• SUBSYSTEM TYPE
• MACHINE INSTRUCTION DISTRIBUTION
• FILE SIZE
• TIMESTAMP
• DEBUG INFORMATION PRESENT
• PACKER TYPE
• FILE ENTROPY
• NUMBER OF SECTIONS
• NUMBER WRITABLE SECTIONS
• NUMBER READABLE SECTIONS
• NUMBER EXECUTABLE SECTIONS
• DISTRIBUTION OF SECTION ENTROPY
• IMPORTED DLL NAMES
• IMPORTED FUNCTION NAMES
• COMPILER ARTIFACTS
• LINKER ARTIFACTS
• RESOURCE DATA
• PROTOCOL STRINGS
• IPS/DOMAINS
• PATHS
• PRODUCT METADATA
• DIGITAL SIGNATURE
• ICON CONTENT
• …
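
As an illustration of what two of the simpler features above might look like in code, here is a sketch that computes file size and byte-level Shannon entropy from raw file content. This is an assumption about how such features could be derived, not a description of any particular product; `byte_entropy` and `basic_features` are hypothetical names.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def basic_features(data: bytes) -> dict:
    """Return a small feature dictionary for one file's raw content."""
    return {"file_size": len(data), "file_entropy": byte_entropy(data)}


print(basic_features(b"\x00" * 64))        # constant content: entropy 0
print(basic_features(bytes(range(256))))   # every byte value once: entropy 8
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is one reason entropy is a commonly used signal for files of this kind.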

• Unstructured file content
• Algorithm uncovers interesting properties
• Requires a lot more input data
• Unlocks more insight
• “Deep Learning”

[Figure axes: String-based feature, Executable section size-based feature]

[Figure axes: Subspace Projection A, Subspace Projection B]

99% DETECTION RATE
1% FALSE POSITIVES
Malware?

Malware
99% DETECTION RATE
1% FALSE POSITIVES

99% DETECTION RATE
1% FALSE POSITIVES
Not Malware
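
One reading of these three slides, consistent with the "Base rate" item later in the deck, is that the same detector can give either answer depending on how common malware is among the files it sees. The sketch below applies Bayes' rule to a flagged file. The two base rates (50% and 0.1%) are assumptions chosen for illustration, and `posterior_malware` is a hypothetical helper.

```python
def posterior_malware(tpr, fpr, base_rate):
    """Probability that a flagged file is actually malware (Bayes' rule).

    tpr: detection rate (true positive rate)
    fpr: false positive rate
    base_rate: fraction of all scanned files that are malware
    """
    flagged_malware = tpr * base_rate
    flagged_clean = fpr * (1.0 - base_rate)
    return flagged_malware / (flagged_malware + flagged_clean)


# Assumed base rates, for illustration only.
for base_rate in (0.5, 0.001):
    p = posterior_malware(tpr=0.99, fpr=0.01, base_rate=base_rate)
    print(f"base rate {base_rate:.1%}: P(malware | flagged) = {p:.1%}")
```

With half of all files malicious, a flagged file is malware about 99% of the time. With one in a thousand malicious, it is malware only about 9% of the time, even though the detector itself has not changed.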

99% True Positive Rate
[Figure: chance of at least one success for adversary vs. number of attempts; labeled values: 1% and >99.3% (500 attempts)]
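
The figures on this slide can be reproduced with a short calculation, assuming each attempt is missed independently with probability 1 minus the true positive rate. The independence assumption is mine, and `chance_of_at_least_one_success` is a hypothetical helper.

```python
def chance_of_at_least_one_success(true_positive_rate, attempts):
    """Probability that at least one of `attempts` tries goes undetected.

    Assumes attempts are detected independently of one another.
    """
    return 1.0 - true_positive_rate ** attempts


for attempts in (1, 10, 100, 500):
    p = chance_of_at_least_one_success(0.99, attempts)
    print(f"{attempts:3d} attempts: {p:.1%}")
```

A 1% miss rate per attempt grows to more than 99.3% over 500 attempts, which matches the values labeled on the slide.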

• Large datasets require algorithmic approaches
– Many sensors, e.g. IoT
– Large input, e.g. video surveillance
– Complex relationships, e.g. social graph
• Hidden structure
• Better accuracy, better response time

Why deploy an ML-based technology?
• Making the most out of available data
• Less friction, better customer experience
• Automation
• Empiricism (but be careful of bias in input data)

Why build ML-enabled products?
• Increasingly effective and viable technology
• Mind the innovator’s dilemma
• Replace rule-based systems
– ML modeling is repeatable
– Maintainability
– Measurability

Beyond the Hype: Recognizing Solid ML
• True positive/false positive trade-off
– ROC curve
– Base rate
– Overfitting
• What is the data?
– Does the data intuitively contain signal?
– What is the system trained on?
• Training data applicable to your use case
• Ground truth

• Making defense easier
• But: also making attack easier
– Adversarial models
– Adversarial examples
“Adversarial Patch,” Brown et al.,
https://arxiv.org/abs/1712.09665

Some Adversarial Challenges for the Physical Domain
• Autonomous systems
– Malicious use of e.g. drones
– Manipulating autonomous systems (self-driving cars)
• Spoofing
– Lyrebird
– DeepFake
• Adversarial data
– Circumvent facial recognition
– Road signs etc.

Getting Started with Scikit-Learn
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> # Load the iris dataset: 150 flowers, 4 measurements each, 3 classes
>>> iris = load_iris()
>>> # Train a decision tree on all samples
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
>>> # Predict the class of the first sample
>>> clf.predict(iris.data[:1, :])
array([0])
Source: http://scikit-learn.org/stable/modules/tree.html#classification
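
The snippet above predicts on a sample the tree was trained on. To check whether a classifier generalizes (the overfitting concern raised earlier in the deck), a common approach is to hold back part of the data for testing. The following is a minimal sketch under assumed settings: the 30% test split and `random_state=0` are arbitrary choices, not taken from the slides.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold back 30% of the samples; the classifier never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Accuracy on data the tree has seen versus data it has not seen.
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

A large gap between the two numbers is a sign that the model has memorized the training data rather than learned something that carries over to new samples.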

https://developers.google.com/machine-learning/crash-course/
