Getting Started with Machine Learning
for Incident Detection
August 2016 | Target. Hunt. Disrupt.
Chris McCubbin, Director of Data Science, Sqrrl
David J. Bianco, Security Technologist, Sqrrl
© 2016 Sqrrl Data, Inc. All rights reserved.
  
A story we all know: Regular expressions
●  “Good theory leads to good programs”
●  Who here has implemented and optimized a Nondeterministic Finite Automaton compiler?
●  You probably use one every day
    ●  Regex: grep, Perl
●  You don’t care how it works inside
●  But you might need to know some quirks
    ●  Regex can’t count (look up “regex HTML” on Stack Overflow)
    ●  Grep has no ‘bad cases’
    ●  Perl is more powerful (lazy matching, backreferences)
●  But it is helpful to know what it’s good for, how to use it, etc.
  
Agenda
●  What is Machine Learning (ML) good at?
●  How does ML work? What are the quirks of useful Machine Learning techniques?
●  Can I use Machine Learning easily?
●  How can you customize & improve our examples?
  
When’s the last time you heard…?
“It’s a Best Practice to review your logs every day.”
  
Machine-Assisted Analysis
Practical Cyborgism for Security Operations
COMPUTERS
●  Bad at context and understanding
●  Good at repetition and drudgery
●  Algorithms work cheap!
PEOPLE
●  Contextual analysis experts who love patterns
●  Possess curiosity & intuition
●  Business knowledge
EMPOWERED ANALYSTS
●  Good results from massive amounts of data
●  Agile investigations
●  Quickly turn questions into insight
  
Problem Statement: HTTP Proxy Logs
  
Our solution: Clearcut!
  
Two different types of machine learning
●  Supervised
    ●  Have labeled training data?
    ●  Classification algorithms
    ●  Random Forests
●  Unsupervised
    ●  No labeled training data
    ●  Assume attacks are rare
    ●  Outlier Detection
    ●  Isolation Forests
    ●  Clustering
  
Supervised: Binary Classification
Given a population of two types of “things”, can I find a function that separates them into two classes?

Maybe it’s a line, maybe it’s not.

Nothing’s perfect, but how close can we get?

If we derive a function that does reasonably well at separating the two classes, that’s our binary classifier!

Fortunately, Python has pantsloads of libraries that can do this for us. The machine can learn the function given enough samples of each class.
  
Classification With Random Forests
1.  Identify positive and negative sample datasets
2.  Clean & normalize the data
3.  Partition the data into training & testing datasets
4.  Select & compute some interesting features
5.  Train a model
6.  Test the model
7.  Evaluate the results
8.  .	
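The numbered steps above can be sketched end to end with scikit-learn. This is a toy illustration, not the Clearcut code: the two numeric features, the class sizes, and the forest size are all invented, and step 4 is skipped because the made-up features are already numeric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Steps 1-2: toy "normal" (class 0) and "malicious" (class 1) samples.
# The two numeric features are made up for illustration.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
malicious = rng.normal(loc=4.0, scale=1.0, size=(500, 2))
X = np.vstack([normal, malicious])
y = np.array([0] * 500 + [1] * 500)

# Step 3: hold out 20% of the labeled rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Steps 5-6: train on the training split, predict on the held-out split.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Step 7: summarize the result with a single F1 number.
score = f1_score(y_test, predictions)
print(f"F1 = {score:.3f}")
```

Real proxy logs need the cleaning and feature steps first; the point here is only how little code the train/test/evaluate part takes.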
  
  
Generating synthetic abnormal data
●  Perhaps we don’t have any malware data, but we have normal data.
●  If we could make some synthetic abnormal data, we could still use the same methods
●  One-class classification
●  How should we create the data?
●  One option: ‘Noise-contrastive estimation’: Generate noise data that looks real-ish, but has no real structure and contrast that to the normal data
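One simple way to get noise that "looks real-ish, but has no real structure" is to shuffle each column independently, so every value is one that really occurred but the combinations are scrambled. This is only a sketch of that idea under stated assumptions: the function name and the sample rows are hypothetical, and the slides do not say this is how Clearcut builds its noise.

```python
import random

def make_noise_rows(rows, seed=0):
    """Shuffle every column independently across rows.

    Each column keeps exactly the same values (so the noise looks
    "real-ish"), but the pairing between columns is scrambled.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(values) for values in zip(*columns)]

# Hypothetical normal rows: (method, status code, response length).
normal_rows = [
    ("GET", 200, 5120),
    ("GET", 200, 4800),
    ("POST", 302, 0),
    ("GET", 404, 310),
    ("HEAD", 200, 0),
]
noise_rows = make_noise_rows(normal_rows)

# Label real rows 0 (normal) and synthetic rows 1 (abnormal).
labeled = [(row, 0) for row in normal_rows] + [(row, 1) for row in noise_rows]
```

A shuffled row can occasionally coincide with a real one, so on small datasets some "abnormal" rows will be indistinguishable from normal ones.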
  
  
Decision Trees
Greedily grow tree by choosing feature that explains the class the most

Split the training set into two sets, repeat

Form a classifier by “walking down the tree”

Issue: overfitting
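The greedy step can be sketched in plain Python: try every feature and threshold, and keep the split whose two halves are purest. Gini impurity is used here as the measure of how well a feature "explains the class"; that choice, the function names, and the toy rows are assumptions for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (feature index, threshold, weighted impurity) of the best split."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or impurity < best[2]:
                best = (feature, threshold, impurity)
    return best

# Toy rows: (number of dots in the domain, response length); 1 = malicious.
rows = [(1, 500), (2, 900), (1, 700), (5, 650), (6, 800), (7, 550)]
labels = [0, 0, 0, 1, 1, 1]
feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

A full tree repeats `best_split` on each half until the halves are pure, which is exactly where the overfitting comes from.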
  
  
Random Forests
Sample training set with replacement

Fit a decision tree to the sample

Repeat n times

Form a classifier by averaging the n decision trees

http://www.rhaensch.de/vrf.html
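The four steps above are short enough to write out by hand. In this sketch each "tree" is shrunk to a single threshold on one feature so the bootstrap-and-average loop stays visible; the function names and toy data are invented, and real Random Forests also pick a random subset of features at each split.

```python
import random

def fit_stump(samples):
    """Fit a one-feature threshold rule: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((x > threshold) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_forest(samples, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(samples, k=len(samples)))
            for _ in range(n_trees)]

def predict(forest, x):
    """Average the votes of all stumps and round to a class."""
    votes = sum(x > threshold for threshold in forest)
    return int(votes * 2 > len(forest))

# Toy data: (feature value, label); values and labels are invented.
samples = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
           (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
forest = fit_forest(samples)
print(predict(forest, 0.5), predict(forest, 8.0))
```

Each stump sees a slightly different sample, so their individual quirks tend to cancel out in the vote.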
  
  
Unsupervised: Outlier Detection
●  Given a population of “things”, can I find a function that tells me which ones look weird?
●  Can also pretend to be a classifier (class 0 = normal, class 1 = weird)
●  Loads of ways to accomplish this: distance to your neighbors, angle-based methods, isolation-based methods
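As a minimal sketch of the "distance to your neighbors" idea, the score below is the distance from each point to its k-th nearest neighbor, and a hand-picked threshold turns it into the 0/1 pretend classifier. The points, `k`, and the threshold are all made up for illustration.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Toy 2-D feature vectors; the last one sits far from the rest.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (8.0, 9.0)]
scores = knn_distance_scores(points)

# "Pretend to be a classifier": class 1 = weird, class 0 = normal.
threshold = 2.0  # hand-picked for this toy data
classes = [1 if s > threshold else 0 for s in scores]
print(classes)
```

This brute-force version compares every pair of points, so it is only practical for small samples.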
  
	
  
	
  
	
  
  
Isolation Forests [Liu, Ting, Zhou]
http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
Pick a dimension at random. Pick a value at random.

Make a tree by splitting the set into two sets, repeat. Stop when the set is a single point.

Do this for many trees.

Form an outlier detector by the average depth that a point is isolated in each tree (deeper is more inlier-y)

Issue: enumerated types
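The recipe above fits in a few lines of plain Python. Rather than building whole trees, this sketch follows only the branch that contains the point of interest, which gives the same depth. The function names, the depth cap, and the toy data are assumptions; it is not the scikit-learn implementation.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Depth at which random splits leave `point` alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))            # pick a dimension at random
    low = min(row[dim] for row in data)
    high = max(row[dim] for row in data)
    if low == high:                            # simplification: give up on ties
        return depth
    split = rng.uniform(low, high)             # pick a value at random
    # Keep only the side of the split that contains `point`, then repeat.
    same_side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy data: a tight cluster plus one far-away point.
cluster_rng = random.Random(1)
data = [(cluster_rng.gauss(0, 1), cluster_rng.gauss(0, 1)) for _ in range(100)]
outlier = (10.0, 10.0)
data.append(outlier)

print(average_depth(outlier, data), average_depth(data[0], data))
```

The far-away point is usually cut off after a split or two, while a point inside the cluster needs many more, so a low average depth is the outlier signal.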
  
  
A quick note about parameters
Choosing parameters can be important

Can use expert knowledge or ad-hoc methods

Dimitar Karev (MIT RSI Intern) tested a range of parameters for Clearcut iforests using exhaustive search (for forest params) and a genetic algorithm (for features)

Result was a huge improvement in F1 (see ROC curves)
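An exhaustive search over forest parameters is just a loop over every combination, scoring each one. The sketch below shows only that half (no genetic algorithm for features); the grid values, the toy data, and the use of F1 on the fitted data are assumptions, not the settings from the study.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy data: 300 "normal" rows near the origin and 15 "weird" rows
# scattered on a wide ring around them (all invented).
normal = rng.normal(0.0, 1.0, size=(300, 2))
angles = rng.uniform(0.0, 2 * np.pi, size=15)
radii = rng.uniform(8.0, 12.0, size=15)
weird = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X = np.vstack([normal, weird])
y_true = np.array([1] * 300 + [-1] * 15)   # scikit-learn convention: -1 = outlier

# Hypothetical grid; these ranges are not the ones used in the study.
grid = {
    "n_estimators": [25, 100],
    "max_samples": [64, 256],
}

best_params, best_f1 = None, -1.0
for n_estimators, max_samples in product(grid["n_estimators"], grid["max_samples"]):
    clf = IsolationForest(n_estimators=n_estimators, max_samples=max_samples,
                          contamination=15 / 315, random_state=0)
    y_pred = clf.fit(X).predict(X)
    score = f1_score(y_true, y_pred, pos_label=-1)
    if score > best_f1:
        best_params, best_f1 = (n_estimators, max_samples), score

print(best_params, best_f1)
```

The cost grows with the product of the grid sizes, which is why the feature choices were searched with a different method.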
  
  
Classification With Isolation Forests
1.  Identify positive and negative sample datasets
2.  Clean & normalize the data
3.  Partition the data into training & testing datasets
4.  Select & compute some interesting features
5.  Train a model
6.  Test the model
7.  Evaluate the results
8.  
9.  Notice similarities
  
The beauty of scikit-learn & python
●  Gists to perform many types of learning are simple and consistent
    ●  Take same data as input (supervised requires an extra column)
    ●  Signatures of methods are the same
●  Example: RF’s vs iForests
    ●  Changed a few lines of code for training
    ●  Classes are a bit different (0/1 vs 1/-1)
    ●  Can re-use the analysis script with nearly no change
	
  
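import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# opts, train, test and testnoclass are defined elsewhere in the script (not shown on the slide).

# RF: supervised, so the 'class' column is passed to fit() as the label
clf = RandomForestClassifier(n_jobs=4,
                             n_estimators=opts.numtrees, oob_score=True)
y, _ = pd.factorize(train['class'])
clf.fit(train.drop('class', axis=1), y)
test['prediction'] = clf.predict(testnoclass)

# iF: unsupervised, so fit() gets only the feature columns
clf = IsolationForest(n_estimators=opts.numtrees)
clf.fit(train.drop('class', axis=1))
test['prediction'] = clf.predict(testnoclass)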
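The "0/1 vs 1/-1" difference is the one thing to handle before re-using an analysis script: scikit-learn's `IsolationForest.predict` returns +1 for inliers and -1 for outliers. A minimal sketch of the conversion, on invented data, might look like this.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 2))            # invented "normal" rows
test = np.array([[0.1, -0.2], [0.3, 0.4], [9.0, 9.0]])  # last row is far away

clf = IsolationForest(n_estimators=100, random_state=0).fit(train)
raw = clf.predict(test)               # +1 = inlier, -1 = outlier

# Map to the Random Forest convention: 0 = normal, 1 = malicious/weird.
prediction = np.where(raw == -1, 1, 0)
print(raw, prediction)
```

After the mapping, the same 0/1 evaluation code works for both models.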
  
  
Identifying Training & Test Data
[Diagram labels: Malicious Data; Label = normal; All Labeled Data; Training Data; Test Data]
  
Feature extraction
Many classifiers want to work with numeric features.
We use a ‘flow enhancing’ step to add some convenience columns to the data

Some columns are already numeric

Some columns have easy-to-extract numeric info: number of dots in URL, entropy in TLD

Categorical columns can be converted to “Bag of words” (BOW): N binary features, one for each category

Text-y columns can be converted to BOW or Bag-of-Ngrams (BON)

Use TF-IDF to determine which features to keep
[Figure: the text “The quick brown fox….” is cut into character n-grams (e.g. “The q”, “ck br”), which are matched against a list of n-grams to give a binary vector such as 1 0 0 1 0 0]
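A few of these features are easy to compute by hand, which helps when reading the feature rankings later on. This is a rough sketch, not the Clearcut featurizer: the helper names, the n-gram length of 7, the sample user agent, and the tiny vocabulary are all assumptions.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of the characters in `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def char_ngrams(text, n=7):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bag_of_ngrams(text, vocabulary, n=7):
    """Binary vector: 1 if the vocabulary n-gram occurs in `text`, else 0."""
    present = set(char_ngrams(text, n))
    return [1 if gram in present else 0 for gram in vocabulary]

domain = "download.virtualbox.org"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11)".lower()

features = {
    "domainNameDots": domain.count("."),
    "domainNameLength": len(domain),
    "userAgentEntropy": shannon_entropy(user_agent),
}
# Hypothetical 7-gram vocabulary; a real one would be learned from the data.
vocabulary = ["mac os ", " os x 1", "; intel", "windows"]
vector = bag_of_ngrams(user_agent, vocabulary)
print(features, vector)
```

In practice the vocabulary is far larger, which is where a TF-IDF style filter earns its keep.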
  
Training, Testing & Evaluating a Model
% ./train_flows_rf.py -o data/http-malware.log http-training.log
Reading normal training data
Reading malicious training data
Building Vectorizers
Training
Predicting (class 0 is normal, class 1 is malicious)
class  prediction
0      0             12428
       1                15
1      0                19
       1              9563
dtype: int64
F1 = 0.998225469729
  
Read the Bro data files into a Pandas data frame.

Each row is labeled either ‘benign’ or ‘malicious’.
  
Random Forest requires numeric data, so we have to convert strings.

Primarily two methods:
●  Bag of Words (method, status code)
●  Bag of N-Grams (domain, user agent)
  
Split all the labeled data into ‘training’ (80%) and ‘test’ (20%) datasets.

Now feed all the training data through the Random Forest to produce a trained model.

At this point, we do nothing with the test data.
  
Now we run the ‘test’ data through the trained model. It’s still labeled, so we know what the answer should be.

We compare the expected results with the actual prediction and create a little table.

We don’t expect perfect results, but we’d like to see most of the data in the 0/0 and 1/1 rows.
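The "little table" is just a count of each (class, prediction) pair. A minimal sketch with the standard library, using ten made-up test rows rather than real output:

```python
from collections import Counter

# Hypothetical labels for ten test rows (0 = normal, 1 = malicious).
actual =    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Count how often each (actual class, predicted class) pair occurs.
table = Counter(zip(actual, predicted))
print("class prediction count")
for (cls, pred), count in sorted(table.items()):
    print(f"{cls:>5} {pred:>10} {count:>5}")
```

The 0/0 and 1/1 rows are the correct predictions; 0/1 rows are false alarms and 1/0 rows are misses.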
  
  
It’s hard to compare two tables to see how different models compare (due to different datasets or feature choices).

The F1 value is a useful single-number measure for comparison, combining precision and recall (so it reflects true positives, false positives, and false negatives).

Anything over about 0.9 is considered good, but beware very high values (“overfitting”)!
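The F1 value in the training output can be recomputed from the class/prediction table by hand. The sketch below assumes class 1 (malicious) is the positive class; the helper name is made up.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts read from the class/prediction table in the training output:
# class 1 predicted 1 = 9563, class 0 predicted 1 = 15, class 1 predicted 0 = 19.
f1 = f1_from_counts(tp=9563, fp=15, fn=19)
print(f"F1 = {f1:.12f}")
```

Note that the 12428 true negatives never enter the formula, so F1 says nothing about how many normal rows there were.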
  
  
Bonus: Most Influential Features with ‘-v’
Feature ranking:
1. feature user_agent.mac os (0.047058)
2. feature user_agent. os x 1 (0.044084)
3. feature user_agent.; intel (0.042387)
4. feature user_agent.ac os x (0.037192)
5. feature user_agent.os x 10 (0.031616)
[...]
46. feature userAgentEntropy (0.009144)
47. feature subdomainEntropy (0.007699)
48. feature browser_string.browser (0.007263)
49. feature response_body_len (0.006410)
50. feature request_body_len (0.005506)
51. feature domainNameDots (0.005054)
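A ranking in this shape can be produced from the `feature_importances_` attribute of a fitted scikit-learn forest. Whether the ‘-v’ flag does exactly this is an assumption; the data below is invented and only borrows three feature names from the listing above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["userAgentEntropy", "domainNameDots", "response_body_len"]

# Invented data: only the first column actually depends on the class.
y = rng.integers(0, 2, size=400)
X = np.column_stack([
    y * 2.0 + rng.normal(0.0, 0.5, size=400),   # informative
    rng.normal(0.0, 1.0, size=400),             # noise
    rng.normal(0.0, 1.0, size=400),             # noise
])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Feature indices sorted from most to least important.
ranking = np.argsort(clf.feature_importances_)[::-1]

print("Feature ranking:")
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. feature {feature_names[idx]} ({clf.feature_importances_[idx]:.6f})")
```

The importances sum to 1, so each number reads as a share of the forest's total splitting power.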
  
Analyzing Log Files
Callout: Percentage of original file left to review.
% ./analyze_flows.py http-production-2016-05-02.log
Loading HTTP data
Loading trained model
Calculating features
Analyzing
detected 298 anomalies out of 180520 total rows (0.17%)
-----------------------------------------
line 2393
Co7qtw35sGLX6RiG79,80,HEAD,download.virtualbox.org,/virtualbox/5.0.20/
Oracle_VM_VirtualBox_Extension_Pack-5.0.20.vbox-extpack,-,Mozilla/5.0 (AgnosticOS;
Blend) IPRT/64.42,0,0,200,80,Unknown Browser,,,download,virtualbox
-----------------------------------------
line 2394
ChpL1u2Ia64utWrd9j,80,GET,download.virtualbox.org,/virtualbox/5.0.20/
Oracle_VM_VirtualBox_Extension_Pack-5.0.20.vbox-extpack,-,Mozilla/5.0 (AgnosticOS;
Blend) IPRT/64.42,0,16421439,200,80,Unknown Browser,,,download,virtualbox
  
Bonus: Classifier Explanations with ‘-v’
line 431
C9WQArVvgv1BjvJG7,80,GET,apt.spideroak.com,/spideroak_one_rpm/stable/
repodata/repomd.xml,-,PackageKit-hawkey,0,2969,200,80,Unknown
Browser,,,apt,spideroak
Top feature contributions to class 1:
userAgentLength 0.0831734141875
response_body_len 0.0719766424091
domainNameLength 0.056790435921
user_agent.mac os 0.0272829846513
user_agent. os x 1 0.0252803447682
user_agent.os x 10 0.0251306287983
user_agent.ac os x 0.0244848247673
user_agent.; intel 0.0241743906069
user_agent. intel 0.0236921809876
tld.apple 0.020090459858
  
Ideas for improvement
More diverse malware samples

Better filtering for connectivity checks in the malware data

Incrementally retraining the forest (‘warm start’)

Log type “plugins”

K-class classifier
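For the ‘warm start’ idea, scikit-learn forests accept `warm_start=True`: raising `n_estimators` and calling `fit` again keeps the existing trees and grows only the new ones on the data passed in. A minimal sketch, with invented batches and tree counts:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_batch(n):
    """Invented labeled batch: class 1 rows are shifted away from class 0."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0.0, 1.0, size=(n, 3)) + y[:, None] * 3.0
    return X, y

# Initial fit on the first batch of logs.
X_old, y_old = make_batch(300)
clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X_old, y_old)

# Later: keep the 50 existing trees and grow 25 more on the new batch.
X_new, y_new = make_batch(300)
clf.n_estimators += 25
clf.fit(X_new, y_new)

print(len(clf.estimators_))
```

The old trees are never revisited, so this adds knowledge of new traffic without forgetting or correcting what the earlier trees learned.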
  
	
  
	
  
	
  
  
Adapting to other log sources
●  Change log input: clearcut_utils.load_brofile
    ●  Import your data into a pandas data frame
●  Change flow enhancer: flowenhancer.enhance_flow
    ●  Add any columns that might make featurizing easier
●  Change feature generator: featurizer.build_vectorizers
    ●  Make any BOW and BON vectorizers that you want
    ●  Use featurizers to make BOW/BON features
    ●  Add any other features you think might be important
http://www.orwellloghomes.com/greybg.jpg
  
Takeaways
●  Pandas and scikit-learn are highly active python projects that are bringing data science and machine learning tools to the masses
●  Security technologists can (should?) leverage these tools as black or grey boxes
●  Today, implementing ‘standard’ ML algorithms is not the long pole in the tent
●  Snag Clearcut for an example
  
The Sqrrl Threat Hunting Platform
[Diagram labels: SECURITY DATA; NETWORK DATA; ENDPOINT/IDENTITY DATA; Firewall / IDS; Threat Intel; Processes; HR; Bro; SIEM Alerts; Netflow; Proxy; Authentication]
How To Learn More?
Go to sqrrl.com to…
●  Download Sqrrl’s Threat Hunting eBook
●  Download the Sqrrl White Paper on Threat Hunting Platforms
●  Request a Sqrrl Test Drive VM
●  Download Sqrrl’s Product Paper
●  Reach out to us at info@sqrrl.com
  
More Info
Chris McCubbin
Director of Data Science
@_SecretStache_
chris@sqrrl.com

David J. Bianco
Security Technologist
@DavidJBianco
dbianco@sqrrl.com

Clearcut
Machine Learning for Log Review
https://github.com/DavidJBianco/Clearcut
(iforest branch for iforests)

Contenu connexe

Tendances

Leveraging Threat Intelligence to Guide Your Hunts
Leveraging Threat Intelligence to Guide Your HuntsLeveraging Threat Intelligence to Guide Your Hunts
Leveraging Threat Intelligence to Guide Your HuntsSqrrl
 
Reducing Mean Time to Know
Reducing Mean Time to KnowReducing Mean Time to Know
Reducing Mean Time to KnowSqrrl
 
The Art and Science of Alert Triage
The Art and Science of Alert TriageThe Art and Science of Alert Triage
The Art and Science of Alert TriageSqrrl
 
Threat Hunting for Command and Control Activity
Threat Hunting for Command and Control ActivityThreat Hunting for Command and Control Activity
Threat Hunting for Command and Control ActivitySqrrl
 
Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together
Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together
Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together Sqrrl
 
Sqrrl and IBM: Threat Hunting for QRadar Users
Sqrrl and IBM: Threat Hunting for QRadar UsersSqrrl and IBM: Threat Hunting for QRadar Users
Sqrrl and IBM: Threat Hunting for QRadar UsersSqrrl
 
October 2014 Webinar: Cybersecurity Threat Detection
October 2014 Webinar: Cybersecurity Threat DetectionOctober 2014 Webinar: Cybersecurity Threat Detection
October 2014 Webinar: Cybersecurity Threat DetectionSqrrl
 
Sqrrl May Webinar: Data-Centric Security
Sqrrl May Webinar: Data-Centric SecuritySqrrl May Webinar: Data-Centric Security
Sqrrl May Webinar: Data-Centric SecuritySqrrl
 
Sqrrl Overview for Stac Research
Sqrrl Overview for Stac ResearchSqrrl Overview for Stac Research
Sqrrl Overview for Stac ResearchSqrrl
 
Grace Hopper Open Source Day Findings | Thorn & Cloudera Cares
Grace Hopper Open Source Day Findings | Thorn & Cloudera CaresGrace Hopper Open Source Day Findings | Thorn & Cloudera Cares
Grace Hopper Open Source Day Findings | Thorn & Cloudera CaresCloudera, Inc.
 
Sqrrl Enterprise: Big Data Security Analytics Use Case
Sqrrl Enterprise: Big Data Security Analytics Use CaseSqrrl Enterprise: Big Data Security Analytics Use Case
Sqrrl Enterprise: Big Data Security Analytics Use CaseSqrrl
 
Jisheng Wang at AI Frontiers: Deep Learning in Security
Jisheng Wang at AI Frontiers: Deep Learning in SecurityJisheng Wang at AI Frontiers: Deep Learning in Security
Jisheng Wang at AI Frontiers: Deep Learning in SecurityAI Frontiers
 
Sqrrl February Webinar: Breaking Down Data Silos
Sqrrl February Webinar: Breaking Down Data SilosSqrrl February Webinar: Breaking Down Data Silos
Sqrrl February Webinar: Breaking Down Data SilosSqrrl
 
Security Insights at Scale
Security Insights at ScaleSecurity Insights at Scale
Security Insights at ScaleRaffael Marty
 
Building a Successful Threat Hunting Program
Building a Successful Threat Hunting ProgramBuilding a Successful Threat Hunting Program
Building a Successful Threat Hunting ProgramCarl C. Manion
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Imperative Induced Innovation - Patrick W. Dowd, Ph. D
Imperative Induced Innovation - Patrick W. Dowd, Ph. DImperative Induced Innovation - Patrick W. Dowd, Ph. D
Imperative Induced Innovation - Patrick W. Dowd, Ph. Dscoopnewsgroup
 
Visualization in the Age of Big Data
Visualization in the Age of Big DataVisualization in the Age of Big Data
Visualization in the Age of Big DataRaffael Marty
 
Cyber Threat Hunting with Phirelight
Cyber Threat Hunting with PhirelightCyber Threat Hunting with Phirelight
Cyber Threat Hunting with PhirelightHostway|HOSTING
 
Big Data Analytics to Enhance Security
Big Data Analytics to Enhance SecurityBig Data Analytics to Enhance Security
Big Data Analytics to Enhance SecurityData Science Thailand
 

Tendances (20)

Leveraging Threat Intelligence to Guide Your Hunts
Leveraging Threat Intelligence to Guide Your HuntsLeveraging Threat Intelligence to Guide Your Hunts
Leveraging Threat Intelligence to Guide Your Hunts
 
Reducing Mean Time to Know
Reducing Mean Time to KnowReducing Mean Time to Know
Reducing Mean Time to Know
 
The Art and Science of Alert Triage
The Art and Science of Alert TriageThe Art and Science of Alert Triage
The Art and Science of Alert Triage
 
Threat Hunting for Command and Control Activity
Threat Hunting for Command and Control ActivityThreat Hunting for Command and Control Activity
Threat Hunting for Command and Control Activity
 
Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together
Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together
Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together
 
Sqrrl and IBM: Threat Hunting for QRadar Users
Sqrrl and IBM: Threat Hunting for QRadar UsersSqrrl and IBM: Threat Hunting for QRadar Users
Sqrrl and IBM: Threat Hunting for QRadar Users
 
October 2014 Webinar: Cybersecurity Threat Detection
October 2014 Webinar: Cybersecurity Threat DetectionOctober 2014 Webinar: Cybersecurity Threat Detection
October 2014 Webinar: Cybersecurity Threat Detection
 
Sqrrl May Webinar: Data-Centric Security
Sqrrl May Webinar: Data-Centric SecuritySqrrl May Webinar: Data-Centric Security
Sqrrl May Webinar: Data-Centric Security
 
Sqrrl Overview for Stac Research
Sqrrl Overview for Stac ResearchSqrrl Overview for Stac Research
Sqrrl Overview for Stac Research
 
Grace Hopper Open Source Day Findings | Thorn & Cloudera Cares
Grace Hopper Open Source Day Findings | Thorn & Cloudera CaresGrace Hopper Open Source Day Findings | Thorn & Cloudera Cares
Grace Hopper Open Source Day Findings | Thorn & Cloudera Cares
 
Sqrrl Enterprise: Big Data Security Analytics Use Case
Sqrrl Enterprise: Big Data Security Analytics Use CaseSqrrl Enterprise: Big Data Security Analytics Use Case
Sqrrl Enterprise: Big Data Security Analytics Use Case
 
Jisheng Wang at AI Frontiers: Deep Learning in Security
Jisheng Wang at AI Frontiers: Deep Learning in SecurityJisheng Wang at AI Frontiers: Deep Learning in Security
Jisheng Wang at AI Frontiers: Deep Learning in Security
 
Sqrrl February Webinar: Breaking Down Data Silos
Sqrrl February Webinar: Breaking Down Data SilosSqrrl February Webinar: Breaking Down Data Silos
Sqrrl February Webinar: Breaking Down Data Silos
 
Security Insights at Scale
Security Insights at ScaleSecurity Insights at Scale
Security Insights at Scale
 
Building a Successful Threat Hunting Program
Building a Successful Threat Hunting ProgramBuilding a Successful Threat Hunting Program
Building a Successful Threat Hunting Program
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Imperative Induced Innovation - Patrick W. Dowd, Ph. D
Imperative Induced Innovation - Patrick W. Dowd, Ph. DImperative Induced Innovation - Patrick W. Dowd, Ph. D
Imperative Induced Innovation - Patrick W. Dowd, Ph. D
 
Visualization in the Age of Big Data
Visualization in the Age of Big DataVisualization in the Age of Big Data
Visualization in the Age of Big Data
 
Cyber Threat Hunting with Phirelight
Cyber Threat Hunting with PhirelightCyber Threat Hunting with Phirelight
Cyber Threat Hunting with Phirelight
 
Big Data Analytics to Enhance Security
Big Data Analytics to Enhance SecurityBig Data Analytics to Enhance Security
Big Data Analytics to Enhance Security
 

Similaire à Machine Learning for Incident Detection: Getting Started

Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9Roger Barga
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learningShareDocView.com
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learningPramit Choudhary
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-stepsShesha R
 
introduction to machine learning
introduction to machine learningintroduction to machine learning
introduction to machine learningJohnson Ubah
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
Analysis using r
Analysis using rAnalysis using r
Analysis using rPriya Mohan
 
Machine learning: A Walk Through School Exams
Machine learning: A Walk Through School ExamsMachine learning: A Walk Through School Exams
Machine learning: A Walk Through School ExamsRamsha Ijaz
 
Barga Data Science lecture 4
Barga Data Science lecture 4Barga Data Science lecture 4
Barga Data Science lecture 4Roger Barga
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
The Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistThe Incredible Disappearing Data Scientist
The Incredible Disappearing Data ScientistRebecca Bilbro
 
Machine learning for sensor Data Analytics
Machine learning for sensor Data AnalyticsMachine learning for sensor Data Analytics
Machine learning for sensor Data AnalyticsMATLABISRAEL
 
detailed Presentation on supervised learning
 detailed Presentation on supervised learning detailed Presentation on supervised learning
detailed Presentation on supervised learningZAMANCHBWN
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningKai Koenig
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Machine Learning for Incident Detection: Getting Started

  • 1. Getting Started with Machine Learning for Incident Detection
      August 2016 | Target. Hunt. Disrupt.
      Chris McCubbin, Director of Data Science, Sqrrl
      David J. Bianco, Security Technologist, Sqrrl
  • 2. A story we all know: Regular expressions
      “Good theory leads to good programs”
      Who here has implemented and optimized a nondeterministic finite automaton (NFA) compiler?
      You probably use one every day
      Regex: grep, Perl
      You don’t care how it works inside
      But you might need to know some quirks
      Regex can’t count (search for “regex HTML” on Stack Overflow)
      Grep has no ‘bad cases’
      Perl is more powerful (lazy matching, backreferences)
      But it is helpful to know what it’s good for, how to use it, etc.
      © 2016 Sqrrl Data, Inc. All rights reserved.
  • 3. Agenda
      What is Machine Learning (ML) good at?
      How does ML work? What are the quirks of useful Machine Learning techniques?
      Can I use Machine Learning easily?
      How can you customize & improve our examples?
  • 4. When’s the last time you heard…?
      “It’s a Best Practice to review your logs every day.”
  • 5. Machine-Assisted Analysis: Practical Cyborgism for Security Operations
      COMPUTERS
          Bad at context and understanding
          Good at repetition and drudgery
          Algorithms work cheap!
      PEOPLE
          Contextual analysis experts who love patterns
          Possess curiosity & intuition
          Business knowledge
      EMPOWERED ANALYSTS
          Good results from massive amounts of data
          Agile investigations
          Quickly turn questions into insight
  • 6. Problem Statement: HTTP Proxy Logs
  • 7. Our solution: Clearcut!
  • 8. Two different types of machine learning
      Supervised
          Have labeled training data?
          Classification algorithms
          Random Forests
      Unsupervised
          No labeled training data
          Assume attacks are rare
          Outlier Detection
          Isolation Forests
          Clustering
  • 9. Supervised: Binary Classification
      Given a population of two types of “things”, can I find a function that separates them into two classes?
      Maybe it’s a line, maybe it’s not.
      Nothing’s perfect, but how close can we get?
      If we derive a function that does reasonably well at separating the two classes, that’s our binary classifier!
      Fortunately, Python has pantsloads of libraries that can do this for us. The machine can learn the function given enough samples of each class.
  • 10. Classification With Random Forests
      1. Identify positive and negative sample datasets
      2. Clean & normalize the data
      3. Partition the data into training & testing datasets
      4. Select & compute some interesting features
      5. Train a model
      6. Test the model
      7. Evaluate the results
      8. .
  • 11. Generating synthetic abnormal data
      Perhaps we don’t have any malware data, but we have normal data.
      If we could make some synthetic abnormal data, we could still use the same methods
      One-class classification
      How should we create the data?
      One option: ‘Noise-contrastive estimation’: generate noise data that looks real-ish but has no real structure, and contrast it with the normal data
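One way to picture the noise idea is the sketch below. It assumes a specific noise scheme that the slide does not spell out: shuffling each column independently, so every column keeps its real value distribution while the relationships between columns disappear. The helper name `make_noise_rows` and the toy data are hypothetical, and this is not necessarily how Clearcut generates its noise.

```python
import numpy as np

def make_noise_rows(normal, rng):
    """Shuffle every column independently (an assumed noise scheme).

    Each column keeps its real values, but the links between columns are
    destroyed, so rows look real-ish yet carry no structure.
    """
    noise = np.empty_like(normal)
    for col in range(normal.shape[1]):
        noise[:, col] = rng.permutation(normal[:, col])
    return noise

rng = np.random.default_rng(0)

# Toy "normal" data: the second column is tightly tied to the first.
x = rng.normal(size=500)
normal = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=500)])
noise = make_noise_rows(normal, rng)

# Label normal rows 0 and synthetic rows 1, ready for a binary classifier.
features = np.vstack([normal, noise])
labels = np.concatenate([np.zeros(len(normal)), np.ones(len(noise))])

corr_normal = np.corrcoef(normal[:, 0], normal[:, 1])[0, 1]
corr_noise = np.corrcoef(noise[:, 0], noise[:, 1])[0, 1]
print(f"correlation in normal data: {corr_normal:.3f}")
print(f"correlation in noise data:  {corr_noise:.3f}")
```

A classifier trained on `features` and `labels` then has to learn the structure of normal traffic in order to tell it apart from the shuffled rows.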
  • 12. Decision Trees
      Greedily grow the tree by choosing the feature that best explains the class
      Split the training set into two sets, repeat
      Form a classifier by “walking down the tree”
      Issue: overfitting
  • 13. Random Forests
      Sample training set with replacement
      Fit a decision tree to the sample
      Repeat n times
      Form a classifier by averaging the n decision trees
      http://www.rhaensch.de/vrf.html
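The four steps on this slide can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration on made-up toy data, not Clearcut's code; it covers only the resampling-and-averaging part, whereas a full random forest also restricts the features considered at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy two-class data: class 1 is shifted away from class 0.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(3.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Sample the training set with replacement (a bootstrap sample).
    idx = rng.integers(0, len(X), size=len(X))
    # Fit one decision tree to that sample.
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

def forest_predict(points):
    # Average the per-tree class-1 probabilities, then threshold at 0.5.
    votes = np.mean([t.predict_proba(points)[:, 1] for t in trees], axis=0)
    return (votes >= 0.5).astype(int)

accuracy = np.mean(forest_predict(X) == y)
print(f"training accuracy of the averaged trees: {accuracy:.3f}")
```

In practice `sklearn.ensemble.RandomForestClassifier` does all of this in one call.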
  • 14. Unsupervised: Outlier Detection
      Given a population of “things”, can I find a function that tells me which ones look weird?
      Can also pretend to be a classifier (class 0 = normal, class 1 = weird)
      Loads of ways to accomplish this: distance to your neighbors, angle-based methods, isolation-based methods
  • 15. Isolation Forests [Liu, Ting, Zhou]
      http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
      Pick a dimension at random. Pick a value at random.
      Make a tree by splitting the set into two sets, repeat. Stop when the set is a single point.
      Do this for many trees.
      Form an outlier detector from the average depth at which a point is isolated across the trees (deeper is more inlier-y)
      Issue: enumerated types
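The isolation idea is small enough to sketch directly in NumPy. The version below is a simplified illustration rather than the published algorithm: it follows only the branch containing the point of interest, and it compares raw average depths instead of the normalized anomaly score and subsampling used by real implementations. The function names and toy data are hypothetical.

```python
import numpy as np

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Depth at which `point` ends up alone in one random tree."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.integers(data.shape[1])           # pick a dimension at random
    lo, hi = data[:, dim].min(), data[:, dim].max()
    if lo == hi:                                # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                 # pick a value at random
    # Keep only the side of the split that contains the point.
    if point[dim] < split:
        side = data[data[:, dim] < split]
    else:
        side = data[data[:, dim] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth of `point` over many random trees."""
    rng = np.random.default_rng(seed)
    return np.mean([isolation_depth(point, data, rng) for _ in range(n_trees)])

rng = np.random.default_rng(1)
cluster = rng.normal(0.0, 1.0, size=(254, 2))
data = np.vstack([cluster, [[0.0, 0.0]], [[8.0, 8.0]]])

inlier_depth = average_depth(data[-2], data)    # point in the middle of the cluster
outlier_depth = average_depth(data[-1], data)   # point far from everything
print(f"inlier depth:  {inlier_depth:.1f}")
print(f"outlier depth: {outlier_depth:.1f}")
```

The far-away point is cut off after only a few random splits, while the central point needs many more, which is the "deeper is more inlier-y" rule from the slide.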
  • 16. A quick note about parameters
      Choosing parameters can be important
      Can use expert knowledge or ad-hoc methods
      Dimitar Karev (MIT RSI Intern) tested a range of parameters for Clearcut iforests using exhaustive search (for forest params) and a genetic algorithm (for features)
      Result was a huge improvement in F1 (see ROC curves)
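An exhaustive search over forest parameters might look like the sketch below. The parameter ranges, the toy data, and scoring on the same rows used for fitting are all assumptions made for illustration; they are not the values or procedure used in the Clearcut experiments, and the genetic search over features is not shown.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid

rng = np.random.default_rng(0)

# Toy data: a normal cluster plus a few scattered points labeled as attacks.
X = np.vstack([rng.normal(0, 1, size=(300, 2)),
               rng.uniform(-8, 8, size=(15, 2))])
y = np.array([0] * 300 + [1] * 15)

# Assumed search space; every combination is tried (exhaustive search).
grid = ParameterGrid({
    "n_estimators": [25, 100],
    "max_samples": [64, 256],
    "contamination": [0.02, 0.05],
})

results = []
for params in grid:
    model = IsolationForest(random_state=0, **params).fit(X)
    # predict() gives -1 for outliers and 1 for inliers; map to 1/0.
    pred = np.where(model.predict(X) == -1, 1, 0)
    results.append((f1_score(y, pred), params))

best_f1, best_params = max(results, key=lambda r: r[0])
print(f"best F1 = {best_f1:.3f} with {best_params}")
```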
  • 17. Classification With Isolation Forests
      1. Identify positive and negative sample datasets
      2. Clean & normalize the data
      3. Partition the data into training & testing datasets
      4. Select & compute some interesting features
      5. Train a model
      6. Test the model
      7. Evaluate the results
      8.
      9. Notice similarities
  • 18. The beauty of scikit-learn & Python
      Gists to perform many types of learning are simple and consistent
      Take the same data as input (supervised requires an extra column)
      Signatures of methods are the same
      Example: RFs vs iForests
          Changed a few lines of code for training
          Classes are a bit different (0/1 vs 1/-1)
          Can re-use the analysis script with nearly no change

      import pandas as pd
      from sklearn.ensemble import IsolationForest, RandomForestClassifier

      # opts, train, test and testnoclass come from the rest of the script (not shown on the slide)

      # RF: supervised, so the 'class' column is passed as the target
      clf = RandomForestClassifier(n_jobs=4, n_estimators=opts.numtrees, oob_score=True)
      y, _ = pd.factorize(train['class'])
      clf.fit(train.drop('class', axis=1), y)
      test['prediction'] = clf.predict(testnoclass)

      # iF: unsupervised, so only the feature columns are used
      clf = IsolationForest(n_estimators=opts.numtrees)
      clf.fit(train.drop('class', axis=1))
      test['prediction'] = clf.predict(testnoclass)
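The slide's fragment depends on variables from the surrounding script, so here is a self-contained sketch of the same comparison on made-up data. The column names and toy values are hypothetical. The point it illustrates is the class-label difference: `IsolationForest.predict` returns 1 for inliers and -1 for outliers, which has to be mapped onto the 0/1 convention before the same analysis code can handle both models.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data frame: 300 normal rows (class 0) and 15 odd rows (class 1).
normal = pd.DataFrame(rng.normal(0, 1, size=(300, 3)), columns=list("abc"))
normal["class"] = 0
weird = pd.DataFrame(rng.normal(6, 1, size=(15, 3)), columns=list("abc"))
weird["class"] = 1
data = pd.concat([normal, weird], ignore_index=True)
features = data.drop("class", axis=1)

# Supervised: the extra 'class' column is passed as the target.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(features, data["class"])
rf_pred = rf.predict(features)

# Unsupervised: same features, no labels.
iforest = IsolationForest(n_estimators=50, random_state=0)
iforest.fit(features)
raw_pred = iforest.predict(features)          # values are 1 (inlier) or -1 (outlier)

# Map 1/-1 onto 0/1 so downstream code sees one convention.
if_pred = np.where(raw_pred == -1, 1, 0)
```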
  • 19. Identifying Training & Test Data
      [Diagram labels: Malicious Data; Label = normal; All Labeled Data; Training Data; Test Data]
  • 20. Feature extraction
      Many classifiers want to work with numeric features. We use a ‘flow enhancing’ step to add some convenience columns to the data
      Some columns are already numeric
      Some columns have easy-to-extract numeric info: number of dots in URL, entropy in TLD
      Categorical columns can be converted to “Bag of words” (BOW): N binary features, one for each category
      Text-y columns can be converted to BOW or Bag-of-Ngrams (BON)
      Use TF-IDF to determine which features to keep
      [Figure: sample text (“The quick brown fox…”) broken into n-grams, each mapped to a 0/1 feature value]
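The feature types listed on this slide can be sketched as follows. The helper `shannon_entropy` and the sample values are hypothetical, and scikit-learn's `CountVectorizer` stands in for whatever vectorizers Clearcut builds; the TF-IDF selection step is not shown.

```python
import math
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer

def shannon_entropy(text):
    """Shannon entropy of the characters in `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

# Easy-to-extract numeric info from a text column.
domains = ["download.virtualbox.org", "apt.spideroak.com"]
dots = [d.count(".") for d in domains]
entropies = [shannon_entropy(d) for d in domains]

# Bag of words for a categorical column: one binary feature per category.
methods = ["GET", "HEAD", "GET"]
bow = CountVectorizer(binary=True, lowercase=False, token_pattern=r"\S+")
bow_matrix = bow.fit_transform(methods)

# Bag of n-grams for a text-y column: binary character 3-gram features.
bon = CountVectorizer(analyzer="char", ngram_range=(3, 3), binary=True)
bon_matrix = bon.fit_transform(domains)

print("dots:", dots)
print("BOW vocabulary:", sorted(bow.vocabulary_))
print("number of 3-gram features:", bon_matrix.shape[1])
```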
  • 21. Training, Testing & Evaluating a Model
      % ./train_flows_rf.py -o data/http-malware.log http-training.log
      Reading normal training data
      Reading malicious training data
      Building Vectorizers
      Training
      Predicting (class 0 is normal, class 1 is malicious)
      class  prediction
      0      0             12428
             1                15
      1      0                19
             1              9563
      dtype: int64
      F1 = 0.998225469729
  • 22. Training, Testing & Evaluating a Model (same terminal output as slide 21)
      Read the Bro data files into a Pandas data frame.
      Each row is labeled either ‘benign’ or ‘malicious’.
  • 23. Training, Testing & Evaluating a Model (same terminal output as slide 21)
      Random Forest requires numeric data, so we have to convert strings.
      Primarily two methods:
      ● Bag of Words (method, status code)
      ● Bag of N-Grams (domain, user agent)
  • 24. Training, Testing & Evaluating a Model (same terminal output as slide 21)
      Split all the labeled data into ‘training’ (80%) and ‘test’ (20%) datasets.
      Now feed all the training data through the Random Forest to produce a trained model.
      At this point, we do nothing with the test data.
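An 80/20 split followed by training on the 80% can be sketched with scikit-learn's `train_test_split`. The toy data and the use of stratification are assumptions for illustration; the slide does not say how Clearcut performs its split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy labeled data: 500 rows per class, four numeric features.
X = np.vstack([rng.normal(0, 1, size=(500, 4)),
               rng.normal(2, 1, size=(500, 4))])
y = np.array([0] * 500 + [1] * 500)

# Hold back 20% of the labeled rows; stratify keeps the class mix the same.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)       # the test rows are not touched here
```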
  • 25. Training, Testing & Evaluating a Model (same terminal output as slide 21)
      Now we run the ‘test’ data through the trained model. It’s still labeled, so we know what the answer should be.
      We compare the expected results with the actual prediction and create a little table.
      We don’t expect perfect results, but we’d like to see most of the data in the 0/0 and 1/1 rows.
  • 26. Training, Testing & Evaluating a Model (same terminal output as slide 21)
      It’s hard to compare two tables to judge how different models perform (due to different datasets or feature choices).
      The F1 value is a useful single-number measure for comparison, combining precision and recall.
      Anything over about 0.9 is considered good, but beware very high values (“overfitting”)!
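The F1 value in the output can be reproduced from the four counts in the class/prediction table. The short calculation below uses the standard definition of F1 as the harmonic mean of precision and recall, treating class 1 (malicious) as the positive class.

```python
# Counts taken from the class/prediction table in the slide output.
true_negatives = 12428   # class 0, predicted 0
false_positives = 15     # class 0, predicted 1
false_negatives = 19     # class 1, predicted 0
true_positives = 9563    # class 1, predicted 1

# Precision: how many flagged rows were really malicious.
precision = true_positives / (true_positives + false_positives)
# Recall: how many malicious rows were flagged.
recall = true_positives / (true_positives + false_negatives)
# F1: harmonic mean of the two.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.6f} recall={recall:.6f} F1={f1:.6f}")
```

The result agrees with the `F1 = 0.998225469729` line printed by the training script.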
  • 27. Bonus: Most Influential Features with ‘-v’
      Feature ranking:
      1. feature user_agent.mac os (0.047058)
      2. feature user_agent. os x 1 (0.044084)
      3. feature user_agent.; intel (0.042387)
      4. feature user_agent.ac os x (0.037192)
      5. feature user_agent.os x 10 (0.031616)
      [...]
      46. feature userAgentEntropy (0.009144)
      47. feature subdomainEntropy (0.007699)
      48. feature browser_string.browser (0.007263)
      49. feature response_body_len (0.006410)
      50. feature request_body_len (0.005506)
      51. feature domainNameDots (0.005054)
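A ranking in this style can be produced from the `feature_importances_` attribute of a fitted scikit-learn random forest. The sketch below uses made-up data and hypothetical feature names; whether Clearcut's `-v` option computes its ranking exactly this way is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

names = ["signal", "noise_a", "noise_b"]    # hypothetical feature names
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)               # only 'signal' determines the class

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ sums to 1; sort descending to get a ranking.
importances = clf.feature_importances_
order = np.argsort(importances)[::-1]

print("Feature ranking:")
for rank, i in enumerate(order, start=1):
    print(f"{rank}. feature {names[i]} ({importances[i]:.6f})")
```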
  • 28. Analyzing Log Files
      % ./analyze_flows.py http-production-2016-05-02.log
      Loading HTTP data
      Loading trained model
      Calculating features
      Analyzing
      detected 298 anomalies out of 180520 total rows (0.17%)
      -----------------------------------------
      line 2393
      Co7qtw35sGLX6RiG79,80,HEAD,download.virtualbox.org,/virtualbox/5.0.20/Oracle_VM_VirtualBox_Extension_Pack-5.0.20.vbox-extpack,-,Mozilla/5.0 (AgnosticOS; Blend) IPRT/64.42,0,0,200,80,Unknown Browser,,,download,virtualbox
      -----------------------------------------
      line 2394
      ChpL1u2Ia64utWrd9j,80,GET,download.virtualbox.org,/virtualbox/5.0.20/Oracle_VM_VirtualBox_Extension_Pack-5.0.20.vbox-extpack,-,Mozilla/5.0 (AgnosticOS; Blend) IPRT/64.42,0,16421439,200,80,Unknown Browser,,,download,virtualbox
      Callout on “(0.17%)”: percentage of original file left to review.
  • 29. Bonus: Classifier Explanations with ‘-v’
      line 431
      C9WQArVvgv1BjvJG7,80,GET,apt.spideroak.com,/spideroak_one_rpm/stable/repodata/repomd.xml,-,PackageKit-hawkey,0,2969,200,80,Unknown Browser,,,apt,spideroak
      Top feature contributions to class 1:
      userAgentLength 0.0831734141875
      response_body_len 0.0719766424091
      domainNameLength 0.056790435921
      user_agent.mac os 0.0272829846513
      user_agent. os x 1 0.0252803447682
      user_agent.os x 10 0.0251306287983
      user_agent.ac os x 0.0244848247673
      user_agent.; intel 0.0241743906069
      user_agent. intel 0.0236921809876
      tld.apple 0.020090459858
  • 30. Ideas for improvement
      More diverse malware samples
      Better filtering for connectivity checks in the malware data
      Incrementally retraining the forest (‘warm start’)
      Log type “plugins”
      K-class classifier
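The ‘warm start’ idea maps onto the `warm_start` option of scikit-learn's `RandomForestClassifier`. The sketch below, on made-up data, shows the mechanics: raising `n_estimators` and calling `fit` again keeps the existing trees unchanged and trains only the additional ones on the new data. Whether this is the retraining scheme the authors had in mind is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Two batches of toy data, e.g. last week's logs and this week's logs.
X_old = rng.normal(size=(200, 3))
y_old = (X_old[:, 0] > 0).astype(int)
X_new = rng.normal(size=(200, 3))
y_new = (X_new[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=20, warm_start=True, random_state=0)
clf.fit(X_old, y_old)
trees_before = len(clf.estimators_)

# Ask for 10 more trees; with warm_start=True only those are fitted,
# and they are trained on the new batch.
clf.set_params(n_estimators=30)
clf.fit(X_new, y_new)
trees_after = len(clf.estimators_)

print(f"trees before: {trees_before}, trees after: {trees_after}")
```

Both batches need to contain the same set of classes for this to work cleanly.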
  • 31. Adapting to other log sources
      Change log input: clearcut_utils.load_brofile
          Import your data into a pandas data frame
      Change flow enhancer: flowenhancer.enhance_flow
          Add any columns that might make featurizing easier
      Change feature generator: featurizer.build_vectorizers
          Make any BOW and BON vectorizers that you want
          Use featurizers to make BOW/BON features
          Add any other features you think might be important
      http://www.orwellloghomes.com/greybg.jpg
  • 32. Takeaways
      Pandas and scikit-learn are highly active Python projects that are bringing data science and machine learning tools to the masses
      Security technologists can (should?) leverage these tools as black or grey boxes
      Today, implementing ‘standard’ ML algorithms is not the long pole in the tent
      Snag Clearcut for an example
  • 33. The Sqrrl Threat Hunting Platform
      [Diagram: data categories SECURITY DATA, NETWORK DATA, ENDPOINT/IDENTITY DATA; sources shown: Firewall / IDS, Threat Intel, Processes, HR, Bro, SIEM Alerts, Netflow, Proxy, Authentication]
      How To Learn More?
      Go to sqrrl.com to…
          Download Sqrrl’s Threat Hunting eBook
          Download the Sqrrl White Paper on Threat Hunting Platforms
          Request a Sqrrl Test Drive VM
          Download Sqrrl’s Product Paper
      Reach out to us at info@sqrrl.com
  • 34. More Info
      Chris McCubbin, Director of Data Science
          @_SecretStache_
          chris@sqrrl.com
      David J. Bianco, Security Technologist
          @DavidJBianco
          dbianco@sqrrl.com
      Clearcut: Machine Learning for Log Review
          https://github.com/DavidJBianco/Clearcut (iforest branch for iforests)