Whois
● Antonio Costa – Cooler
● Just another systems analyst
● GitHub: CoolerVoid
● https://github.com/CoolerVoid
Contact: acosta@conviso.com.br
coolerlair@gmail.com
How it works
● Anti-spam: the common way
● Get e-mails via POP3 / IMAP ...
● Validate
● Clean and tokenize the text
● BoW (bag-of-words), SoW (set-of-words) ...
● tf–idf (term frequency–inverse document frequency) ...
● Supervised learning
● Classification (SVM, KNN, NB, random forest ...)
How it works
● Anti-spam: the common way
● Get e-mails via POP3 / IMAP
● Validate
– Country-based filtering
– DNS-based blacklists
– Enforcing RFC standards
– SMTP callback verification
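As an illustration of the DNS-based blacklist check listed above, here is a minimal sketch of how a lookup name is usually built: the IPv4 octets of the sending host are reversed and the blacklist zone is appended. The zone `dnsbl.example.org` and the function names are assumptions made for this example, and the DNS query itself is left out.

```cpp
#include <cstddef>
#include <string>

// Parse a dotted-quad IPv4 address into four octets.
// Returns false when the text is not a valid address.
bool parse_ipv4(const std::string& text, int octets[4]) {
    std::size_t pos = 0;
    for (int i = 0; i < 4; ++i) {
        int value = 0;
        int digits = 0;
        while (pos < text.size() && text[pos] >= '0' && text[pos] <= '9' && digits < 3) {
            value = value * 10 + (text[pos] - '0');
            ++pos;
            ++digits;
        }
        if (digits == 0 || value > 255) return false;
        octets[i] = value;
        if (i < 3) {
            // Every octet except the last must be followed by a dot.
            if (pos >= text.size() || text[pos] != '.') return false;
            ++pos;
        }
    }
    return pos == text.size();
}

// Build the name to query in a DNS blacklist zone (hypothetical helper).
// Example: 192.0.2.99 in zone dnsbl.example.org -> 99.2.0.192.dnsbl.example.org
// Returns an empty string when the address is invalid.
std::string dnsbl_query_name(const std::string& ip, const std::string& zone) {
    int octets[4];
    if (!parse_ipv4(ip, octets)) return "";
    std::string name;
    for (int i = 3; i >= 0; --i) {
        name += std::to_string(octets[i]);
        name += '.';
    }
    return name + zone;
}
```

If the resulting name resolves to an address record, the sender is listed in that zone; a real filter would issue the query through a resolver library.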
How it works
● Anti-spam: the common way
● Get e-mails via POP3 / IMAP ... (the input string)
● Validate
● Clean and tokenize the text
● BoW (bag-of-words), SoW (set-of-words), tf–idf (term frequency–inverse document frequency) ... (create the matrix)
● Supervised learning (using the matrix)
● Classification (SVM, KNN, NB, random forest ...)
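The "create the matrix" step can be sketched for tf–idf as follows. This uses one common variant, tf = count / document length and idf = ln(N / df); the slides do not say which variant the project uses, so the formula and the names `Document` and `tf_idf` are assumptions for illustration only.

```cpp
#include <cmath>
#include <map>
#include <set>
#include <string>
#include <vector>

using Document = std::vector<std::string>;  // an already tokenized e-mail

// Returns one row per document, mapping each term to its tf-idf weight.
std::vector<std::map<std::string, double>> tf_idf(const std::vector<Document>& docs) {
    // Document frequency: in how many documents each term appears.
    std::map<std::string, int> df;
    for (const Document& doc : docs) {
        std::set<std::string> unique(doc.begin(), doc.end());
        for (const std::string& term : unique) ++df[term];
    }

    std::vector<std::map<std::string, double>> weights;
    for (const Document& doc : docs) {
        // Raw term counts for this document.
        std::map<std::string, int> counts;
        for (const std::string& term : doc) ++counts[term];

        std::map<std::string, double> row;
        for (const auto& entry : counts) {
            double tf = static_cast<double>(entry.second) / doc.size();
            double idf = std::log(static_cast<double>(docs.size()) / df[entry.first]);
            row[entry.first] = tf * idf;
        }
        weights.push_back(row);
    }
    return weights;
}
```

A term that appears in every document gets weight 0 under this variant, which is why words common to all e-mails contribute nothing to the classifier input.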
Bag-of-words
[1] - "Luan likes to make hacking. Josimar likes to make hacking too."
[2] - "Luan also likes to web hacking."
● Create an array of words (tokenize ...)
{ "Luan", "likes", "to", "make", "hacking", "Josimar", "too", "also", "web" } Total of 9 elements
● Count the number of appearances of each word
[1] - { 1, 2, 2, 2, 2, 1, 1, 0, 0 }
[2] - { 1, 1, 1, 0, 1, 0, 0, 1, 1 }
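The two steps above (build the word array, then count appearances) can be sketched like this. The tokenizer simply splits on anything that is not a letter or digit and keeps case as-is; that rule and the names `tokenize`, `BagOfWords` and `bag_of_words` are assumptions for this example, not the project's code.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Split text into words, dropping punctuation and whitespace.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current += static_cast<char>(c);
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

struct BagOfWords {
    std::vector<std::string> vocabulary;   // words in first-seen order
    std::vector<std::vector<int>> counts;  // one row of counts per document
};

BagOfWords bag_of_words(const std::vector<std::string>& documents) {
    BagOfWords bow;
    std::map<std::string, std::size_t> column;  // word -> column index
    std::vector<std::vector<std::string>> tokenized;

    // First pass: collect the vocabulary across all documents.
    for (const std::string& doc : documents) {
        tokenized.push_back(tokenize(doc));
        for (const std::string& word : tokenized.back()) {
            if (column.find(word) == column.end()) {
                column[word] = bow.vocabulary.size();
                bow.vocabulary.push_back(word);
            }
        }
    }

    // Second pass: count how often each vocabulary word appears per document.
    for (const std::vector<std::string>& words : tokenized) {
        std::vector<int> row(bow.vocabulary.size(), 0);
        for (const std::string& word : words) ++row[column[word]];
        bow.counts.push_back(row);
    }
    return bow;
}
```

Called on the two example sentences, this yields the 9-word vocabulary and the two count vectors shown on the slide.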
The common way
Why naive Bayes?
● In my tests:
Classifier  Accuracy  Performance
SVM         92%       Medium
NB          94%       Fast
KNN         96%       Slow
Super simple: you're just doing a bunch of counts. Naive Bayes is an eager learning classifier and it is much faster than KNN. Nowadays it can be used for real-time prediction.
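To show why training is "just a bunch of counts", here is a minimal multinomial naive Bayes sketch with add-one (Laplace) smoothing. The smoothing choice, the class name `NaiveBayes` and its methods are assumptions for illustration; the slides do not describe the project's exact variant.

```cpp
#include <cmath>
#include <map>
#include <set>
#include <string>
#include <vector>

class NaiveBayes {
public:
    // Training only updates counters: documents and word occurrences per class.
    void train(const std::vector<std::string>& words, const std::string& label) {
        ++doc_count_[label];
        ++total_docs_;
        std::map<std::string, int>& counts = word_count_[label];
        int& total = total_words_[label];
        for (const std::string& w : words) {
            ++counts[w];
            ++total;
            vocabulary_.insert(w);
        }
    }

    // Returns the label with the highest log-posterior, or "" if untrained.
    std::string classify(const std::vector<std::string>& words) const {
        std::string best;
        double best_score = 0.0;
        bool first = true;
        for (const auto& entry : doc_count_) {
            const std::string& label = entry.first;
            // Log prior: share of training documents with this label.
            double score = std::log(static_cast<double>(entry.second) / total_docs_);
            const std::map<std::string, int>& counts = word_count_.at(label);
            double denom = total_words_.at(label) + static_cast<double>(vocabulary_.size());
            // Log likelihood of each word, with add-one smoothing.
            for (const std::string& w : words) {
                auto it = counts.find(w);
                double c = (it == counts.end()) ? 0.0 : it->second;
                score += std::log((c + 1.0) / denom);
            }
            if (first || score > best_score) {
                best = label;
                best_score = score;
                first = false;
            }
        }
        return best;
    }

private:
    std::map<std::string, int> doc_count_;
    std::map<std::string, std::map<std::string, int>> word_count_;
    std::map<std::string, int> total_words_;
    std::set<std::string> vocabulary_;
    int total_docs_ = 0;
};
```

Classification is a handful of map lookups and additions per word, which fits the slide's point about speed; KNN, by contrast, compares each new message against the stored training set.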
My way
Automata as match rules
● Gain accuracy!
● Gain performance!
● Because a rule can flag a message as spam before the classifier is used
● www.site.com/www.bank.com/
● URL/malware.exe: rule like URL/[a-zA-Z]*\.exe ...
● Rule to detect an IP address in a URL
● Deterministic finite automaton (DFA) to do the detection
● Use ranking!
Classifier NB: accuracy 94% +4%, performance Fast
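The idea of matching rules before the classifier runs can be sketched with `std::regex`. The three patterns below are loose guesses modelled on the slide's examples (an .exe download, an IP address as host, a second domain embedded in the path); the rule names, the patterns and `first_matching_rule` are assumptions, not the project's rule set.

```cpp
#include <regex>
#include <string>
#include <vector>

struct MatchRule {
    std::string name;    // label reported when the rule fires
    std::regex pattern;  // compiled once, reused for every message
};

// Hypothetical rule set; the patterns are illustrative only.
const std::vector<MatchRule>& spam_rules() {
    static const auto flags = std::regex::ECMAScript | std::regex::icase;
    static const std::vector<MatchRule> rules = {
        {"exe-download", std::regex(R"(https?://[^/ ]+/[A-Za-z]*\.exe)", flags)},
        {"ip-in-url", std::regex(R"(https?://[0-9]{1,3}(\.[0-9]{1,3}){3}([:/ ]|$))", flags)},
        {"embedded-domain", std::regex(R"(https?://[^/ ]+/www\.[a-z0-9-]+\.[a-z]+)", flags)},
    };
    return rules;
}

// Returns the name of the first rule that matches, or "" when none does.
// A caller would only run the classifier on messages that return "".
std::string first_matching_rule(const std::string& body) {
    for (const MatchRule& rule : spam_rules()) {
        if (std::regex_search(body, rule.pattern)) return rule.name;
    }
    return "";
}
```

For example, a body containing `http://www.site.com/www.bank.com/` would be reported as `embedded-domain` and never reach the classifier.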
My way
Automata as match rules
● Gain accuracy!
● Gain performance!
● Because a rule can flag a message as spam before the classifier is used
● Deterministic finite automaton (DFA) in the rules, to detect:
– www.site.com/www.bank.com/
– URL/malware.exe: rule like URL/[a-zA-Z]*\.exe ...
– An IP address in a URL
– Phishing
● Use ranking!
Classifier NB: accuracy 94% +4%, performance Fast
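For the "IP address in a URL" rule, a hand-written deterministic finite automaton might look like the sketch below. Each state encodes how many octet groups and how many digits of the current group have been read; the state numbering, the helper `url_host` and all function names are assumptions for this example. The automaton checks only the shape (four groups of 1 to 3 digits), not that each octet is at most 255.

```cpp
#include <cstddef>
#include <string>

// Input classes the automaton distinguishes.
enum CharClass { DIGIT, DOT, OTHER };

CharClass classify_char(char c) {
    if (c >= '0' && c <= '9') return DIGIT;
    if (c == '.') return DOT;
    return OTHER;
}

// State = group * 4 + digits, with group in 0..3 and digits in 0..3.
// State 16 is the dead state: once entered, the input can never match.
const int DEAD = 16;

// Transition function of the DFA, written arithmetically instead of as a table.
int next_state(int state, CharClass cls) {
    if (state == DEAD) return DEAD;
    int group = state / 4;
    int digits = state % 4;
    switch (cls) {
        case DIGIT: return digits < 3 ? state + 1 : DEAD;
        case DOT:   return (digits >= 1 && group < 3) ? (group + 1) * 4 : DEAD;
        default:    return DEAD;
    }
}

// Accept only when the fourth group holds at least one digit.
bool is_accepting(int state) {
    return state != DEAD && state / 4 == 3 && state % 4 >= 1;
}

bool host_is_ipv4_literal(const std::string& host) {
    int state = 0;
    for (char c : host) state = next_state(state, classify_char(c));
    return is_accepting(state);
}

// Extract the host part of a URL (hypothetical helper, no full URL parsing).
std::string url_host(const std::string& url) {
    std::size_t start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    std::size_t end = url.find_first_of("/:?#", start);
    if (end == std::string::npos) return url.substr(start);
    return url.substr(start, end - start);
}
```

Each input character costs one transition and there is no backtracking, which is the performance argument for a DFA over a general regex engine.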
Why ranking?
Automata as match rules
● Gain accuracy!
Classifier NB: accuracy 94% +4%, performance Fast
E-mail audit
The project!
● All source code in C++, 100% open source!
● IMAP: communication
● Blacklists: DNS, bad domains, e-mail addresses ...
● Deterministic finite automaton: filters
● tf–idf (term frequency–inverse document frequency)
● Naive Bayes: classifier
E-mail audit
The project!
● In the future: use the GPU to run KNN and the automata ...
● Results with the GPU make everything fast ...
● Next step: 100% accuracy?
https://github.com/CoolerVoid/email_audit