Local Image Descriptor Based Phishing Web Page Recognition as an Open-Set Problem

Ahmet Selman Bozkır, Esra Eroglu and Murat Aydos
Hacettepe University, Department of Computer Engineering, TURKEY
Baskent University, Department of Management Sciences, TURKEY

Topics
• What is phishing?
• Types of phishing
• Facts about phishing
• Existing approaches
• Why vision based scheme?
• Proposed Method
• Details of Phish-IRIS dataset
• Experiments and results
• Conclusion

What is phishing?
• Phishing is a scam that creates a visual illusion for unsuspecting users by serving fake web pages that mimic their legitimate targets, in order to steal valuable digital data such as credit card information or e-mail passwords.
Phone phreaking + fishing -> «phishing»

Types of Phishing
• Spear phishing (Person focused)
• Clone phishing (Bulk)
• Whaling (Person/Institution focused)
• Rogue Wi-Fi (MITM)

Facts and figures
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report

Facts and figures
• In 2017, 700,000 unique phishing attacks were reported*
* Source: 2017 Quarter 3 Phishing Reports of APWG

Facts and figures
• The average lifetime of a phishing page is 32 hours*
• The risk of zero-day attacks is rising because such short-lived pages go undiscovered by blacklists
* Source: APWG, Phishing activity trends paper. [Online]. Available at http://www.antiphishing.org/resources/apwg-papers/

Facts and figures
90% of consumer-oriented phishing attacks targeted:
• financial institutions
• cloud storage/file hosting sites
• webmail and online services
• e-commerce sites
• payment services

Facts and figures
• financial institutions
• payment services.
• cloud storage/file hosting sites
• cryptocurrency

Existing Anti-Phishing Approaches
• Content & Blacklist: CANTINA [1], SpoofGuard [2], NetCraft [3]
• DOM based: Medvet et al. [4], Zhang et al. [5], Fu et al. [6]
• Vision based: Maurer et al. [7], Verilogo [8], DeltaPhish [10]
• Other: Chen et al. [9]

Why a vision based scheme?
• Substitution of textual HTML elements with <IMG> tags or applet-like rich internet application (RIA) content
• Zero-day attacks need proactive solutions
• Dynamic / AJAX-type content loading
• Different DOM organizations between the legitimate page and the phishing version that targets it
• Robustness against complex backgrounds or page layouts
• Most importantly, vision based solutions are in concordance with human perception

Our Proposal:
Use of SIFT and DAISY descriptors in Bag of Visual Words Representation
Scale Invariant Feature Transform
• Lowe, 2004
• Local patch based, key point driven sampling
• Scale Invariance
• Rotation Invariance
• Robustness against illumination

Our Proposal:
Use of SIFT and DAISY descriptors in Bag of Visual Words Representation
DAISY Descriptor
• Tola et al., 2010
• Local Patch based dense sampling
• Scale Invariance
• Robustness against illumination
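
The Bag of Visual Words step named in the proposal can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: it assumes the local descriptors (SIFT or DAISY vectors) have already been extracted, and the function names `build_codebook` and `bovw_histogram`, the use of plain k-means, and the L1 normalisation are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(descriptor, codebook):
    """Index of the closest visual word (centroid) to one descriptor."""
    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))

def build_codebook(descriptors, k, iterations=10, seed=0):
    """Plain Lloyd's k-means over descriptors pooled from the training images.

    k plays the role of the visual word count (e.g. 50, 100, 200 or 400).
    """
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centroids)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[i] = [sum(m[j] for m in members) / len(members)
                                for j in range(dim)]
    return centroids

def bovw_histogram(descriptors, codebook):
    """L1-normalised histogram of visual word occurrences for one image."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy 2-D "descriptors" stand in for 128-D SIFT vectors.
toy_descriptors = [[0, 1], [1, 0], [9, 9], [10, 11]]
toy_codebook = build_codebook(toy_descriptors, k=2)
toy_feature = bovw_histogram(toy_descriptors, toy_codebook)
```

Each screenshot is thereby reduced to a fixed-length vector of `k` values, whatever the number of local descriptors found on it, which is what allows a standard classifier to be trained on top.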

Phish-IRIS Data Set
Publicly available at https://web.cs.hacettepe.edu.tr/~selman/phish-iris-dataset

Phish-IRIS Data Set
• Motivation: there is no common phishing dataset tailored for vision based anti-phishing
• Based on real world observation and the literature
• 14 heavily phished target brands + legitimate samples
• Collected between March and May 2018 from PhishTank and OpenPhish
• Open-set problem -> “other” legitimate samples were collected as well
• Distinct screenshots
• 1313 training + 1539 testing samples
• Screenshots were collected via a specially implemented Java based wrapper equipped with Selenium WebDriver

Phish-IRIS Data Set
Brand Name        Training Samples   Testing Samples
Adobe             43                 27
Alibaba           50                 26
Amazon            18                 11
Apple             49                 15
Bank of America   81                 35
Chase Bank        74                 37
DHL               67                 42
Dropbox           75                 40
Facebook          87                 57
LinkedIn          24                 14
Microsoft         65                 53
PayPal            121                93
Wells Fargo       89                 45
Yahoo             70                 44
Other             400                1000
Total             1313               1539
• Share of brand samples used for training: roughly 2/3 (913 of 1452)
• Train/test ratio of the “unknown/other” class: 4/10 (400 vs. 1000)
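
The totals and ratios quoted for the dataset can be re-derived from the per-class counts. The short check below only copies the numbers from the table; the variable names are chosen for this illustration.

```python
# (training, testing) screenshot counts per class, copied from the table.
counts = {
    "Adobe": (43, 27), "Alibaba": (50, 26), "Amazon": (18, 11),
    "Apple": (49, 15), "Bank of America": (81, 35), "Chase Bank": (74, 37),
    "DHL": (67, 42), "Dropbox": (75, 40), "Facebook": (87, 57),
    "LinkedIn": (24, 14), "Microsoft": (65, 53), "PayPal": (121, 93),
    "Wells Fargo": (89, 45), "Yahoo": (70, 44), "Other": (400, 1000),
}

# Brand-only sums exclude the open-set "Other" class.
brand_train = sum(tr for name, (tr, te) in counts.items() if name != "Other")
brand_test = sum(te for name, (tr, te) in counts.items() if name != "Other")
total_train = sum(tr for tr, te in counts.values())
total_test = sum(te for tr, te in counts.values())

# Fraction of brand screenshots that ended up in the training set.
brand_train_share = brand_train / (brand_train + brand_test)
other_train, other_test = counts["Other"]

print(brand_train, brand_test)          # 913 539
print(total_train, total_test)          # 1313 1539
print(round(brand_train_share, 2))      # 0.63
print(other_train / other_test)         # 0.4
```

Note that the "Other" class is deliberately weighted toward the test set, so the classifier is evaluated mostly on legitimate pages it has never seen.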

Experiments
• We ran various experiments with the 2 descriptor types using the SVM, XGBoost and Random Forest algorithms
• We tested visual word counts of 50, 100, 200 and 400 in order to understand whether sparsity or weak/strong features affect prediction quality
• Models were trained on the training images and assessed on the test images
• Evaluation was based on the True Positive Rate (TPR), False Positive Rate (FPR) and F1 measures
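
For a multi-class problem, TPR, FPR and F1 have to be averaged over the classes. The slides do not state which averaging was used; the sketch below assumes support-weighted averaging, under which the weighted TPR equals overall accuracy (consistent with the near-identical "Test acc" and "TPR" columns in the result tables). The function name and the toy confusion matrix are illustrative only.

```python
def weighted_metrics(confusion):
    """Support-weighted TPR, FPR and F1 from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tpr_w = fpr_w = f1_w = 0.0
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # class c missed
        fp = sum(confusion[r][c] for r in range(n)) - tp  # others taken for c
        tn = total - tp - fn - fp
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
        weight = (tp + fn) / total                        # class support share
        tpr_w += weight * tpr
        fpr_w += weight * fpr
        f1_w += weight * f1
    return tpr_w, fpr_w, f1_w

# Toy 3-class example (rows: true class, columns: predicted class).
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 19],
]
toy_tpr, toy_fpr, toy_f1 = weighted_metrics(toy_confusion)
toy_accuracy = sum(toy_confusion[i][i] for i in range(3)) / 40
```

In this toy case `toy_tpr` and `toy_accuracy` are both 0.825, which illustrates why the two columns coincide under this averaging scheme.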

Results – 1: SIFT based prediction
Word count  Algorithm      Train acc  Test acc  TPR     FPR     F1
bov-50      svm            0.611      0.7732    0.7732  0.016   0.77
bov-50      xgboost        0.725      0.8187    0.8187  0.012   0.82
bov-50      random forest  0.729      0.842     0.842   0.112   0.84
bov-100     svm            0.674      0.803     0.803   0.014   0.80
bov-100     xgboost        0.762      0.846     0.8466  0.010   0.85
bov-200     svm            0.747      0.837     0.837   0.011   0.84
bov-200     xgboost        0.799      0.8589    0.8589  0.01    0.86
bov-400     svm            0.821      0.8758    0.875   0.008   0.88
bov-400     xgboost        0.827      0.8934    0.893   0.0076  0.89

Results – 2: DAISY based prediction
Word count  Algorithm      Train acc  Test acc  TPR     FPR     F1
bov-50      svm            0.648      0.7465    0.746   0.018   0.74
bov-50      xgboost        0.678      0.7849    0.784   0.015   0.78
bov-50      random forest  0.699      0.816     0.816   0.013   0.8
bov-100     svm            0.709      0.7758    0.775   0.016   0.77
bov-100     xgboost        0.709      0.7953    0.795   0.014   0.79
bov-200     svm            0.725      0.7901    0.79    0.014   0.79
bov-200     xgboost        0.722      0.8174    0.817   0.013   0.81
bov-400     svm            0.725      0.818     0.818   0.818   0.81
bov-400     xgboost        0.725      0.8122    0.812   0.013   0.8

Comparison with Dalgic et al. (2018)*
Method           # of Features  Learner        TPR    FPR     F1
JCD              5040           random forest  0.891  0.131   0.886
FCTH             5760           random forest  0.895  0.114   0.891
CEDD             4320           random forest  0.895  0.137   0.89
CLD              360            random forest  0.878  0.146   0.873
SCD              3584           svm            0.906  0.085   0.905
Our best (SIFT)  400            xgboost        0.893  0.0076  0.89
* Dalgic, F.C., Bozkir A.S., Aydos, M., “Phish-IRIS: A New Approach for Vision Based Brand Prediction of Phishing Web Pages via Compact Visual Descriptors”, ISMSIT, Kızılcahamam, Ankara, 2018

Conclusion
• SIFT based prediction outperforms DAISY based recognition
• Computing SIFT features takes less time than computing DAISY features
• A key finding is the importance of the sampling strategy: key point based sampling yields better results than dense sampling
• The more visual words we extract, the higher the accuracy we achieve, so sparsity was not found to be a problem for SIFT or DAISY
• Inference takes only 0.32 seconds on a PC equipped with an Intel 8750 and 16 GB of memory
• The Scalable Color Descriptor (SCD) still surpasses SIFT in terms of TPR and F1, meaning that color information is as important as edge / contour / texture information. However, SIFT achieves a much better false positive rate (0.0076 vs. 0.085) with far fewer features (400 vs. 3584)
• Consequently, SIFT is a suitable candidate for phishing web page detection/recognition
• Future work: use of deep convolutional neural networks

References
1. Y. Zhang, J. Hong, L. Cranor, CANTINA: A Content-Based Approach to Detecting Phishing Web Sites, WWW 2007.
2. N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh, J.C. Mitchell, Client-Side Defense against Web-Based Identity Theft, In Proceedings of the 11th Annual Network and Distributed System Security Symposium (NDSS '04).
3. Netcraft, Netcraft Anti-Phishing Toolbar. Visited: April 20, 2016. http://toolbar.netcraft.com/
4. E. Medvet, E. Kirda, C. Kruegel, Visual-Similarity-Based Phishing Detection, Securecomm ’08 International Conference on Security and Privacy in Communication Networks, 2008.
5. W. Zhang, H. Lu, B. Xu, H. Yang, Web Phishing Detection Based on Page Spatial Layout Similarity, Informatica, vol. 37, pp. 231-244, 2013.
6. A.Y. Fu, L. Wenyin, X. Deng, Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD), IEEE Transactions on Dependable and Secure Computing, pp. 301-311, 2006.
7. M.E. Maurer, D. Herzner, Using visual website similarity for phishing detection and reporting, In CHI’12 Extended Abstracts on Human Factors in Computing Systems, 2012.
8. G. Wang, H. Liu, S. Becerra, K. Wang, Verilogo: Proactive Phishing Detection via Logo Recognition, Technical Report CS2011-0669, UC San Diego, 2011.
9. T. Chen, S. Dick, J. Miller, Detecting Visually Similar Web Pages: Application to Phishing Detection, ACM Transactions on Internet Technology, 10(2), 2010.
