With the advent of e-commerce, digital services and social media, scammers have changed the way they gain illegal benefits, for example by capturing credit card information or exploiting personal cloud accounts, an activity termed phishing. Over the last two decades, a variety of countermeasures have been proposed against this cyber crime, such as HTML content based similarity analysis, URL based classification and, more recently, visual similarity based matching, since phishing web pages visually mimic their legitimate counterparts in order to deceive innocent users. In this study, we propose a computer vision and machine learning based approach that classifies whether a suspicious web page is phishing and further recognizes the brand it imitates. We investigated two local image descriptors, namely Scale Invariant Feature Transform (SIFT) and DAISY. Apart from common properties such as scale invariance, the two descriptors differ clearly: SIFT is additionally rotation invariant and employs key-point based sampling, whereas DAISY applies dense sampling by default. We therefore first investigated the feasibility of these two descriptors, along with the effects of sampling strategy and rotational invariance in this problem domain. To create a discriminative representation of a web page, we followed the bag of visual words (BOVW) approach with vocabulary sizes of 50, 100, 200 and 400. To evaluate the proposed approach, we used a publicly available phishing dataset containing snapshots of web pages sampled from 14 highly phished brands as well as ordinary legitimate web pages, yielding a challenging open-set problem. The dataset comprises 1313 training and 1539 testing images.
The visual features extracted via SIFT and DAISY were first transformed into BOVW histograms and fed to three machine learning methods: SVM, Random Forest and XGBoost. In the experiments, SIFT combined with XGBoost on a 400-word visual vocabulary was the best descriptor-learner configuration, reaching up to 89.34% validation accuracy with a 0.76% false positive rate. Moreover, SIFT outperformed DAISY in all settings. These results show that SIFT descriptors with a BOVW representation can be used effectively for brand identification of phishing web pages.
Local Image Descriptor Based Phishing Web Page Recognition as an Open-Set Problem
1. Local Image Descriptor Based Phishing Web Page Recognition as an Open-Set Problem
Ahmet Selman Bozkır, Esra Eroglu and Murat Aydos
Hacettepe University, Department of Computer Engineering, TURKEY
Baskent University, Department of Management Sciences, TURKEY
2. Topics
• What is phishing?
• Types of phishing
• Facts about phishing
• Existing approaches
• Why vision based scheme?
• Proposed Method
• Details of Phish-IRIS dataset
• Experiments and results
• Conclusion
3. What is phishing?
• Phishing is a scam that creates a visual illusion for innocent users by presenting fake web pages which mimic their legitimate targets, in order to steal valuable digital data such as credit card information or e-mail passwords.
Phone phreaking + fishing -> «phishing»
6. Facts and figures
• In 2017, 700,000 unique phishing attacks were reported*
* Source: 2017 Quarter 3 Phishing Reports of APWG
7. Facts and figures
• Average lifetime of phishing pages is 32 hours*
• The risk of zero-day attacks is growing because such pages go undiscovered by blacklists
* Source: APWG, Phishing activity trends paper. [Online]. Available at http://www.antiphishing.org/resources/apwg-papers/
10. Existing Anti-Phishing Approaches
• Content & Blacklist: CANTINA [1], SpoofGuard [2], NetCraft [3]
• DOM based: Medvet et al. [4], Zhang et al. [5], Fu et al. [6]
• Vision based: Maurer et al. [7], Verilogo [8], DeltaPhish [10]
• Other: Chen et al. [9]
11. Why a vision based scheme?
• Substitution of textual HTML elements with <IMG> or applet-like rich internet application (RIA) content
• Zero-day attacks need proactive solutions
• Dynamic / AJAX-type content loading
• Different DOM organizations between the legitimate page and its phishing version
• Robustness against complex backgrounds or page layouts
• Most importantly, vision based solutions are in concordance with human perception
12. Our Proposal:
Use of SIFT and DAISY descriptors in Bag of Visual Words Representation
Scale Invariant Feature Transform
• Lowe, 2004
• Local patch based, key-point driven sampling
• Scale Invariance
• Rotation Invariance
• Robustness against illumination
13. Our Proposal:
Use of SIFT and DAISY descriptors in Bag of Visual Words Representation
DAISY Descriptor
• Tola et al., 2010
• Local Patch based dense sampling
• Scale Invariance
• Robustness against illumination
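The bag of visual words step named in the slide title can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `nearest_word` and `bovw_histogram` are hypothetical, the toy 2-D vectors stand in for real local descriptors (SIFT descriptors are 128-dimensional), the vocabulary is given by hand rather than learned (in practice it is typically obtained by clustering training descriptors, e.g. with k-means), and the L1 normalization is an assumed choice.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor (Euclidean distance)."""
    distances = [math.dist(descriptor, word) for word in vocabulary]
    return distances.index(min(distances))


def bovw_histogram(descriptors, vocabulary):
    """Encode one page as an L1-normalized histogram over the visual vocabulary."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    # A page with no descriptors maps to an all-zero histogram (assumed convention).
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)


# Hypothetical toy data: a 3-word vocabulary and 4 local descriptors from one screenshot.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
page_descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]

histogram = bovw_histogram(page_descriptors, vocabulary)
print(histogram)  # [0.25, 0.5, 0.25]
```

The resulting fixed-length vector is what a learner such as SVM, Random Forest or XGBoost would consume, regardless of how many key points a particular screenshot produced.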
16. Phish-IRIS Data Set
• Lack of a common phishing dataset tailored for vision based
antiphishing
• Based on real world observation and literature.
• 14 heavily phished target brands + legitimate samples
• Collected between March and May 2018, Phishtank + Openphish
• Open-set problem -> Collect “other” legitimate samples
• Distinct screenshots
• 1313 Training + 1539 Testing samples
• Screenshots were collected via a specially implemented Java based
wrapper equipped with Selenium Web Driver
17. Phish-IRIS Data Set
Brand Name | Training Samples | Testing Samples
Adobe | 43 | 27
Alibaba | 50 | 26
Amazon | 18 | 11
Apple | 49 | 15
Bank of America | 81 | 35
Chase Bank | 74 | 37
DHL | 67 | 42
Dropbox | 75 | 40
Facebook | 87 | 57
LinkedIn | 24 | 14
Microsoft | 65 | 53
PayPal | 121 | 93
Wells Fargo | 89 | 45
Yahoo | 70 | 44
Other | 400 | 1000
Total | 1313 | 1539
• Ratio of train/test for brands:
• 2/3 (roughly)
• Ratio of “unknown/other”:
• 4/10
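The totals and ratios on this slide can be checked directly from the per-brand counts. The short sketch below only redoes that arithmetic; the slide does not say which quantity the "2/3" for brands refers to, so both the training share of brand samples and the test-to-train ratio are printed for comparison. Variable names are illustrative.

```python
# (training, testing) sample counts copied from the Phish-IRIS table
brands = {
    "Adobe": (43, 27), "Alibaba": (50, 26), "Amazon": (18, 11),
    "Apple": (49, 15), "Bank of America": (81, 35), "Chase Bank": (74, 37),
    "DHL": (67, 42), "Dropbox": (75, 40), "Facebook": (87, 57),
    "LinkedIn": (24, 14), "Microsoft": (65, 53), "PayPal": (121, 93),
    "Wells Fargo": (89, 45), "Yahoo": (70, 44),
}
other_train, other_test = 400, 1000

brand_train = sum(train for train, _ in brands.values())
brand_test = sum(test for _, test in brands.values())

print(brand_train, brand_test)                          # 913 539
print(brand_train + other_train, brand_test + other_test)  # 1313 1539

# Two candidate readings of the "roughly 2/3" figure for brands
train_share = brand_train / (brand_train + brand_test)
test_to_train = brand_test / brand_train
print(round(train_share, 3), round(test_to_train, 3))   # 0.629 0.59

# Train/test ratio for the "other" (legitimate) class
print(other_train / other_test)                         # 0.4
```

Note that the "other" class dominates the test set (1000 of 1539 images), which is what makes the false positive rate on legitimate pages such an important figure in the open-set setting.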
18. Experiments
• We ran various experiments with two types of descriptors using the SVM, XGBoost and Random Forest algorithms
• We tested visual word counts of 50, 100, 200 and 400 in order to understand whether sparsity or weak/strong features affect the prediction quality
• Models were trained on the training images and assessed on the test images
• Evaluation was based on the True Positive Rate, False Positive Rate and F1 measures
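For reference, the three measures can be computed per class in a one-vs-rest fashion, as in the sketch below. This is an illustration only: the slides do not state how per-class values were averaged into the reported figures, the function name `one_vs_rest_metrics` is hypothetical, and the label lists are made-up toy data rather than results from the study.

```python
def one_vs_rest_metrics(y_true, y_pred, label):
    """Return (TPR, FPR, F1) for one class, treating all other classes as negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall / sensitivity
    fpr = fp / (fp + tn) if fp + tn else 0.0        # false alarms among negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return tpr, fpr, f1


# Hypothetical toy predictions for six screenshots
y_true = ["paypal", "paypal", "other", "other", "adobe", "other"]
y_pred = ["paypal", "other", "other", "paypal", "adobe", "other"]

paypal_scores = one_vs_rest_metrics(y_true, y_pred, "paypal")
print(paypal_scores)  # (0.5, 0.25, 0.5)
```

In this toy case one of two PayPal pages is recognized (TPR 0.5) and one of four non-PayPal pages is wrongly flagged as PayPal (FPR 0.25).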
21. Comparison with (Bozkir et al. 2018*)
Method | # of Features | Learner | TPR | FPR | F1
JCD | 5040 | Random Forest | 0.891 | 0.131 | 0.886
FCTH | 5760 | Random Forest | 0.895 | 0.114 | 0.891
CEDD | 4320 | Random Forest | 0.895 | 0.137 | 0.89
CLD | 360 | Random Forest | 0.878 | 0.146 | 0.873
SCD | 3584 | SVM | 0.906 | 0.085 | 0.905
Our best (SIFT) | 400 | XGBoost | 0.893 | 0.0076 | 0.89
* Dalgic, F.C., Bozkir A.S., Aydos, M., “Phish-IRIS: A New Approach for Vision Based Brand Prediction of
Phishing Web Pages via Compact Visual Descriptors”, ISMSIT, Kızılcahamam, Ankara, 2018
22. Conclusion
• SIFT based prediction outperforms DAISY based recognition
• SIFT takes less time to compute than DAISY
• A key finding is the importance of the sampling strategy: key-point based sampling yields better results
• The more visual words we extract, the higher the accuracy we achieve, so sparsity was not found to be a problem for SIFT and DAISY
• Inference takes only 0.32 seconds on a PC equipped with Intel 8750 + 16 GB memory
• Scalable Color Descriptor still surpasses SIFT in terms of TPR and F1, meaning that color information is as important as edges / contours / textures. However, SIFT achieves a much better false positive rate with far fewer features
• Consequently, SIFT is a suitable candidate for phishing web page detection/recognition
• Future work: use of deep convolutional neural networks
23. References
1. Y. Zhang, J. Hong and L. Cranor, CANTINA: A Content-Based Approach to Detecting Phishing Web Sites, WWW 2007.
2. N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh and J.C. Mitchell, Client-Side Defense against Web-Based Identity Theft, In Proceedings of the 11th Annual Network and Distributed System Security Symposium (NDSS '04).
3. Netcraft, Netcraft Anti-Phishing Toolbar. Visited: April 20, 2016. http://toolbar.netcraft.com/
4. E. Medvet, E. Kirda and C. Kruegel, Visual-Similarity-Based Phishing Detection, SecureComm ’08 International Conference on Security and Privacy in Communication Networks, 2008.
5. W. Zhang, H. Lu, B. Xu and H. Yang, Web Phishing Detection Based on Page Spatial Layout Similarity, Informatica, vol. 37, pp. 231-244, 2013.
6. A.Y. Fu, L. Wenyin and X. Deng, Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD), IEEE Transactions on Dependable and Secure Computing, pp. 301-311, 2006.
7. M.E. Maurer and D. Herzner, Using visual website similarity for phishing detection and reporting, In CHI ’12 Extended Abstracts on Human Factors in Computing Systems, 2012.
8. G. Wang, H. Liu, S. Becerra and K. Wang, Verilogo: Proactive Phishing Detection via Logo Recognition, Technical Report CS2011-0669, UC San Diego, 2011.
9. T. Chen, S. Dick and J. Miller, Detecting Visually Similar Web Pages: Application to Phishing Detection, ACM Transactions on Internet Technology, 10(2), 2010.