Presentation of PolaritySpam, a graph-based ranking algorithm intended to demote the spam web pages in the ranking provided by a web search engine.
F. Javier Ortega, Craig Macdonald, José A. Troyano, and Fermín L. Cruz. “Spam Detection with a Content-based Random-Walk Algorithm”. In Proceedings of the Second International Workshop on Search and Mining User-Generated Contents (SMUC), at the International Conference on Information and Knowledge Management. Toronto, Canada, 2010.
Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)
Spam detection with a content-based random-walk algorithm. F. Javier Ortega (email@example.com), Craig Macdonald (firstname.lastname@example.org), José A. Troyano (email@example.com), Fermín Cruz (firstname.lastname@example.org)
Index ♦ Introduction ♦ Related work ♦ Content-based ♦ Link-based ♦ Our Approach ♦ Random-walk algorithm ♦ Content-based metrics ♦ Selection of seeds ♦ Experiments ♦ Future work ♦ References
Introduction ♦ Web Spam: a phenomenon in which web pages are created for the purpose of making a search engine deliver undesirable results for a given query.
Introduction ♦ Self-Promotion: gaining high relevance for a search engine, mainly based on the textual content, e.g. by including a large number of keywords in the web page.
Introduction ♦ Mutual-Promotion: gaining a high score by focusing attention on the out-links and in-links of a web page, e.g. a web page with lots of in-links can be considered relevant by a search engine.
Introduction ♦ Web Spam characteristics: ♦ Textual content: a large amount of invisible content, a set of words with high frequency, lots of hyperlinks with long anchor texts, very long words, etc. ♦ Link farms: large numbers of pages pointing to one another in order to improve their scores by increasing the number of in-links to them. ♦ Good pages usually point to good pages. ♦ Spam pages mainly point to other spam pages (link farms); they rarely point to good pages.
Related work: Content-based ♦ Content-based techniques classify web pages as spam or not-spam according to their textual content. ♦ Heuristics to determine the spam likelihood of a web page: ♦ Meta tag content, anchor texts, URL of the page, average length of the words, compression rate, etc. [10, 12] ♦ Inclusion of link-based scores and metrics into a classifier. ♦ Link-based techniques exploit the relations between web pages to obtain a ranking of pages, ordered according to their spam likelihood. ♦ Random-walk algorithms that penalize spam-like behaviors: ♦ Don't take into account the nearest neighbours. ♦ Take only the scores received from a specific set of good or bad pages. [7, 11]
Our Approach ♦ Our approach combines both techniques: ♦ A set of content-based metrics that obtain information from each single web page. ♦ A link-based algorithm that processes the relations between web pages. ♦ The goal is to obtain a ranking of web pages in which spam web pages are demoted according to their spam likelihood.
Our Approach ♦ [Diagram: web pages feed the content-based metrics, which drive the selection of seeds; the random-walk algorithm then runs over the web graph using those seeds.]
Our Approach: random-walk algorithm ♦ We propose a random-walk algorithm that computes two scores for each web page: ♦ PR⁺: relevance of a web page ♦ PR⁻: spam likelihood of a web page ♦ PR⁻(b) changes according to the relation of b with spam-like web pages (analogously for PR⁺): for a link a → b, the higher PR⁺(a), the higher PR⁺(b); the higher PR⁻(a), the higher PR⁻(b).
Our Approach: random-walk algorithm ♦ Formula: (given on the slide) ♦ Intuition: a page pointed to by pages with high PR⁺ gets a higher PR⁺; a page pointed to by pages with high PR⁻ gets a higher PR⁻.
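The intuition above can be sketched as a PageRank-style iteration that propagates the two scores independently along in-links. This is a minimal illustration, not the paper's exact formula (which is only shown on the slide); the a-priori bias terms and the uniform initialization are assumptions, while the damping factor (0.85) and convergence threshold (0.01) come from the experiments slide.

```python
def polarity_walk(graph, pos_bias, neg_bias, d=0.85, threshold=0.01):
    """graph: {node: [out-neighbours]}; pos_bias/neg_bias: a-priori
    seed scores (hypothetical encoding of the seed bias)."""
    nodes = list(graph)
    n = len(nodes)
    # Build the reverse adjacency list: who links to whom.
    in_links = {v: [] for v in nodes}
    for u, outs in graph.items():
        for v in outs:
            in_links[v].append(u)
    pr_pos = {v: 1.0 / n for v in nodes}
    pr_neg = {v: 1.0 / n for v in nodes}
    while True:
        new_pos, new_neg = {}, {}
        for v in nodes:
            # Each score propagates independently along in-links,
            # damped and mixed with the a-priori bias.
            new_pos[v] = (1 - d) * pos_bias.get(v, 1.0 / n) + d * sum(
                pr_pos[u] / len(graph[u]) for u in in_links[v])
            new_neg[v] = (1 - d) * neg_bias.get(v, 1.0 / n) + d * sum(
                pr_neg[u] / len(graph[u]) for u in in_links[v])
        delta = max(abs(new_pos[v] - pr_pos[v]) + abs(new_neg[v] - pr_neg[v])
                    for v in nodes)
        pr_pos, pr_neg = new_pos, new_neg
        if delta < threshold:  # threshold = 0.01 as in the slides
            return pr_pos, pr_neg
```

On a toy graph, a node pointed to by a trusted seed ends up with a higher PR⁺ than an isolated node, matching the intuition that good pages lift the pages they link to.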
Our Approach: content-based metrics ♦ Content-based metrics are intended to extract a-priori information from the textual content of the web pages. ♦ Content-based metrics must be: ♦ Easy to obtain: keep performance in mind! ♦ Accurate: precision is preferred over recall.
Our Approach: content-based metrics ♦ Selected metrics: ♦ Compressibility: ratio between the sizes of a web page before and after compression. ♦ Fraction of globally popular words: a web page in which a high fraction of the words are among the most popular words in the entire corpus is likely to be spam. ♦ Average length of words: non-spam web pages have a bell-shaped distribution of average word lengths, while malicious pages have much higher values.
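The three metrics above are cheap to compute; the sketch below shows one plausible implementation of each (the exact definitions and normalizations used in the paper may differ, and `zlib` as the compressor is an assumption).

```python
import zlib

def compressibility(html: str) -> float:
    """Compressed/original size ratio: keyword-stuffed, highly
    repetitive pages compress unusually well (low ratio)."""
    raw = html.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw) if raw else 1.0

def popular_word_fraction(text: str, popular: set) -> float:
    """Fraction of the page's words drawn from the corpus-wide
    set of most popular words."""
    words = text.lower().split()
    return sum(w in popular for w in words) / len(words) if words else 0.0

def avg_word_length(text: str) -> float:
    """Mean word length; spam pages tend to much higher values."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0
```

For example, a page consisting of one keyword repeated hundreds of times compresses to a far smaller ratio than an ordinary varied sentence.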
Our Approach: selection of seeds ♦ Seeds: a set of relevant nodes, in terms of spam likelihood (negative seeds) or not-spam likelihood (positive seeds). ♦ The algorithm gives more relevance to the seeds. ♦ Spam-biased algorithm.
Our Approach: selection of seeds ♦ Unsupervised method: content-based metrics as features to choose the seeds. ♦ Pros: ♦ Human intervention is not needed. ♦ A larger number of seeds can be considered. ♦ Inclusion of text content into a link-based method. ♦ Due to the lack of human intervention, some seeds may be “false positives”.
Our Approach: selection of seeds ♦ Obtaining an a-priori score for a node a: (formula on slide) ♦ Selecting seeds: ♦ Pos/Neg Approach ♦ Pos/Neg Metrics Approach ♦ Metric-based Approach
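The slide's formulas for the a-priori score and the three selection variants are not reproduced in this transcript, but the general idea can be illustrated with a hypothetical sketch: rank nodes by a combined a-priori spam score and take the extremes of the ranking as positive and negative seeds. The scoring scale and the fixed seed count are assumptions for illustration only.

```python
def select_seeds(scores, n_seeds):
    """scores: {node: a-priori spam likelihood, assumed in [0, 1]}.
    Returns (positive_seeds, negative_seeds) as lists of nodes."""
    ranked = sorted(scores, key=scores.get)
    pos = ranked[:n_seeds]    # lowest spam likelihood -> trusted seeds
    neg = ranked[-n_seeds:]   # highest spam likelihood -> spam seeds
    return pos, neg
```

Because no human labeling is involved, a mis-scored page can slip into either list, which is exactly the “false positives” caveat noted above.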
Experiments ♦ Dataset: WEBSPAM-UK2006* ♦ ~98 million pages ♦ 11,402 hand-labeled hosts ♦ 7,423 labeled as spam ♦ ~10 million spam web pages ♦ Terrier IR Platform ♦ Random-walk algorithm parameters: ♦ Damping factor = 0.85 ♦ Threshold = 0.01
* C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11–24, December 2006.
Experiments ♦ Baseline: TrustRank ♦ Link-based technique. ♦ Seeds chosen in a semi-supervised way: • Hand-picked set of good pages. • Top pages according to an inverse PageRank. ♦ Random-walk algorithm, biased according to the seeds.
Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. Technical Report 2004-17, Stanford InfoLab, March 2004.
Conclusions and future work ♦ Novel web spam detection technique that combines concepts from link-based and content-based methods: ♦ Content-based metrics as an unsupervised seed-selection method. ♦ Random-walk algorithm to compute two scores for each web page: spam and not-spam likelihood. ♦ Future work: ♦ Including new content-based heuristics. ♦ Improving the spam-biased selection of the seeds, taking into account the links to/from each node. ♦ Content-based metrics to also characterize the edges of the web graph.
References
1. L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In AIRWeb '06: Adversarial Information Retrieval on the Web, 2006.
2. A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank – fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
3. C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 423–430, New York, NY, USA, 2007. ACM.
4. G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Computing Research Repository, 2010.
5. L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, January 2005.
6. D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York, NY, USA, 2004. ACM.
7. Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. Technical Report 2004-17, Stanford InfoLab, March 2004.
8. T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. Technical Report 2003-29, 2003.
9. G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543, New York, NY, USA, 2002. ACM.
10. P. Kolari, T. Finin, and A. Joshi. SVMs for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2006.
11. V. Krishnan. Web spam detection with Anti-TrustRank. In ACM SIGIR Workshop on Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006.
12. A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.
13. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1999.
14. B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.
Thanks for your attention!! Questions? F. Javier Ortega Craig Macdonald email@example.com firstname.lastname@example.org José A. Troyano Fermín Cruz email@example.com firstname.lastname@example.org