Presentation of PolaritySpam, a graph-based ranking algorithm intended to demote spam web pages in the ranking provided by a web search engine.
Cite as:
F. Javier Ortega, Craig Macdonald, José A. Troyano, and Fermín L. Cruz. “Spam Detection with a Content-based Random-Walk Algorithm”. In Proceedings of the Second International Workshop on Search and Mining User-Generated Contents (SMUC), at the International Conference on Information and Knowledge Management. Toronto, Canada, 2010.
Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)
1. Spam Detection with a Content-based Random-walk Algorithm
F. Javier Ortega Craig Macdonald
javierortega@us.es craigm@dcs.gla.ac.uk
José A. Troyano Fermín Cruz
troyano@us.es fcruz@us.es
2. Index
♦ Introduction
♦ Related work
♦ Content-based
♦ Link-based
♦ Our Approach
♦ Random-walk algorithm
♦ Content-based metrics
♦ Selection of seeds
♦ Experiments
♦ Future work
♦ References
3. Introduction
♦ Web Spam: the phenomenon whereby web pages
are created for the purpose of making a search
engine deliver undesirable results for a given query.
4. Introduction
♦ Self-Promotion: gaining high relevance for a
search engine, mainly through the textual content.
e.g.: stuffing a web page with a large number of keywords.
5. Introduction
♦ Mutual-Promotion: gaining a high score by
focusing on the out-links and in-links of a web page.
e.g.: a web page with many in-links can be
considered relevant by a search engine.
6. Introduction
♦ Web Spam characteristics:
♦ Textual content: large amounts of invisible
content, a set of words with high frequency,
many hyperlinks with long anchor texts, very
long words, etc.
♦ Link-farms: large numbers of pages pointing
to one another, in order to improve their scores
by increasing their number of in-links.
♦ Good pages usually point to good pages.
♦ Spam pages mainly point to other spam pages (link-
farms). They rarely point to good pages.
7. Related work
♦ Content-based techniques classify web pages as spam or
not-spam according to their textual content.
♦ Heuristics to determine the spam likelihood of a web page.
♦ Meta tag content, anchor texts, URL of the page, average length of
the words, compression rate, etc. [10, 12]
♦ Inclusion of link-based scores and metrics into a classifier [3]
♦ Link-based techniques exploit the relations between web pages
to obtain a ranking of pages, ordered according to their spam
likelihood.
♦ Random-walk algorithms that penalize spam-like behaviors.
♦ Some do not take the nearest neighbours into account [1]
♦ Others take only the scores received from a specific set of good or bad
pages [7, 11]
8. Our Approach
♦ Our approach combines both techniques:
♦ A set of content-based metrics that obtain
information from each single web page.
♦ A link-based algorithm that processes the
relations between web pages.
♦ The goal is to obtain a ranking of web
pages, in which spam web pages are
demoted according to their spam
likelihood.
9. Our Approach
[Diagram: Web pages → Content-based metrics → Selection of seeds;
both feed the Random-walk algorithm, which runs over the Web graph]
10. Our Approach: random-walk algorithm
♦ We propose a random-walk algorithm that
computes two scores for each web page:
♦PR⁺: relevance of a web page
♦PR⁻: spam likelihood of a web page
♦ PR⁻(b) changes according to the relation of
b with spam-like web pages; PR⁺ behaves
analogously.
For a link a → b:
The higher PR⁺(a), the higher PR⁺(b).
The higher PR⁻(a), the higher PR⁻(b).
11. Our Approach: random-walk algorithm
♦ Formula: [equation shown as an image in the original slides]
♦ Intuition: a page linked from pages with high PR⁺
gets a higher PR⁺; a page linked from pages with
high PR⁻ gets a higher PR⁻.
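The formula image is omitted from these slides. As a rough sketch only (the exact update rule, seed bias, and normalization are assumptions not stated here), the two scores can be computed as a damped, seed-personalized random walk, run once per polarity:

```python
# Hedged sketch of the two-score random walk. The graph is a dict of
# out-links; pos_prior/neg_prior are a-priori seed weights. Damping
# (0.85) and convergence threshold (0.01) match the experiments section.
def polarity_walk(out_links, pos_prior, neg_prior, damping=0.85, threshold=0.01):
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    def walk(prior):
        # Personalized PageRank-style iteration: restarts land on the seeds.
        score = {n: prior.get(n, 0.0) for n in nodes}
        while True:
            nxt = {n: (1 - damping) * prior.get(n, 0.0) for n in nodes}
            for a, targets in out_links.items():
                if targets:
                    share = damping * score[a] / len(targets)
                    for b in targets:
                        nxt[b] += share
            if max(abs(nxt[n] - score[n]) for n in nodes) < threshold:
                return nxt
            score = nxt

    return walk(pos_prior), walk(neg_prior)  # (PR+, PR-)
```

A page linked from a high-PR⁻ seed inherits spam likelihood, and symmetrically for PR⁺.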
12. Our Approach: content-based metrics
♦ Content-based metrics are intended to
extract some a-priori information from the
textual content of the web pages.
♦ Content-based metrics must be:
♦ Cheap to compute: preserve performance!
♦ Accurate: precision is preferred over recall.
13. Our Approach: content-based metrics
♦ Selected metrics:
♦ Compressibility: ratio between the sizes of a web
page before and after compression.
♦ Fraction of globally popular words: a web
page in which a high fraction of the words are
among the most popular words in the entire
corpus is likely to be spam.
♦ Average length of words: non-spam web
pages have a bell-shaped distribution of
average word lengths, while malicious pages
show much higher values.
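A minimal sketch of the three metrics (the tokenization and the direction of the compression ratio are assumptions; the slides do not fix exact definitions):

```python
import zlib

def content_metrics(text, popular_words):
    # Split on whitespace; a real system would use a proper tokenizer.
    words = text.split()
    raw = text.encode("utf-8")
    # Compressibility: size before / size after compression. Spammy,
    # repetitive content compresses well and therefore scores high.
    compressibility = len(raw) / max(len(zlib.compress(raw)), 1)
    # Fraction of words that belong to the corpus-wide popular set.
    popular_fraction = sum(w.lower() in popular_words for w in words) / max(len(words), 1)
    # Average word length; unusually high values are a spam signal.
    avg_word_length = sum(len(w) for w in words) / max(len(words), 1)
    return compressibility, popular_fraction, avg_word_length
```

All three need only the page text itself, which keeps them cheap to compute.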
14. Our Approach: selection of seeds
♦ Seeds: a set of nodes that are relevant in terms
of spam likelihood (negative seeds) or
not-spam likelihood (positive seeds).
♦ The algorithm gives more relevance to the
seeds.
♦ This makes it a spam-biased algorithm.
15. Our Approach: selection of seeds
♦ Unsupervised method: content-based
metrics as features to choose the seeds.
♦ Pros:
♦ Human intervention is not needed.
♦ A larger number of seeds can be considered.
♦ Text content is included in a link-based
method.
♦ Cons: due to the lack of human intervention,
some seeds may be “false positives”.
16. Our Approach: selection of seeds
♦ Obtaining the a-priori score of a node a:
[equation shown as an image in the original slides]
♦ Selecting seeds, three strategies:
♦ Pos/Neg approach
♦ Pos/Neg Metrics approach
♦ Metric-based approach
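The slides list the three seed-selection strategies without their formulas. A generic sketch of the underlying idea (how the metrics combine into one a-priori score, and the seed fraction, are assumptions):

```python
def select_seeds(apriori_spam_score, fraction=0.1):
    # apriori_spam_score: node -> score in [0, 1] derived from the
    # content-based metrics (higher = more spam-like).
    ranked = sorted(apriori_spam_score, key=apriori_spam_score.get)
    k = max(int(len(ranked) * fraction), 1)
    pos_seeds = ranked[:k]    # least spam-like -> positive seeds
    neg_seeds = ranked[-k:]   # most spam-like  -> negative seeds
    return pos_seeds, neg_seeds
```

The seed sets then serve as the restart distributions that bias the random-walk algorithm.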
17. Experiments
♦ Dataset: WEBSPAM-UK2006*
♦ ~98 million pages
♦ 11,402 hand-labeled hosts
♦ 7,423 labeled as spam.
♦ ~10 million spam web pages
♦ Terrier IR Platform
♦ Random-walk algorithm parameters:
♦ Damping factor = 0.85
♦ Threshold = 0.01
* C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for
web spam. SIGIR Forum, 40(2):11–24, December 2006.
19. Experiments
♦ Baseline: TrustRank
♦ Link-based technique.
♦ Seeds chosen in a semi-supervised way:
• Hand-picked set of good pages.
• Top pages according to an inverse PageRank.
♦ Random-walk algorithm, biased according to the seeds.
Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam
with TrustRank. Technical Report 2004-17, Stanford InfoLab, March 2004.
22. Conclusions and future work
♦ Novel web spam detection technique, that combines
concepts from link and content-based methods.
♦ Content-based metrics as an unsupervised seed
selection method.
♦ Random-walk algorithm to compute two scores for each
web page: spam and not-spam likelihood.
♦ Future work:
♦ Including new content-based heuristics.
♦ Improving the spam-biased selection of the seeds
by taking into account the links to/from each node.
♦ Using content-based metrics to also characterize
the edges of the web graph.
23. References
[1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web
spam. In AIRWeb’06: Adversarial Information Retrieval on the Web, 2006.
[2] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank – fully automatic link spam detection. In Proceedings of
the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web
topology. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and
development in information retrieval, pages 423–430, New York, NY, USA, 2007. ACM.
[4] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web
datasets. Computing Research Repository, 2010.
[5] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of
measurements. Advances in Physics, 56(1):167–242, January 2005.
[6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web
pages. In WebDB ’04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York,
NY, USA, 2004. ACM.
[7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. Technical Report 2004-17, Stanford
InfoLab, March 2004.
[8] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. Technical Report
2003-29, 2003.
[9] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA, 2002.
ACM.
[10] P. Kolari, T. Finin, and A. Joshi. SVMs for the blogosphere: Blog identification and splog detection. In AAAI Spring
Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University
of Maryland, Baltimore County, March 2006.
[11] V. Krishnan. Web spam detection with anti-trustrank. In ACM SIGIR workshop on Adversarial Information Retrieval on the
Web, Seattle, Washington, USA, 2006.
[12] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW ’06:
Proceedings of the 15th international conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.
[13] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1999.
[14] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust
for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.
24. Thanks for your attention!!
Questions?
F. Javier Ortega Craig Macdonald
javierortega@us.es craigm@dcs.gla.ac.uk
José A. Troyano Fermín Cruz
troyano@us.es fcruz@us.es