Spam detection with a content-based random-walk algorithm
  F. Javier Ortega     Craig Macdonald
  javierortega@us.es   craigm@dcs.gla.ac.uk
  José A. Troyano      Fermín Cruz
  troyano@us.es        fcruz@us.es
Index

   ♦ Introduction
   ♦ Related work
        ♦ Content-based
        ♦ Link-based
   ♦ Our Approach
        ♦ Random-walk algorithm
        ♦ Content-based metrics
        ♦ Selection of seeds
   ♦ Experiments
   ♦ Future work
   ♦ References
Introduction

♦ Web Spam: phenomenon where a number
  of web pages are created for the purpose
  of making a search engine deliver
  undesirable results for a given query.
Introduction
♦ Self-Promotion: gaining high relevance for a
  search engine, mainly based on the textual
  content.
     e.g.: including a number of keywords in the web page.
Introduction

♦ Mutual-Promotion: gaining high score by
  focusing the attention on the out-links and in-links
  of a web page.
   e.g.: a web page with lots of in-links
   can be considered relevant by a search
   engine.
Introduction
♦ Web Spam characteristics:
  ♦ Textual content: large amount of invisible
    content, a set of words with high frequency,
    lots of hyperlinks with large anchor texts, very
    long words, etc.


  ♦ Link-farms: large number of pages pointing
    one to another, in order to improve their scores
    by increasing the number of in-links to them.
     ♦ Good pages usually point to good pages.
     ♦ Spam pages mainly point to other spam pages (link-
       farms). They rarely point to good pages.
Related work: Content-based

♦ Content-based techniques classify the web pages as spam or
  not-spam according to their textual content.
♦ Heuristics to determine the spam likelihood of a web page.
   ♦ Meta tag content, anchor texts, URL of the page, average length of
     the words, compression rate, etc. [10, 12]
   ♦ Inclusion of link-based scores and metrics into a classifier [3]



Related work: Link-based

♦ Link-based techniques exploit the relations between web pages
  to obtain a rank of pages, ordered according to their spam
  likelihood.
♦ Random-walk algorithms that penalize spam-like behaviors.
   ♦ Don't take into account the nearest neighbours [1]
   ♦ Take only the scores received from a specific set of good or bad pages.
     [7,11]
Our Approach

♦ Our approach combines both techniques:
  ♦ A set of content-based metrics that obtain
    information from each individual web page.
  ♦ A link-based algorithm that processes the
    relations between web pages.


♦ The goal is to obtain a ranking of web
  pages, in which spam web pages are
  demoted according to their spam
  likelihood.
Our Approach


[Figure: pipeline diagram with the boxes Web pages, Content-based metrics, Selection of seeds, Random-walk algorithm and Web graph]
Our Approach: random-walk algorithm

♦ We propose a random-walk algorithm that
  computes two scores for each web page:
       ♦PR⁺: relevance of a web page
       ♦PR⁻: spam likelihood of a web page


♦ PR⁻(b) changes according to the relation of
  b with spam-like web pages; PR⁺(b) behaves
  analogously.
   [Figure: two linked pages, a and b]
   ♦ The higher PR⁺(a), the higher PR⁺(b).
   ♦ The higher PR⁻(a), the higher PR⁻(b).
Our Approach: random-walk algorithm

♦ Formula:



♦ Intuition:
   [Figure: a page linked to a high-PR⁺ page gets a higher PR⁺; a page linked to a high-PR⁻ page gets a higher PR⁻]
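The slide's formula did not survive extraction, so the following is only a minimal sketch of the idea, not the authors' formula. It assumes each score can be approximated by a seed-biased (personalized) PageRank: one walk jumps back to the positive seeds and yields PR⁺, the other jumps back to the negative seeds and yields PR⁻. The function names, the toy graph, and the reading of the threshold as an L1 convergence bound are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def biased_pagerank(out_links: Dict[str, List[str]], seeds: Set[str],
                    damping: float = 0.85, threshold: float = 0.01,
                    max_iter: int = 100) -> Dict[str, float]:
    """Seed-biased PageRank: the random jump lands only on seed pages."""
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    # Jump uniformly to the seeds (or to any page if no seed is known).
    base = (seeds & nodes) or nodes
    teleport = {v: (1.0 / len(base) if v in base else 0.0) for v in nodes}
    score = dict(teleport)
    for _ in range(max_iter):
        # Mass held by pages without out-links goes back through the jump vector.
        dangling = sum(score[v] for v in nodes if not out_links.get(v))
        new = {v: (1.0 - damping + damping * dangling) * teleport[v]
               for v in nodes}
        for u, targets in out_links.items():
            if targets:
                share = damping * score[u] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < threshold:  # assumed meaning: L1 change between iterations
            break
    return score


def two_scores(out_links: Dict[str, List[str]], positive_seeds: Set[str],
               negative_seeds: Set[str]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (PR+, PR-) as two independent seed-biased walks (an assumption)."""
    return (biased_pagerank(out_links, positive_seeds),
            biased_pagerank(out_links, negative_seeds))


# Hypothetical toy graph: two good pages and a two-page link farm.
graph = {"g1": ["g2"], "g2": ["g1"], "s1": ["s2"], "s2": ["s1"]}
pr_pos, pr_neg = two_scores(graph, {"g1"}, {"s1"})
```

On this toy graph the good pages share all of the PR⁺ mass and the link farm shares all of the PR⁻ mass, because neither group links to the other.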
Our Approach: content-based metrics

♦ Content-based metrics are intended to
  extract some a-priori information from the
  textual content of the web pages.

♦ Content-based metrics must be:
  ♦ Easy to obtain: they must not hurt performance.
  ♦ Accurate: precision is preferred over recall.
Our Approach: content-based metrics

♦ Selected metrics:
  ♦ Compressibility: ratio between the sizes of a
    web page before and after being compressed.
  ♦ Fraction of globally popular words: a web
    page with a high fraction of its words among the
    most popular words in the entire corpus is
    likely to be spam.
  ♦ Average length of words: non-spam web
    pages have a bell-shaped distribution of
    average word lengths, while malicious pages
    have much higher values.
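The three metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the tokenizer, the use of zlib, the direction of the compression ratio (uncompressed size divided by compressed size), and all function names are choices made here.

```python
import re
import zlib
from collections import Counter
from typing import Iterable, List, Set

WORD_RE = re.compile(r"[A-Za-z0-9]+")  # assumed tokenizer: ASCII alphanumeric runs


def tokenize(text: str) -> List[str]:
    return [w.lower() for w in WORD_RE.findall(text)]


def compressibility(text: str) -> float:
    """Uncompressed size divided by zlib-compressed size (assumed direction)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def top_words(corpus: Iterable[str], k: int) -> Set[str]:
    """The k most frequent words over the whole corpus."""
    counts = Counter(w for doc in corpus for w in tokenize(doc))
    return {w for w, _ in counts.most_common(k)}


def popular_word_fraction(text: str, popular: Set[str]) -> float:
    """Share of the page's words that belong to the globally popular set."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(w in popular for w in words) / len(words)


def average_word_length(text: str) -> float:
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)
```

Highly repetitive text, such as a keyword repeated hundreds of times, compresses far better than ordinary prose, which is why a high ratio is read as a spam signal.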
Our Approach: selection of seeds

♦ Seeds: set of relevant nodes, in terms of
  spam (negative seeds) or not-spam
  likelihood (positive seeds).

♦ The algorithm gives more relevance to the
  seeds.



♦ Spam-biased algorithm
Our Approach: selection of seeds

♦ Unsupervised method: content-based
  metrics as features to choose the seeds.

♦ Pros:
     ♦Human intervention is not needed.
     ♦Larger number of seeds can be considered.
     ♦Inclusion of text content into a link-based
      method.
♦ Cons:
     ♦“False positives”, due to the lack of human intervention.
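As a rough illustration of unsupervised seed selection, the sketch below labels a page as a negative seed only when every content metric looks spam-like, and as a positive seed only when none does. This is a hypothetical rule, not one of the approaches named in the slides; the `PageMetrics` type, the cut-off values, and the function name are all invented for the example. Requiring the metrics to agree reflects the preference for precision over recall.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class PageMetrics:
    compressibility: float
    popular_fraction: float
    avg_word_length: float


# Hypothetical cut-offs; real values would have to be tuned on a corpus.
SPAM_LIKE = PageMetrics(compressibility=4.0, popular_fraction=0.6,
                        avg_word_length=8.0)


def select_seeds(metrics: Dict[str, PageMetrics],
                 limits: PageMetrics = SPAM_LIKE) -> Tuple[Set[str], Set[str]]:
    """Return (positive_seeds, negative_seeds) without any human labels."""
    positive: Set[str] = set()
    negative: Set[str] = set()
    for page, m in metrics.items():
        flags = [
            m.compressibility > limits.compressibility,
            m.popular_fraction > limits.popular_fraction,
            m.avg_word_length > limits.avg_word_length,
        ]
        if all(flags):
            negative.add(page)      # every metric looks spam-like
        elif not any(flags):
            positive.add(page)      # no metric looks spam-like
        # Pages with mixed signals are left unlabeled to limit false positives.
    return positive, negative


# Hypothetical pages with made-up metric values.
pages = {
    "farm.example": PageMetrics(9.0, 0.8, 11.0),
    "news.example": PageMetrics(2.5, 0.3, 5.0),
    "mixed.example": PageMetrics(6.0, 0.2, 5.5),
}
positive_seeds, negative_seeds = select_seeds(pages)
```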
Our Approach: selection of seeds
♦ Obtaining an a-priori score for a node a:



♦ Selecting seeds:
  ♦ Pos/Neg Approach:



  ♦ Pos/Neg Metrics Approach:


  ♦ Metric-based Approach
Experiments
    ♦ Dataset: WEBSPAM-UK2006*
          ♦ ~98 million pages
          ♦ 11,402 hand-labeled hosts
          ♦ 7,423 labeled as spam.
          ♦ ~10 million spam web pages


    ♦ Terrier IR Platform


    ♦ Random-walk algorithm parameters:
          ♦ Damping factor = 0.85
          ♦ Threshold = 0.01
* C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for
web spam. SIGIR Forum, 40(2):11–24, December 2006.
Experiments

♦ Evaluation: PR-buckets
   [Figure: pages sorted by PageRank (relevance) and split into PR-bucket 1, PR-bucket 2, PR-bucket 3, PR-bucket 4, ...]

   Bucket   Total pages
   1        14
   2        54
   3        144
   4        437
   5        1070
   6        2130
   7        2664
   8        2778
   ...      ...
   17       16M
   18       28M
   19       28M
   20       28M

   Total PR =
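The bucket sizes above grow from 14 pages to 28M pages, which is consistent with buckets that each hold an equal share of the total PageRank, as in the TrustRank evaluation [7]. The slide's own definition is truncated, so the sketch below rests on that assumption; the function names and the toy scores are hypothetical.

```python
from typing import Dict, List, Set


def pr_buckets(pagerank: Dict[str, float], n_buckets: int = 20) -> List[List[str]]:
    """Split pages, sorted by decreasing PageRank, into buckets of roughly equal total PageRank."""
    ranked = sorted(pagerank, key=pagerank.get, reverse=True)
    target = sum(pagerank.values()) / n_buckets
    buckets: List[List[str]] = [[]]
    accumulated = 0.0
    for page in ranked:
        # Open a new bucket once the buckets so far hold their share of the total.
        if accumulated >= target * len(buckets) and len(buckets) < n_buckets:
            buckets.append([])
        buckets[-1].append(page)
        accumulated += pagerank[page]
    return buckets


def spam_per_bucket(buckets: List[List[str]], spam_pages: Set[str]) -> List[int]:
    """Count labeled spam pages in each bucket; fewer in the top buckets is better."""
    return [sum(page in spam_pages for page in bucket) for bucket in buckets]


# Hypothetical toy scores, split into two buckets.
scores = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
buckets = pr_buckets(scores, n_buckets=2)
spam_counts = spam_per_bucket(buckets, {"c", "d"})
```

Because high-ranked pages carry most of the PageRank mass, the first buckets contain few pages and the last ones contain many, mirroring the table.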
Experiments

♦ Baseline: TrustRank
   ♦ Link-based technique.
   ♦ Seeds chosen in a semi-supervised way:
      • Hand-picked set of good pages.
      • Top pages according to an inverse PageRank.
   ♦ Random-walk algorithm, biased according to the
     seeds


Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web
   spam with trustrank. Technical Report 2004-17, Stanford
   InfoLab, March 2004
Experiments
[Figure panels: TrustRank, Pos/Neg Approach, Pos/Neg Metrics Approach, Metric-based Approach]
Experiments

[Plot: series TrustRank, Pos/Neg, Pos/Neg Metrics, MetricsBased; x-axis from 1 to 10, logarithmic y-axis from 1 to 1000]
Conclusions and future work
♦ Novel web spam detection technique that combines
  concepts from link-based and content-based methods.
   ♦ Content-based metrics as an unsupervised seed
     selection method.
   ♦ Random-walk algorithm to compute two scores for each
     web page: spam and not-spam likelihood.


♦ Future work:
   ♦ Including new content-based heuristics.
   ♦ Improving the spam-biased selection of the seeds,
     taking into account the links to/from each node.
   ♦ Content-based metrics to also characterize the edges of
     the web graph.
References
[1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web
      spam. In AIRWeb’06: Adversarial Information Retrieval on the Web, 2006.
[2] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank - fully automatic link spam detection. In
     Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web
     topology. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and
     development in information retrieval, pages 423–430, New York, NY, USA, 2007. ACM.
[4] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web
     datasets. Computing Research Repository, 2010.
[5] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of
      measurements. Advances in Physics, 56(1):167–242, January 2005.
[6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web
     pages. In WebDB ’04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York,
     NY, USA, 2004. ACM.
[7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. Technical Report 2004-17, Stanford
     InfoLab, March 2004.
[8] T. H. Haveliwala. Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. Technical Report
     2003-29, 2003.
[9] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM
     SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA, 2002.
     ACM.
[10] P. Kolari, T. Finin, and A. Joshi. Svms for the blogosphere: Blog identification and splog detection. In AAAI Spring
     Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University
     of Maryland, Baltimore County, March 2006.
[11] V. Krishnan. Web spam detection with anti-trustrank. In ACM SIGIR workshop on Adversarial Information Retrieval on the
     Web, Seattle, Washington, USA, 2006.
[12] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW ’06:
     Proceedings of the 15th international conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.
[13] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web, 1999.
[14] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust
     for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.
Thanks for your attention!!

                 Questions?




Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)

  • 1. Spam detection with a content-based random-walk algorithm F. Javier Ortega Craig Macdonald javierortega@us.es craigm@dcs.gla.ac.uk José A. Troyano Fermín Cruz troyano@us.es fcruz@us.es
  • 2. Index ♦ Introduction ♦ Related work ♦ Content-based ♦ Link-based ♦ Our Approach ♦ Random-walk algorithm ♦ Content-based metrics ♦ Selection of seeds ♦ Experiments ♦ Future work ♦ References
  • 3. Introduction ♦ Web Spam: phenomenon where a number of web pages are created for the purpose of making a search engine deliver undesirable results for a given query.
  • 4. Introduction ♦ Self-Promotion: gaining high relevance for a search engine, mainly based on the textual content, e.g. by including a number of keywords in the web page.
  • 5. Introduction ♦ Mutual-Promotion: gaining a high score by focusing the attention on the out-links and in-links of a web page, e.g. a web page with lots of in-links can be considered relevant by a search engine.
  • 6. Introduction ♦ Web Spam characteristics: ♦ Textual content: large amount of invisible content, a set of words with high frequency, lots of hyperlinks with large anchor texts, very long words, etc. ♦ Link-farms: large number of pages pointing one to another, in order to improve their scores by increasing the number of in-links to them. ♦ Good pages usually point to good pages. ♦ Spam pages mainly point to other spam pages (link-farms). They rarely point to good pages.
  • 7. Related work: Content-based ♦ Content-based techniques classify the web pages as spam or not-spam according to their textual content. ♦ Heuristics to determine the spam likelihood of a web page. ♦ Meta tag content, anchor texts, URL of the page, average length of the words, compression rate, etc. [10, 12] ♦ Inclusion of link-based scores and metrics into a classifier [3] ♦ Link-based techniques exploit the relations between web pages to obtain a rank of pages, ordered according to their spam likelihood. ♦ Random-walk algorithms that penalize spam-like behaviors. ♦ Don't take into account the nearest neighbours [1] ♦ Take only the scores received from a specific set of good or bad pages. [7, 11]
  • 8. Our Approach ♦ Our approach combines both techniques: ♦ A set of content-based metrics that obtain information from each single web page. ♦ A link-based algorithm that processes the relations between web pages. ♦ The goal is to obtain a ranking of web pages in which spam web pages are demoted according to their spam likelihood.
  • 9. Our Approach [Diagram with the components: Web pages, Content-based metrics, Selection of seeds, Web graph, Random-walk algorithm]
  • 10. Our Approach: random-walk algorithm ♦ We propose a random-walk algorithm that computes two scores for each web page: ♦ PR⁺: relevance of a web page ♦ PR⁻: spam likelihood of a web page ♦ PR⁻(b) changes according to the relation of b with spam-like web pages; PR⁺(b) behaves analogously. ♦ For two linked pages a and b: the higher PR⁺(a), the higher PR⁺(b); the higher PR⁻(a), the higher PR⁻(b).
  • 11. Our Approach: random-walk algorithm ♦ Formula: ♦ Intuition: [Figure: a page related to pages with high PR⁺ gets a higher PR⁺; a page related to pages with high PR⁻ gets a higher PR⁻]
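The two-score random walk of slides 10 and 11 can be illustrated with a minimal Python sketch. The slide's formula is not present in this transcript, so the update rule below is an assumption: it runs a standard seed-biased (personalized) PageRank twice, once with hypothetical positive seeds to obtain PR⁺ and once with hypothetical negative seeds to obtain PR⁻. The function name `biased_pagerank` and the toy graph are illustrative only, and the defaults mirror the damping factor and threshold listed in the experiments slide; this should not be read as the authors' exact method.

```python
def biased_pagerank(out_links, seed_weights, damping=0.85, threshold=0.01, max_iter=100):
    """Seed-biased PageRank over a dict {page: [pages it links to]}.

    Assumption: every link target also appears as a key of out_links.
    """
    nodes = list(out_links)
    total = sum(seed_weights.get(n, 0.0) for n in nodes)
    if total > 0:
        # Teleport only to the seeds, proportionally to their weights.
        teleport = {n: seed_weights.get(n, 0.0) / total for n in nodes}
    else:
        # No seeds given: fall back to the uniform teleport of plain PageRank.
        teleport = {n: 1.0 / len(nodes) for n in nodes}

    score = dict(teleport)
    for _ in range(max_iter):
        new = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                # Split the damped score of src evenly among its out-links.
                share = damping * score[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # Dangling page: hand its damped score back to the seeds.
                for n in nodes:
                    new[n] += damping * score[src] * teleport[n]
        delta = sum(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < threshold:
            break
    return score


# Toy graph: two good pages, a two-page link farm, and a page pointing to g1.
web_graph = {
    "g1": ["g2"],
    "g2": ["g1"],
    "s1": ["s2"],
    "s2": ["s1"],
    "x": ["g1"],
}
pr_plus = biased_pagerank(web_graph, {"g1": 1.0})   # hypothetical positive seed
pr_minus = biased_pagerank(web_graph, {"s1": 1.0})  # hypothetical negative seed
```

In this toy graph the positive score stays inside the g1/g2 pair and the negative score stays inside the link farm, which is the behaviour the intuition slide describes. How the two scores are combined into a final ranking is left open here.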
  • 12. Our Approach: content-based metrics ♦ Content-based metrics are intended to extract some a-priori information from the textual content of the web pages. ♦ Content-based metrics must be: ♦ Easy to obtain: save the performance! ♦ Accurate: precision is preferred over recall.
  • 13. Our Approach: content-based metrics ♦ Selected metrics: ♦ Compressibility: ratio of the sizes of a web page before and after being compressed. ♦ Fraction of globally popular words: a web page in which a high fraction of the words belong to the most popular words of the entire corpus is likely to be spam. ♦ Average length of words: non-spam web pages have a bell-shaped distribution of average word lengths, while malicious pages have much higher values.
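The three metrics listed above are cheap to compute, as the following Python sketch suggests. Several details are assumptions rather than part of the slides: the tokenizer is a simple regular expression, `zlib` stands in for whatever compressor was actually used, compressibility is taken as uncompressed size divided by compressed size, and `top_k` (the size of the "globally popular" vocabulary) is a hypothetical parameter.

```python
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lower-case alphanumeric tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compressibility(text):
    """Size before compression divided by size after compression (assumed direction)."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(raw) / len(zlib.compress(raw))


def popular_words(corpus, top_k):
    """The top_k most frequent tokens over all documents of the corpus."""
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    return {word for word, _ in counts.most_common(top_k)}


def fraction_popular(text, popular):
    """Fraction of the page's tokens that belong to the popular set."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in popular) / len(tokens)


def average_word_length(text):
    """Mean token length in characters."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(len(tok) for tok in tokens) / len(tokens)
```

A page that repeats the same keyword many times compresses far better than ordinary prose, so its compressibility value is much higher; the other two functions return plain fractions or averages that can be thresholded or fed to the seed selection step.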
  • 14. Our Approach: selection of seeds ♦ Seeds: set of relevant nodes, in terms of spam (negative seeds) or not-spam likelihood (positive seeds). ♦ The algorithm gives more relevance to the seeds. ♦ Spam-biased algorithm
  • 15. Our Approach: selection of seeds ♦ Unsupervised method: content-based metrics as features to choose the seeds. ♦ Pros: ♦ Human intervention is not needed. ♦ A larger number of seeds can be considered. ♦ Inclusion of text content into a link-based method. ♦ Cons: due to the lack of human intervention, “false positives” can appear among the seeds.
  • 16. Our Approach: selection of seeds ♦ Obtaining a-priori score for a node, a: ♦ Selecting seeds: ♦ Pos/Neg Approach: ♦ Pos/Neg Metrics Approach: ♦ Metric-based Approach
  • 17. Experiments ♦ Dataset: WEBSPAM-UK2006* ♦ ~98 million pages ♦ 11,402 hand-labeled hosts ♦ 7,423 labeled as spam. ♦ ~10 million spam web pages ♦ Terrier IR Platform ♦ Random-walk algorithm parameters: ♦ Damping factor = 0.85 ♦ Threshold = 0.01 * C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11–24, December 2006.
  • 18. Experiments ♦ Evaluation: PR-buckets [Figure: pages ordered by PageRank relevance and grouped into PR-bucket 1, PR-bucket 2, PR-bucket 3, PR-bucket 4, ...]
        Bucket   Total pages
        1        14
        2        54
        3        144
        4        437
        5        1070
        6        2130
        7        2664
        8        2778
        ...      ...
        17       16M
        18       28M
        19       28M
        20       28M
        Total PR =
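The bucket sizes above grow quickly, which is what one would expect if each bucket holds roughly the same share of the total PageRank. The slide's "Total PR =" expression is cut off, so that equal-share rule is an assumption borrowed from the TrustRank evaluation, not something this transcript confirms. Under that assumption, a minimal Python sketch of the partition and of counting spam pages per bucket could look as follows; `pr_buckets` and `spam_per_bucket` are hypothetical helper names.

```python
def pr_buckets(scores, n_buckets=20):
    """Split pages, sorted by decreasing score, into buckets of roughly equal total score.

    Assumption: a page goes to the bucket given by the cumulative score share
    accumulated *before* it, so high-scoring pages fill the first buckets.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(scores.values())
    buckets = [[] for _ in range(n_buckets)]
    if total <= 0:
        # Degenerate case: no score mass, everything goes to the first bucket.
        buckets[0] = [page for page, _ in ranked]
        return buckets
    cumulative = 0.0
    for page, score in ranked:
        index = min(int(cumulative / total * n_buckets), n_buckets - 1)
        buckets[index].append(page)
        cumulative += score
    return buckets


def spam_per_bucket(buckets, spam_pages):
    """Number of known spam pages in each bucket."""
    return [sum(1 for page in bucket if page in spam_pages) for bucket in buckets]
```

A ranking that demotes spam well should show few spam pages in the first buckets and most of them in the last ones.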
  • 19. Experiments ♦ Baseline: TrustRank ♦ Link-based technique. ♦ Seeds chosen in a semi-supervised way: • Hand-picked set of good pages. • Top pages according to an inverse PageRank. ♦ Random-walk algorithm, biased according to the seeds. Reference: Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. Technical Report 2004-17, Stanford InfoLab, March 2004.
  • 20. Experiments [Figures, one panel per method: TrustRank, Pos/Neg Approach, Pos/Neg Metrics Approach, Metric-based Approach]
  • 21. Experiments [Chart: logarithmic vertical axis from 1 to 1000, horizontal axis from 1 to 10; series: TrustRank, Pos/Neg, Pos/Neg Metrics, MetricsBased]
  • 22. Conclusions and future work ♦ Novel web spam detection technique, that combines concepts from link and content-based methods. ♦ Content-based metrics as an unsupervised seed selection method. ♦ Random-walk algorithm to compute two scores for each web page: spam and not-spam likelihood. ♦ Future work: ♦ Including new content-based heuristics. ♦ Improving the spam-biased selection of the seeds, taking into account the links to/from each node. ♦ Content-based metrics to characterize also the edges of the web graph.
  • 23. References
        [1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In AIRWeb ’06: Adversarial Information Retrieval on the Web, 2006.
        [2] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank - fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
        [3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423–430, New York, NY, USA, 2007. ACM.
        [4] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Computing Research Repository, 2010.
        [5] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, January 2005.
        [6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB ’04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York, NY, USA, 2004. ACM.
        [7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. Technical Report 2004-17, Stanford InfoLab, March 2004.
        [8] T. H. Haveliwala. Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. Technical Report 2003-29, 2003.
        [9] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA, 2002. ACM.
        [10] P. Kolari, T. Finin, and A. Joshi. Svms for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2006.
        [11] V. Krishnan. Web spam detection with anti-trustrank. In ACM SIGIR workshop on Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006.
        [12] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.
        [13] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web, 1999.
        [14] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.
  • 24. Thanks for your attention!! Questions? F. Javier Ortega Craig Macdonald javierortega@us.es craigm@dcs.gla.ac.uk José A. Troyano Fermín Cruz troyano@us.es fcruz@us.es