Link Analysis for Web Information Retrieval
With Applications to Adversarial IR

Carlos Castillo (1)
chato@yahoo-inc.com

With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)

1. Yahoo! Research Barcelona – Catalunya, Spain
2. Università di Roma “La Sapienza” – Rome, Italy
3. Yahoo! Research Santiago – Chile
4. ISTI-CNR – Pisa, Italy
5. Università degli Studi di Milano – Milan, Italy

When you have a hammer

Everything looks like a graph!

1 Hypothesis
2 Levels of link analysis
3 Ranking
4 Web spam
5 ... detection
6 ... links
7 ... contents
8 ... both
9 Summary

Links are not placed at random

    Topical locality hypothesis
    Link endorsement hypothesis

Topical locality hypothesis

“We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random or other nearby pages.” [Davison, 2000]

[Figure: average text similarity (vertical axis, 0.2 to 0.7) as a function of link distance (horizontal axis, 1 to 5)]

[Baeza-Yates et al., 2006], data from UK 2006

Link similarity cases

    Link (geodesic) distance
    Co-citation
    Bibliographic coupling

Co-citation

Bibliographic coupling

(Both can be generalized)

(Both co-citation and bibliographic coupling can be generalized. E.g.: SimRank [Jeh and Widom, 2002] generalizes the idea of co-citation to several levels.)

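As a minimal sketch of the two link similarity measures named above, the following Python counts co-citation (pages linking to both of two pages) and bibliographic coupling (pages that both of them link to) on a small directed edge list. The function name `link_similarity` and the toy graph are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def link_similarity(edges):
    """Return (cocitation, coupling) count functions for a directed edge list."""
    in_links = defaultdict(set)   # page -> set of pages linking to it
    out_links = defaultdict(set)  # page -> set of pages it links to
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    def cocitation(a, b):
        # Number of pages that link to both a and b.
        return len(in_links[a] & in_links[b])

    def coupling(a, b):
        # Number of pages that both a and b link to.
        return len(out_links[a] & out_links[b])

    return cocitation, coupling

# Toy graph (hypothetical): p and q both cite a and b; a and b both cite x.
edges = [("p", "a"), ("p", "b"), ("q", "a"), ("q", "b"),
         ("a", "x"), ("b", "x"), ("b", "y")]
cocitation, coupling = link_similarity(edges)
print(cocitation("a", "b"))  # 2
print(coupling("a", "b"))    # 1
```

Raw counts are used here for simplicity; a normalized variant (for example dividing by the size of the union) would be a separate design choice.
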
Link endorsement hypothesis

Links are assumed to be endorsements (votes, positive opinions) [Li, 1998]

But they can represent:
    Disagreement
    Self citations
    Nepotism
    Citations to methodological documents
    etc.

Furthermore

    They measure quantity, not quality (e.g.: “Stop the numbers game!” in Communications of the ACM a few months ago)
    Self-citations are frequent
    In some topics there is more linking
    Citations go from newer to older
    New documents get few citations [Baeza-Yates et al., 2002]
    Many of the citations are irrelevant

Nevertheless

Both the topical locality hypothesis and the link endorsement hypothesis are meaningful on the Web

Analogy with economics

Think of the hypotheses requiring many buyers/sellers, zero transaction costs, perfect information, etc. in economics

How to find meaningful patterns?

Several levels of analysis:
    Macroscopic view: overall structure
    Microscopic view: nodes
    Mesoscopic view: regions

Macroscopic view, e.g. Bow-tie

[Broder et al., 2000]

Macroscopic view, e.g. Jellyfish

[Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology

Microscopic view, e.g. Degree

[Barabási, 2002] and others

“While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barabási, 2002]

Other scale-free networks

    Power grid designs
    Sexual partners in humans
    Collaboration of movie actors in films
    Citations in scientific publications
    Protein interactions

Microscopic view, e.g. Degree

[Figure: four panels: Greece, Chile, Spain, Korea]

[Baeza-Yates et al., 2007] - compares this distribution in 8 countries ... guess what the result is?

Mesoscopic view, e.g. Hop-plot

[Figure: four panels plotting frequency (0.0 to 0.3) against distance (5 to 30): .it (40M pages), .uk (18M pages), .eu.int (800K pages), synthetic graph (100K pages)]

[Baeza-Yates et al., 2006]

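The distance histograms in the hop-plot slide can be approximated on a small graph with one breadth-first search per source node. This is only a sketch under stated assumptions: the function name `hop_plot`, the adjacency-dict input format, and the toy graph are hypothetical, and computing this exactly on a crawl with millions of pages would require sampling or approximation that the slides do not describe.

```python
from collections import Counter, deque

def hop_plot(out_links):
    """Fraction of reachable ordered pairs (u, v), u != v, at each shortest-path distance."""
    counts = Counter()
    for source in out_links:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in out_links.get(u, ()):
                if v not in dist:
                    # First visit in BFS order gives the shortest distance.
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: c / total for d, c in sorted(counts.items())}

# Toy directed graph (hypothetical).
graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
freq = hop_plot(graph)
print(freq)
```
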
Models

    Preferential attachment
    Copy model
    Hybrid models

Preferential attachment

“A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected.” [Barabási and Albert, 1999]

“rich get richer”

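The two mechanisms in the quotation (growth plus degree-proportional attachment) can be sketched in a few lines of Python. This is a simplified illustration with one edge per new vertex; the function name `preferential_attachment`, the seed, and the graph size are assumptions chosen for the example, not parameters from the cited paper.

```python
import random

def preferential_attachment(n, seed=0):
    """Grow a graph one vertex at a time; each new vertex links to one existing
    vertex chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over vertices.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

edges = preferential_attachment(10_000)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# Compare the best-connected vertex with a typical one ("rich get richer").
ranked = sorted(degree.values())
print("max degree:", ranked[-1], "median degree:", ranked[len(ranked) // 2])
```
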
Counting in-links does not work

“With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit seeking ventures, attention getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation” [Page et al., 1998]

PageRank: simplified version

\[
\mathrm{PageRank}'(u) = \sum_{v \in \Gamma^{-}(u)} \frac{\mathrm{PageRank}'(v)}{|\Gamma^{+}(v)|}
\]

Γ−(·): in-links
Γ+(·): out-links

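A minimal sketch of iterating the simplified formula: every page repeatedly splits its current score evenly among its out-links. The function name, the uniform starting scores, the fixed iteration count, and the three-page graph are assumptions for illustration; the sketch also assumes every link target is itself a key of the dictionary and that every page has at least one out-link.

```python
def simplified_pagerank(out_links, iterations=50):
    """Iterate PageRank'(u) = sum over in-links v of PageRank'(v) / outdegree(v)."""
    nodes = list(out_links)
    score = {u: 1.0 / len(nodes) for u in nodes}  # assumed uniform start
    for _ in range(iterations):
        new = {u: 0.0 for u in nodes}
        for v in nodes:
            for u in out_links[v]:
                # v passes an equal share of its score to each page it links to.
                new[u] += score[v] / len(out_links[v])
        score = new
    return score

# Toy graph (hypothetical): a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = simplified_pagerank(graph)
print({u: round(s, 3) for u, s in scores.items()})  # {'a': 0.4, 'b': 0.2, 'c': 0.4}
```
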
Iterations with pseudo-PageRank

So far, so good, but ...

    The Web includes many pages with no out-links; these will accumulate all of the score
    We would like Web pages to accumulate ranking
    We add random jumps (teleportation)

PageRank

\[
\mathrm{PageRank}(u) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{v \in \Gamma^{-}(u)} \frac{\mathrm{PageRank}(v)}{|\Gamma^{+}(v)|}
\]

Γ−(·): in-links
Γ+(·): out-links
ε/N: jump to a random page with probability ε ≈ 0.15

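The formula with random jumps can be sketched the same way, adding the ε/N term to every page on each iteration. The function name, iteration count, and example graph are assumptions; the code applies the formula literally, so on a graph containing pages without out-links the scores would sum to less than 1 unless that lost mass is redistributed, a detail this sketch does not handle.

```python
def pagerank(out_links, epsilon=0.15, iterations=100):
    """Iterate PageRank(u) = epsilon/N + (1 - epsilon) * sum_v PageRank(v) / outdegree(v)."""
    nodes = list(out_links)
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}  # assumed uniform start
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new = {u: epsilon / n for u in nodes}
        for v in nodes:
            for u in out_links[v]:
                new[u] += (1 - epsilon) * score[v] / len(out_links[v])
        score = new
    return score

# Toy graph (hypothetical); page d has no in-links, so it keeps only epsilon/N.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for page, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(value, 4))
```
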
HITS

Two scores per page: “hub score” and “authority score”.

Iterations

Initialize:

\[
\mathrm{hub}(u, 0) = \mathrm{auth}(u, 0) = 1
\]

Iterate:

\[
\mathrm{hub}(u, t) = \sum_{v \in \Gamma^{+}(u)} \frac{\mathrm{auth}(v, t-1)}{|\Gamma^{-}(v)|}
\]

\[
\mathrm{auth}(u, t) = \sum_{v \in \Gamma^{-}(u)} \frac{\mathrm{hub}(v, t-1)}{|\Gamma^{+}(v)|}
\]

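The hub and authority updates can be sketched as follows. This implements the degree-normalized form written on the slide, which differs from the unnormalized sums of the original HITS formulation. It starts every score at 1, since an all-zero start would never move under these update rules. The function name, iteration count, and toy graph are assumptions for illustration.

```python
def hits_normalized(out_links, iterations=20):
    """Degree-normalized hub/authority iteration; both updates read the previous step."""
    nodes = list(out_links)
    in_links = {u: [] for u in nodes}
    for v in nodes:
        for u in out_links[v]:
            in_links[u].append(v)

    hub = {u: 1.0 for u in nodes}   # assumed positive start
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # hub(u) sums auth(v) / indegree(v) over the pages v that u links to.
        new_hub = {u: sum(auth[v] / len(in_links[v]) for v in out_links[u])
                   for u in nodes}
        # auth(u) sums hub(v) / outdegree(v) over the pages v that link to u.
        new_auth = {u: sum(hub[v] / len(out_links[v]) for v in in_links[u])
                    for u in nodes}
        hub, auth = new_hub, new_auth
    return hub, auth

# Toy graph (hypothetical): two hub-like pages pointing at two authority-like pages.
graph = {"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []}
hub, auth = hits_normalized(graph)
print("hubs:", hub)
print("authorities:", auth)
```
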
What is on the Web?

What is on the Web [2.0]?

What else is on the Web?

“The sum of all human knowledge plus porn” – Robert Gilbert

What’s happening on the Web?

There is a fierce competition for your attention

Search engines are to some extent arbiters of this competition and they must watch it closely, otherwise ...

Some cheating occurs

1986 FIFA World Cup, Argentina vs England

Simple web spam

Hidden text

Made for advertising

Search engine?

Fake search engine

“Normal” content in link farms

Cloaking

Redirection

Redirects using Javascript

Simple redirect

<script>
document.location="http://www.topsearch10.com/";
</script>

“Hidden” redirect

<script>
var1=24; var2=var1;
if(var1==var2) {
  document.location="http://www.topsearch10.com/";
}
</script>

                    Problem: obfuscated code

                    Obfuscated redirect

                    <script>
                    var a1="win",a2="dow",a3="loca",a4="tion.",
                    a5="replace",a6="('http://www.top10search.com/')";
                    var i,str="";
                    for(i=1;i<=6;i++)
                    {
                      str += eval("a"+i);
                    }
                    eval(str);
                    </script>

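Why obfuscation is a problem can be illustrated with a naive static check. The sketch below is our own illustration, not a method from the slides: `looks_like_redirect` is a hypothetical helper that searches script text for a literal assignment to (or `replace()` call on) `document.location` or `window.location`. Under that assumption it flags the simple and "hidden" redirects, but misses a redirect whose statement is assembled from string fragments and passed to `eval`.

```python
import re

# Hypothetical pattern: a literal assignment to, or replace() call on,
# document.location / window.location. "=(?!=)" skips "==" comparisons.
REDIRECT_RE = re.compile(
    r"\b(?:document|window)\.location(?:\.href)?\s*(?:=(?!=)|\.replace\s*\()"
)


def looks_like_redirect(script_text: str) -> bool:
    """Return True if the script text contains a literal location redirect."""
    return REDIRECT_RE.search(script_text) is not None


simple = 'document.location="http://www.topsearch10.com/";'
hidden = (
    "var1=24; var2=var1; "
    'if(var1==var2) { document.location="http://www.topsearch10.com/"; }'
)
# The redirect statement only exists after the fragments are concatenated
# at run time, so the literal pattern never appears in the source text.
obfuscated = (
    'var a1="win",a2="dow",a3="loca",a4="tion."; '
    "var str=a1+a2+a3+a4; eval(str);"
)

for name, text in [("simple", simple), ("hidden", hidden), ("obfuscated", obfuscated)]:
    print(name, looks_like_redirect(text))
```

A detector that wants to catch the third case would have to execute or emulate the script rather than match its text.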
                    Problem: really obfuscated code

                    Encoded javascript

                    <script>
                    var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";
                    var e = '', i;
                    eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));
                    </script>

                    More examples: [Chellapilla and Maykov, 2007]

                    There are many attempts at cheating on the Web

                    Most of these are spam:
                        1,630,000 results for “free mp3 hilton viagra” in SE1
                        1,760,000 results for “credit vicodin loan” in SE2
                        1,320,000 results for “porn mortgage” in SE3

                    Costs

                    Costs:
                        Costs for users: lower precision for some queries
                        Costs for search engines: wasted storage space, network resources, and processing cycles
                        Costs for the publishers: resources invested in cheating and not in improving their contents

                    Adversarial IR Issues on the Web

                        Link spam
                        Content spam
                        Cloaking
                        Comment/forum/wiki spam
                        Spam-oriented blogging
                        Click fraud ×2
                        Reverse engineering of ranking algorithms
                        Web content filtering
                        Advertisement blocking
                        Stealth crawling
                        Malicious tagging
                        ... more?

                    Opportunities for Web spam

                        Spamdexing
                             Keyword stuffing
                             Link farms
                             Spam blogs (splogs)
                             Cloaking

                    Adversarial relationship

                    Every undeserved gain in ranking for a spammer is a loss of
                    precision for the search engine.

                    Motivation

                    [Fetterly et al., 2004] hypothesized that studying the
                    distribution of statistics about pages could be a good way of
                    detecting spam pages:

                    “in a number of these distributions, outlier values are
                    associated with web spam”

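The outlier idea can be made concrete with a small sketch. The statistic (out-links per page), the sample values, and the use of a median-based "modified z-score" with the conventional 0.6745 scaling and 3.5 cutoff are all assumptions chosen for illustration; the cited work does not prescribe this particular test.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a few extreme values do not distort the scale estimate.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread: nothing can be called an outlier
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical per-page statistic: number of out-links on each page.
out_links = [12, 9, 15, 11, 10, 14, 13, 950, 8, 12]
suspects = flag_outliers(out_links)
print(suspects)
```

Pages flagged this way are only candidates for inspection: an outlier value is associated with spam, not proof of it.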
                    Machine Learning
                    Training of a Decision Tree
                    Decision Tree (error = 15%)
                    Decision Tree (error = 15% → 12%)
                    Machine Learning (cont.)
                    Feature Extraction
                    Challenges: Machine Learning

                    Machine Learning Challenges:
                        Instances are not really independent (graph)
                        Learning with few examples
                        Scalability

                    Challenges: Information Retrieval

                    Information Retrieval Challenges:
                        Feature extraction: which features?
                        Feature aggregation: page/host/domain
                        Feature propagation (graph)
                        Recall/precision tradeoffs
                        Scalability

                    Challenges: Data

                        Data is difficult to collect
                        Data is expensive to label
                        Labels are sparse
                        Humans do not always agree

                    Agreement
                    Results

                    Labels

                        Label               Frequency   Percentage
                        Normal                  4,046       61.75%
                        Borderline                709       10.82%
                        Spam                    1,447       22.08%
                        Can not classify          350        5.34%

                    Agreement

                        Category     Kappa   Interpretation
                        normal       0.62    Substantial agreement
                        spam         0.63    Substantial agreement
                        borderline   0.11    Slight agreement
                        global       0.56    Moderate agreement

                    Reference collection [Castillo et al., 2006]

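How a kappa value and its verbal interpretation fit together can be sketched as follows. The slides do not say which kappa variant was used for the reference collection, so the two-rater Cohen's kappa below is an assumption made for illustration, as are the function names and the toy labels. The interpretation bands are the commonly used Landis and Koch scale, which matches the labels in the agreement table.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Raises ZeroDivisionError if chance agreement is exactly 1
    (both raters use a single identical label for every item).
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Map a kappa value to the Landis and Koch verbal scale."""
    if kappa < 0:
        return "Poor agreement"
    bands = [(0.20, "Slight"), (0.40, "Fair"),
             (0.60, "Moderate"), (0.80, "Substantial")]
    for upper, name in bands:
        if kappa <= upper:
            return f"{name} agreement"
    return "Almost perfect agreement"


# Toy example with two hypothetical raters.
rater_a = ["spam", "normal", "normal", "spam"]
rater_b = ["spam", "normal", "spam", "spam"]
k = cohen_kappa(rater_a, rater_b)
print(round(k, 2), interpret(k))
```

With more than two raters per item, a multi-rater statistic such as Fleiss' kappa would be the usual choice.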
                    Topological spam: link farms

                    Single-level farms can be detected by searching for groups of
                    nodes sharing their out-links [Gibson et al., 2005]

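A deliberately simplified sketch of "groups of nodes sharing their out-links": it groups nodes whose out-link sets are exactly identical. This is not the algorithm of the cited work, which has to cope with approximate overlap on very large graphs; the function name, thresholds, and toy graph are assumptions for illustration.

```python
from collections import defaultdict


def groups_sharing_out_links(out_links, min_group=3, min_links=2):
    """Group nodes that point to exactly the same set of targets.

    out_links: dict mapping each node to a list of the nodes it links to.
    Only groups with at least min_group members, each with at least
    min_links distinct targets, are reported.
    """
    by_target_set = defaultdict(list)
    for node, targets in out_links.items():
        key = frozenset(targets)
        if len(key) >= min_links:
            by_target_set[key].append(node)
    return [
        sorted(nodes)
        for nodes in by_target_set.values()
        if len(nodes) >= min_group
    ]


# Toy graph: a, b and c all point to the same two targets.
graph = {
    "a": ["t1", "t2"],
    "b": ["t2", "t1"],
    "c": ["t1", "t2"],
    "d": ["x"],
    "e": ["t1", "y"],
}
farms = groups_sharing_out_links(graph)
print(farms)
```

Exact matching is easy to evade by varying one link per page, which is why practical detectors compare out-link sets approximately.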
                    Handling large graphs

                    For large graphs, random access is not possible.

                    Large graphs do not fit in main memory

                    Streaming model of computation

                    Semi-streaming model

                        Memory size enough to hold some data per-node
                        Disk size enough to hold some data per-edge
                        A small number of passes over the data

                    Restriction

                    Semi-streaming model: graph on disk

                    for node : 1 ... N do
                        INITIALIZE-MEM(node)
                    end for
                    for distance : 1 ... d do {Iteration step}
                        for src : 1 ... N do {Follow links in the graph}
                            for all links from src to dest do
                                COMPUTE(src,dest)
                            end for
                        end for
                        NORMALIZE
                    end for
                    POST-PROCESS
                    return Something

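The skeleton above can be instantiated in many ways. As one hedged example, the sketch below fills it in with PageRank: per-node arrays live in memory, while the edge list stays in a file that is re-read sequentially on every pass. The file format (one `src dest` pair per line), the function names, the extra pass that counts out-degrees, and the parameter values are all assumptions for illustration.

```python
import os
import tempfile


def stream_edges(path):
    """Yield (src, dest) pairs by reading the edge file sequentially."""
    with open(path) as f:
        for line in f:
            src, dest = map(int, line.split())
            yield src, dest


def semi_streaming_pagerank(path, n, passes=20, alpha=0.85):
    # INITIALIZE-MEM: per-node data only, held in main memory.
    rank = [1.0 / n] * n
    out_deg = [0] * n
    for src, _ in stream_edges(path):  # extra pass to count out-degrees
        out_deg[src] += 1

    for _ in range(passes):  # iteration step
        nxt = [0.0] * n
        for src, dest in stream_edges(path):  # follow links in the graph
            nxt[dest] += rank[src] / out_deg[src]  # COMPUTE(src, dest)
        # NORMALIZE: spread the mass of dangling nodes, add random jumps.
        dangling = sum(rank[v] for v in range(n) if out_deg[v] == 0)
        rank = [
            alpha * (nxt[v] + dangling / n) + (1 - alpha) / n
            for v in range(n)
        ]
    return rank


# Toy graph written to a temporary file standing in for "graph on disk".
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0 1\n1 2\n2 0\n3 0\n")
ranks = semi_streaming_pagerank(tmp.name, n=4)
os.remove(tmp.name)
print([round(r, 3) for r in ranks])
```

Memory use grows with the number of nodes only, and the edge file is never accessed randomly, which is the restriction the semi-streaming model imposes.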
                    Link-Based Features

                        Degree-related measures
                        PageRank
                        TrustRank [Gyöngyi et al., 2004]
                        Truncated PageRank [Becchetti et al., 2006]
                        Estimation of supporters [Becchetti et al., 2006]

                    140 features per host (2 pages per host)

                    Degree-Based

                    [Figure: two histograms comparing the distribution of degree-based features for Normal and Spam, with logarithmic x-axes]

                    TrustRank

                    TrustRank [Gyöngyi et al., 2004]
                    A node with high PageRank, but far away from a core set of
                    “trusted nodes” is suspicious

                    Start from a set of trusted nodes, then do a random walk,
                    returning to the set of trusted nodes with probability 1 − α at
                    each step

                    Trusted nodes: data from http://www.dmoz.org/

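The random walk described above can be sketched as a PageRank computation whose jumps always land on the trusted set. This is a minimal illustration under stated assumptions, not the reference implementation of the cited paper: the function name, the uniform distribution over trusted nodes, the handling of nodes without out-links, and the toy graph are our own choices.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """Trust scores from a random walk that restarts at trusted nodes.

    out_links: dict mapping every node to the list of nodes it links to
               (every link target must also appear as a key).
    trusted:   set of nodes assumed to be non-spam.
    With probability alpha the walker follows a link; with probability
    1 - alpha it returns to a uniformly chosen trusted node.
    """
    nodes = sorted(out_links)
    seed = {v: (1.0 / len(trusted) if v in trusted else 0.0) for v in nodes}
    score = dict(seed)
    for _ in range(iterations):
        nxt = {v: (1 - alpha) * seed[v] for v in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = alpha * score[src] / len(targets)
                for dest in targets:
                    nxt[dest] += share
            else:
                # Assumption: a walker stuck on a node without out-links
                # also returns to the trusted set.
                for v in trusted:
                    nxt[v] += alpha * score[src] / len(trusted)
        score = nxt
    return score


# Toy graph: the spam pair is not reachable from the trusted seed.
web = {
    "seed": ["good"],
    "good": ["seed"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
trust = trustrank(web, trusted={"seed"})
print({node: round(value, 3) for node, value in trust.items()})
```

Nodes that cannot be reached from the trusted set keep a score of zero, so a node whose PageRank is high while its trust score is low is the suspicious case the slide describes.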
                    TrustRank Idea
                    TrustRank / PageRank

                    [Figure: histogram of TrustRank / PageRank for Normal and Spam, with a logarithmic x-axis from 0.4 to 9e+03]

                    High and low-ranked pages are different

                    [Figure: Number of Nodes (×10^4) against Distance (1 to 20), with one curve each for Top 0%−10%, Top 40%−50% and Top 60%−70%]

                    Areas below the curves are equal if we are in the same
                    strongly-connected component


Probabilistic counting
[Figure: bit vectors are propagated toward the target page using the “OR” operation; count bits set to estimate supporters]
[Becchetti et al., 2006] shows an improvement of the ANF
algorithm [Palmer et al., 2002] based on probabilistic
counting [Flajolet and Martin, 1985]
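
A rough sketch of the bit-propagation idea follows. It uses the classic Flajolet–Martin estimator (position of the lowest unset bit, as in ANF-style counting), not the improved bit-counting estimator of [Becchetti et al., 2006]. The function names, the number of masks, the mask width and the toy graph are assumptions made for illustration.

```python
import random

PHI = 0.77351  # Flajolet-Martin correction constant


def fm_bit(rng, width):
    """Pick bit i with probability 2^-(i+1), capped at width - 1."""
    i = 0
    while i < width - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i


def lowest_zero_bit(mask):
    """Index of the lowest bit that is not set in mask."""
    i = 0
    while mask & (1 << i):
        i += 1
    return i


def estimate_supporters(in_links, distance, num_masks=64, width=32, seed=0):
    """Estimate, for every node, how many nodes (itself included) reach it
    within `distance` links.

    in_links: dict mapping every node to the list of nodes linking to it.
    """
    rng = random.Random(seed)
    masks = {n: [fm_bit(rng, width) for _ in range(num_masks)] for n in in_links}
    for _ in range(distance):
        new_masks = {}
        for n, sources in in_links.items():
            merged = list(masks[n])
            for s in sources:
                # Propagate bits along the link s -> n with the OR operation.
                merged = [a | b for a, b in zip(merged, masks[s])]
            new_masks[n] = merged
        masks = new_masks
    estimates = {}
    for n, node_masks in masks.items():
        mean_r = sum(lowest_zero_bit(m) for m in node_masks) / len(node_masks)
        estimates[n] = (2 ** mean_r) / PHI
    return estimates


# Toy graph: "t" is the target page; "a" and "b" link to it, "c" links to "a".
in_links = {"t": ["a", "b"], "a": ["c"], "b": [], "c": []}
supporters_at_2 = estimate_supporters(in_links, distance=2)
```

The estimate is noisy for very small neighborhoods; the point of the technique is that each node only needs a few machine words, regardless of how many supporters it has.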

Bottleneck number
$b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$. Minimum rate of growth
of the neighbors of x up to a certain distance. We expect that
spam pages form clusters that are somehow isolated from the
rest of the Web graph and they have smaller bottleneck
numbers than non-spam pages.
[Figure: distribution of the bottleneck number for Normal vs. Spam; x-axis from 1.11 to 4.52, y-axis from 0.00 to 0.40]
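
The bottleneck number can be computed exactly on a small graph with a breadth-first search. In this sketch N_j(x) is assumed to be the set of pages that can reach x in at most j links, with N_0(x) = {x}; the slides do not spell out this convention, and the function names and toy graphs are illustrative.

```python
def neighborhood_sizes(in_links, x, d):
    """Return [|N_0(x)|, ..., |N_d(x)|].

    Assumption: N_j(x) is the set of nodes that can reach x in at most
    j links, so N_0(x) = {x}.
    """
    seen = {x}
    frontier = {x}
    sizes = [1]
    for _ in range(d):
        frontier = {
            s for n in frontier for s in in_links.get(n, ()) if s not in seen
        }
        seen |= frontier
        sizes.append(len(seen))
    return sizes


def bottleneck_number(in_links, x, d):
    """Minimum growth rate |N_j(x)| / |N_(j-1)(x)| over 1 <= j <= d."""
    sizes = neighborhood_sizes(in_links, x, d)
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))


# A page whose neighborhood keeps growing ...
open_graph = {"x": ["a", "b"], "a": ["c", "d", "e", "f"], "b": []}
# ... versus a small isolated cluster that stops growing after one step.
closed_graph = {"s": ["t", "u"], "t": ["s", "u"], "u": ["s", "t"]}

open_b = bottleneck_number(open_graph, "x", 2)      # sizes 1, 3, 7
closed_b = bottleneck_number(closed_graph, "s", 2)  # sizes 1, 3, 3
```

The isolated cluster reaches the smallest possible value, which is the behavior the slide expects from spam pages.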

Content-Based Features
Most of these reported in [Ntoulas et al., 2006]:
    Number of words in the page and title
    Average word length
    Fraction of anchor text
    Fraction of visible text
    Compression rate
From [Castillo et al., 2007]:
    Corpus precision and corpus recall
    Query precision and query recall
    Independent trigram likelihood
    Entropy of trigrams
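
A few of these features can be sketched in a handful of lines. The exact definitions below are assumptions rather than the cited papers' specifications: compression rate is taken as uncompressed size divided by zlib-compressed size, trigram entropy as the Shannon entropy of word trigrams, and corpus precision and recall as overlap with a given set of popular corpus terms (here a hypothetical `popular_terms` set).

```python
import math
import zlib
from collections import Counter


def average_word_length(words):
    return sum(len(w) for w in words) / len(words)


def compression_ratio(text):
    """Assumed definition: uncompressed bytes / compressed bytes."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def trigram_entropy(words):
    """Shannon entropy (in bits) of the distribution of word trigrams."""
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    return -sum((c / total) * math.log2(c / total) for c in trigrams.values())


def corpus_precision_recall(words, popular_terms):
    """Assumed definitions:
    precision = fraction of the page's words that are popular terms,
    recall    = fraction of the popular terms that appear in the page.
    """
    precision = sum(1 for w in words if w in popular_terms) / len(words)
    recall = len(set(words) & popular_terms) / len(popular_terms)
    return precision, recall


repeated = "buy cheap pills buy cheap pills buy cheap pills".split()
varied = "the quick brown fox jumps over a lazy dog".split()
popular_terms = {"cheap", "pills", "free", "now"}  # hypothetical top-k terms

precision, recall = corpus_precision_recall(repeated, popular_terms)
```

Repetitive, keyword-stuffed text compresses well and has low trigram entropy, which is why such features help separate spam from non-spam pages.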

Average word length
Figure: Histogram of the average word length in non-spam vs.
spam pages for k = 500.

Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam
pages.

Query precision
Figure: Histogram of the query precision in non-spam vs. spam
pages for k = 500.

General hypothesis
Pages topologically close to each other are more likely to have
the same label (spam/nonspam) than random pairs of pages

Ideas for exploiting this: clustering, propagation, stacked
learning

[Figure from [Castillo et al., 2007]]

Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
    0 = no in-link comes from spam hosts
    1 = all of the in-links come from spam hosts
[Figure: histogram with two series, in-links of non-spam and in-links of spam; x-axis from 0.0 to 1.0, y-axis from 0 to 0.4]

Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
    0 = none of the out-links points to spam hosts
    1 = all of the out-links point to spam hosts
[Figure: histogram with two series, out-links of non-spam and out-links of spam; x-axis from 0.0 to 1.0, y-axis from 0 to 1]
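
The quantity behind both histograms (in-links and out-links) is the fraction of a host's neighbors that are labeled spam. A minimal sketch, assuming a hypothetical edge list and a hypothetical `is_spam` label dictionary:

```python
from collections import defaultdict


def spam_fraction(neighbors, is_spam):
    """Fraction of labeled neighbors that are spam; None if there are none."""
    labeled = [n for n in neighbors if n in is_spam]
    if not labeled:
        return None
    return sum(1 for n in labeled if is_spam[n]) / len(labeled)


# Hypothetical host graph as (source, target) pairs, with known labels.
edges = [("a", "b"), ("s1", "s2"), ("s2", "s1"), ("s1", "b"), ("a", "c")]
is_spam = {"a": False, "b": False, "c": False, "s1": True, "s2": True}

in_links = defaultdict(list)
out_links = defaultdict(list)
for source, target in edges:
    out_links[source].append(target)
    in_links[target].append(source)

in_fraction = {h: spam_fraction(in_links[h], is_spam) for h in is_spam}
out_fraction = {h: spam_fraction(out_links[h], is_spam) for h in is_spam}
```

Plotting `in_fraction` and `out_fraction` separately for spam and non-spam hosts would give histograms of the kind described above.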

Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all
hosts in the same cluster by majority voting
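
The relabeling step can be sketched as follows, assuming the initial predictions and the cluster assignment are already available. The function name and input format are illustrative, and breaking ties by the label seen first is an arbitrary assumption.

```python
from collections import Counter, defaultdict


def relabel_by_cluster(predicted, cluster_of):
    """Assign to every host the majority label of its cluster.

    predicted:  dict host -> "spam" or "nonspam" (initial classifier output).
    cluster_of: dict host -> cluster id (from any graph clustering algorithm).
    Ties go to the label encountered first in the cluster (an assumption).
    """
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return {host: majority[cluster_of[host]] for host in predicted}


predicted = {
    "h1": "spam", "h2": "spam", "h3": "nonspam",  # cluster 0
    "h4": "nonspam", "h5": "nonspam",             # cluster 1
}
cluster_of = {"h1": 0, "h2": 0, "h3": 0, "h4": 1, "h5": 1}
final = relabel_by_cluster(predicted, cluster_of)
```

Here host `h3` is relabeled as spam because the other two hosts in its cluster were predicted as spam.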

Idea 1: Clustering (cont.)
Initial prediction: [figure]
Clustering: [figure]
Final prediction: [figure]

Idea 1: Clustering – Results

                         Baseline   Clustering
Without bagging
  True positive rate      75.6%      74.5%
  False positive rate      8.5%       6.8%
  F-Measure               0.646      0.673
With bagging
  True positive rate      78.7%      76.9%
  False positive rate      5.7%       5.0%
  F-Measure               0.723      0.728

✓ Reduces error rate

Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a
random walk with restart from those nodes
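
One way to read this is as a personalized random walk whose restart distribution is proportional to the predicted spamicity. The sketch below makes several assumptions that are not in the slides: α = 0.85, a fixed number of iterations, the handling of hosts without out-links, and a threshold equal to the uniform score 1/n.

```python
def propagate_spamicity(out_links, spamicity, alpha=0.85, iterations=50):
    """Random walk with restart; restart probability is proportional to
    each host's predicted spamicity (assumed to sum to a positive value).

    out_links: dict mapping every host to the list of hosts it links to.
    """
    nodes = list(out_links)
    total = sum(spamicity[n] for n in nodes)
    restart = {n: spamicity[n] / total for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = alpha * score[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Assumption: hosts without out-links restart the walk.
                for t in nodes:
                    nxt[t] += alpha * score[n] * restart[t]
        score = nxt
    return score


graph = {
    "good": ["good2"],
    "good2": ["good"],
    "spamA": ["spamB"],
    "spamB": ["spamA"],
}
spamicity = {"good": 0.05, "good2": 0.05, "spamA": 0.9, "spamB": 0.2}
scores = propagate_spamicity(graph, spamicity)

threshold = 1.0 / len(graph)  # hypothetical threshold: the uniform score
final = {host: score > threshold for host, score in scores.items()}
```

In this toy example `spamB` starts with a low spamicity but ends above the threshold because it is tightly linked with `spamA`. Running the same function on the reversed graph would propagate in the opposite direction.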

Idea 2: Propagate the label (cont.)
Initial prediction: [figure]
Propagation: [figure]
Final prediction, applying a threshold: [figure]

Idea 2: Propagate the label – Results

                         Baseline   Fwds.   Backwds.   Both
Classifier without bagging
  True positive rate      75.6%     70.9%    69.4%     71.4%
  False positive rate      8.5%      6.1%     5.8%      5.8%
  F-Measure               0.646     0.665    0.664     0.676
Classifier with bagging
  True positive rate      78.7%     76.5%    75.0%     75.2%
  False positive rate      5.7%      5.4%     4.3%      4.7%
  F-Measure               0.723     0.716    0.733     0.724

Idea 3: Stacked graphical learning
    Meta-learning scheme [Cohen and Kou, 2006]
    Derive initial predictions
    Generate an additional attribute for each object by
    combining predictions on neighbors in the graph
    Append the additional attribute to the data and retrain

Idea 3: Stacked graphical learning (cont.)
    Let $p(x) \in [0, 1]$ be the prediction of a classification
    algorithm for a host x using k features
    Let N(x) be the set of pages related to x (in some way)
    Compute
        $f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$
    Add f(x) as an extra feature for instance x and learn a
    new model with k + 1 features
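
The feature computation can be sketched as follows. The function names and input format are illustrative, and the fallback for hosts with an empty N(x) is an assumption, since the formula is undefined in that case. Retraining on the extended rows is left to whatever classifier produced the initial predictions.

```python
def neighbor_average(predictions, neighbors):
    """f(x) = sum of p(g) over g in N(x), divided by |N(x)|."""
    feature = {}
    for x, p_x in predictions.items():
        related = neighbors.get(x, [])
        if related:
            feature[x] = sum(predictions[g] for g in related) / len(related)
        else:
            # Assumption: with no neighbors, fall back to the host's own p(x).
            feature[x] = p_x
    return feature


def append_feature(rows, feature):
    """Return new rows with f(x) appended, giving k + 1 features per host."""
    return {x: list(values) + [feature[x]] for x, values in rows.items()}


predictions = {"a": 0.9, "b": 0.8, "c": 0.1}   # p(x) from a first classifier
neighbors = {"a": ["b", "c"], "b": ["a"], "c": []}
rows = {"a": [1.0, 2.0], "b": [0.5, 0.1], "c": [0.0, 3.0]}  # k = 2 features

f = neighbor_average(predictions, neighbors)
extended = append_feature(rows, f)  # k + 1 = 3 features per host
```

Repeating the procedure with predictions from the retrained model gives a second pass.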

Idea 3: Stacked graphical learning (cont.)
Initial prediction: [figure]
Computation of new feature: [figure]
New prediction with k + 1 features: [figure]

Idea 3: Stacked graphical learning - Results

                       Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate      78.7%      84.4%        78.3%         85.2%
False positive rate      5.7%       6.7%         4.8%          6.1%
F-Measure               0.723      0.733        0.742         0.750

✓ Increases detection rate

Idea 3: Stacked graphical learning x2
And repeat ...

                       Baseline   First pass   Second pass
True positive rate      78.7%      85.2%        88.4%
False positive rate      5.7%       6.1%         6.3%
F-Measure               0.723      0.750        0.763

✓ Significant improvement over the baseline

Concluding remarks
    Hypothesis: topical locality + link endorsement
    Primitives: similarity, ranking, propagation, etc.
    Application to Web spam

Thank you!

Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-based ranking algorithms.
In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.

Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007).
Characterization of national web domains.
ACM Transactions on Internet Technology, 7(2).

Baeza-Yates, R. and Poblete, B. (2006).
Dynamics of the Chilean web structure.
Comput. Networks, 50(10):1464–1473.

Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002).
Web structure, dynamics and page quality.
In Proceedings of String Processing and Information Retrieval (SPIRE), volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer.

Barabási, A.-L. (2002).
Linked: The New Science of Networks.
Perseus Books Group.

Barabási, A.-L. and Albert, R. (1999).
Emergence of scaling in random networks.
Science, 286(5439):509–512.

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006).
Using rank propagation and probabilistic counting for link-based spam detection.
In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000).
Graph structure in the web: Experiments and models.
In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.

Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. (2006).
A reference collection for web spam.
SIGIR Forum, 40(2):11–24.

Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007).
Know your neighbors: Web spam detection using the web topology.
In Proceedings of SIGIR, Amsterdam, Netherlands. ACM.

Chellapilla, K. and Maykov, A. (2007).
A taxonomy of JavaScript redirection spam.
In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.
Link Analysis for Web IR

  • 1. Link Analysis for Web Information Retrieval, With Applications to Adversarial IR. Carlos Castillo (1), chato@yahoo-inc.com. With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5). Affiliations: 1. Yahoo! Research Barcelona, Catalunya, Spain; 2. Università di Roma “La Sapienza”, Rome, Italy; 3. Yahoo! Research Santiago, Chile; 4. ISTI-CNR, Pisa, Italy; 5. Università degli Studi di Milano, Milan, Italy
  • 2. When you have a hammer
  • 3. Everything looks like a graph!
  • 4. Outline: 1. Hypothesis; 2. Levels of link analysis; 3. Ranking; 4. Web spam; 5. ... detection; 6. ... links; 7. ... contents; 8. ... both; 9. Summary
  • 5. Links are not placed at random: topical locality hypothesis; link endorsement hypothesis
  • 7. Topical locality hypothesis: “We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random or other nearby pages.” [Davison, 2000]
  • 8. [Figure: average text similarity (y-axis, 0.2 to 0.7) versus link distance (x-axis, 1 to 5).] [Baeza-Yates et al., 2006], data from UK 2006
  • 9. Link similarity cases: link (geodesic) distance; co-citation; bibliographic coupling
  • 10. Co-citation
  • 11. Bibliographic coupling
  • 12. Both co-citation and bibliographic coupling can be generalized. E.g., SimRank [Jeh and Widom, 2002] generalizes the idea of co-citation to several levels.
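The two link-similarity measures named on slides 10 and 11 can be sketched in a few lines. This is a minimal illustration using the usual definitions (co-citation: how many pages link to both u and v; bibliographic coupling: how many pages both u and v link to). The edge list and the function names are hypothetical and do not come from the slides.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (out_links, in_links) as dicts of sets for (source, target) edges."""
    out_links, in_links = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links

def co_citation(in_links, u, v):
    """Number of pages that link to both u and v."""
    return len(in_links[u] & in_links[v])

def bibliographic_coupling(out_links, u, v):
    """Number of pages that both u and v link to."""
    return len(out_links[u] & out_links[v])

# Hypothetical toy graph: a and b both cite x and y; c cites only y.
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "y")]
out_links, in_links = build_neighbors(edges)
print(co_citation(in_links, "x", "y"))
print(bibliographic_coupling(out_links, "a", "b"))
```

Both measures only look one link away; SimRank, mentioned on slide 12, applies the same idea recursively over several levels.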
  • 13. Link endorsement hypothesis: links are assumed to be endorsements (votes, positive opinions) [Li, 1998]. But they can represent: disagreement; self-citations; nepotism; citations to methodological documents; etc.
  • 15. Furthermore: they measure quantity, not quality (e.g., “Stop the numbers game!” in Communications of the ACM a few months ago); self-citations are frequent; in some topics there is more linking; citations go from newer to older; new documents get few citations [Baeza-Yates et al., 2002]; many of the citations are irrelevant
  • 21. Nevertheless: both the topical locality hypothesis and the link endorsement hypothesis are meaningful on the Web. Analogy with economics: think of the hypotheses requiring many buyers/sellers, zero transaction costs, perfect information, etc. in economic sciences.
  • 25. How to find meaningful patterns? Several levels of analysis: macroscopic view (overall structure); microscopic view (nodes); mesoscopic view (regions)
  • 28. Macroscopic view, e.g. Bow-tie [Broder et al., 2000]
  • 30. Macroscopic view, e.g. Jellyfish [Tauro et al., 2001]: Internet Autonomous Systems (AS) topology
  • 32. Microscopic view, e.g. Degree [Barabási, 2002] and others
  • 33. “While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barabási, 2002]
  • 34. Other scale-free networks: power grid designs; sexual partners in humans; collaboration of movie actors in films; citations in scientific publications; protein interactions
  • 35. Microscopic view, e.g. Degree. [Figure panels: Greece, Chile, Spain, Korea.] [Baeza-Yates et al., 2007] compares this distribution in 8 countries . . . guess what the result is?
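The degree view on slides 32 to 35 comes down to counting, for each in-degree k, how many pages have exactly k in-links. The following is a minimal sketch of that count. The toy edge list and the function name are hypothetical, and a real crawl would have millions of nodes rather than four.

```python
from collections import Counter

def in_degree_distribution(edges, nodes):
    """Map each in-degree k to the fraction of nodes with exactly k in-links."""
    in_degree = Counter(dst for _, dst in edges)
    histogram = Counter(in_degree.get(n, 0) for n in nodes)
    return {k: count / len(nodes) for k, count in sorted(histogram.items())}

# Hypothetical toy graph given as (source, target) pairs.
edges = [(1, 0), (2, 0), (3, 0), (3, 1), (0, 1)]
nodes = [0, 1, 2, 3]
dist = in_degree_distribution(edges, nodes)
for k, fraction in dist.items():
    print(k, fraction)
```

In a scale-free network, plotting these fractions against k on log-log axes gives an approximately straight line, which is the signature of a power-law distribution.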
  • 36. Mesoscopic view, e.g. Hop-plot
  • 38. Mesoscopic view, e.g. Hop-plot. [Figure: four panels plotting frequency (0.0 to 0.3) against distance (5 to 30): .it (40M pages), .uk (18M pages), .eu.int (800K pages), synthetic graph (100K pages).] [Baeza-Yates et al., 2006]
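The frequency-versus-distance panels on slide 38 can be reproduced in spirit by running a breadth-first search from every node and tallying shortest-path lengths. This is a minimal sketch under that assumption; the slides do not say how the original plots were computed, and exact all-pairs BFS is only practical for small graphs. The function name and toy graph are hypothetical.

```python
from collections import Counter, deque

def hop_plot(out_links):
    """Fraction of reachable ordered pairs (u, v), u != v, at each shortest-path distance."""
    counts = Counter()
    for source in out_links:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in out_links.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())} if total else {}

# Hypothetical toy graph: a directed path 0 -> 1 -> 2 -> 3.
graph = {0: [1], 1: [2], 2: [3], 3: []}
frequencies = hop_plot(graph)
for distance, frequency in frequencies.items():
    print(distance, round(frequency, 3))
```

For graphs of the size shown on the slide, sampling source nodes or using an approximate neighbourhood-counting method would be needed instead of exhaustive BFS.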
  • 40. Models: preferential attachment; copy model; hybrid models
  • 43. Preferential attachment: “A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected.” [Barabási and Albert, 1999] In short: “rich get richer”
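The two mechanisms in the quotation (growth, plus attachment in proportion to current connectivity) can be sketched as a small generator. This is an illustrative simplification, not the exact procedure of the cited paper: the seed clique, the parameter names `n` and `m`, and the rule that each new node picks `m` distinct targets are all assumptions made here.

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes, chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = []
    # Start from a clique of m + 1 nodes so every node has nonzero degree.
    for u in range(m + 1):
        for v in range(u):
            edges.append((u, v))
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [x for edge in edges for x in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

edges = preferential_attachment(n=200, m=2)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(edges), max(degree.values()))
```

Early nodes tend to end up with the highest degree, which is the “rich get richer” effect; the exact numbers depend on the random seed.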
  • 46. Counting in-links does not work: “With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit seeking ventures, attention getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation” [Page et al., 1998]
  • 47. PageRank: simplified version
      $$\mathrm{PageRank}'(u) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}'(v)}{|\Gamma^+(v)|}$$
      where $\Gamma^-(\cdot)$ denotes in-links and $\Gamma^+(\cdot)$ denotes out-links.
  • 48-49. Iterations with pseudo-PageRank
  • 50. So far, so good, but ...
      • The Web includes many pages with no out-links; these will accumulate all of the score.
      • We would like Web pages to accumulate ranking.
      • We add random jumps (teleportation).
  • 51. PageRank
      $$\mathrm{PageRank}(u) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}(v)}{|\Gamma^+(v)|}$$
      where $\Gamma^-(\cdot)$ denotes in-links, $\Gamma^+(\cdot)$ denotes out-links, and $\epsilon/N$ corresponds to a jump to a random page with probability $\epsilon \approx 0.15$.
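The PageRank formula above can be sketched as a simple power iteration. This is a minimal illustration, not the implementation used in the talk: the function name `pagerank`, the dictionary-based graph, the fixed iteration count, and the uniform redistribution of score from pages without out-links are all assumptions made for this example.

```python
def pagerank(out_links, eps=0.15, iterations=50):
    """Iterate PageRank(u) = eps/N + (1 - eps) * sum over in-links v of PageRank(v)/|out(v)|.

    out_links maps every node to the list of nodes it links to;
    every link target must also appear as a key.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Random-jump term: every page receives eps/N.
        new_rank = {u: eps / n for u in nodes}
        dangling = 0.0
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = (1.0 - eps) * rank[v] / len(targets)
                for u in targets:
                    new_rank[u] += share
            else:
                dangling += (1.0 - eps) * rank[v]
        # Assumption (not on the slide): score held by pages with no
        # out-links is spread uniformly so that the total stays at 1.
        for u in nodes:
            new_rank[u] += dangling / n
        rank = new_rank
    return rank
```

With eps set to 0 and no dangling pages, the same loop reduces to the simplified version of the formula.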
  • 52-53. HITS: Two scores per page: “hub score” and “authority score”.
  • 55. Iterations
      Initialize:
      $$\mathrm{hub}(u, 0) = \mathrm{auth}(u, 0) = 0$$
      Iterate:
      $$\mathrm{hub}(u, t) = \sum_{v \in \Gamma^+(u)} \frac{\mathrm{auth}(v, t-1)}{|\Gamma^-(v)|}$$
      $$\mathrm{auth}(u, t) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{hub}(v, t-1)}{|\Gamma^+(v)|}$$
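The hub and authority updates on this slide can be sketched as follows. The function name `hits_normalized` and the graph representation are hypothetical. One deliberate deviation: the sketch starts from a uniform nonzero value (1/N) rather than 0, which is an assumption of this example, because an all-zero start would remain zero under these update rules.

```python
def hits_normalized(out_links, iterations=20):
    """Degree-normalized hub/authority iteration, following the two update rules.

    out_links maps every node to the list of nodes it links to;
    every link target must also appear as a key.
    """
    nodes = sorted(out_links)
    in_links = {u: [] for u in nodes}
    for u in nodes:
        for v in out_links[u]:
            in_links[v].append(u)
    n = len(nodes)
    # Assumption: uniform nonzero initialization (see the lead-in).
    hub = {u: 1.0 / n for u in nodes}
    auth = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # hub(u, t): authority of the pages u points to, each divided by its in-degree.
        new_hub = {u: sum(auth[v] / len(in_links[v]) for v in out_links[u])
                   for u in nodes}
        # auth(u, t): hub score of the pages pointing to u, each divided by its out-degree.
        new_auth = {u: sum(hub[v] / len(out_links[v]) for v in in_links[u])
                    for u in nodes}
        hub, auth = new_hub, new_auth
    return hub, auth
```

Both new score tables are computed from the values at step t−1 before either is overwritten, matching the time indices in the formulas.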
  • 57. What is on the Web?
  • 58. What is on the Web [2.0]?
  • 59. What else is on the Web? “The sum of all human knowledge plus porn” (Robert Gilbert)
  • 60. What’s happening on the Web? There is a fierce competition for your attention
  • 61. What’s happening on the Web? Search engines are to some extent arbiters of this competition and they must watch it closely, otherwise ...
  • 62. Some cheating occurs [Figure: 1986 FIFA World Cup, Argentina vs England]
  • 63. Simple web spam
  • 64. Hidden text
  • 65. Made for advertising
  • 66. Search engine?
  • 67. Fake search engine
  • 68-69. “Normal” content in link farms
  • 70. Cloaking
  • 71. Redirection
  • 72. Redirects using Javascript
      Simple redirect:
        <script>
        document.location="http://www.topsearch10.com/";
        </script>
      “Hidden” redirect:
        <script>
        var1=24;
        var2=var1;
        if(var1==var2) {
          document.location="http://www.topsearch10.com/";
        }
        </script>
  • 73. Problem: obfuscated code
      Obfuscated redirect:
        <script>
        var a1="win",a2="dow",a3="loca",a4="tion.",
        a5="replace",a6="('http://www.top10search.com/')";
        var i,str="";
        for(i=1;i<=6;i++)
        {
          str += eval("a"+i);
        }
        eval(str);
        </script>
  • 74. Problem: really obfuscated code
      Encoded javascript:
        <script>
        var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";
        var e = '', i;
        eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));
        </script>
      More examples: [Chellapilla and Maykov, 2007]
  • 75. There are many attempts at cheating on the Web. Most of these are spam:
      • 1,630,000 results for “free mp3 hilton viagra” in SE1
      • 1,760,000 results for “credit vicodin loan” in SE2
      • 1,320,000 results for “porn mortgage” in SE3
  • 76. Costs
      • Costs for users: lower precision for some queries
      • Costs for search engines: wasted storage space, network resources, and processing cycles
      • Costs for the publishers: resources invested in cheating and not in improving their contents
  • 77-88. Adversarial IR Issues on the Web
      • Link spam
      • Content spam
      • Cloaking
      • Comment/forum/wiki spam
      • Spam-oriented blogging
      • Click fraud ×2
      • Reverse engineering of ranking algorithms
      • Web content filtering
      • Advertisement blocking
      • Stealth crawling
      • Malicious tagging
      • . . . more?
  • 89-90. Opportunities for Web spam
      • Spamdexing: keyword stuffing, link farms, spam blogs (splogs), cloaking
      • Adversarial relationship: every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
  • 92. Motivation: [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages: “in a number of these distributions, outlier values are associated with web spam”
  • 93. Machine Learning
  • 94. Training of a Decision Tree
  • 95. Decision Tree (error = 15%)
  • 96. Decision Tree (error = 15% → 12%)
  • 97. Machine Learning (cont.)
  • 98. Feature Extraction
  • 99-101. Challenges: Machine Learning
      • Instances are not really independent (graph)
      • Learning with few examples
      • Scalability
  • 102-106. Challenges: Information Retrieval
      • Feature extraction: which features?
      • Feature aggregation: page/host/domain
      • Feature propagation (graph)
      • Recall/precision tradeoffs
      • Scalability
  • 107. Challenges: Data
      • Data is difficult to collect
      • Data is expensive to label
      • Labels are sparse
      • Humans do not always agree
  • 108. Agreement
  • 109. Results
      Labels:
        Label              Frequency   Percentage
        Normal                 4,046       61.75%
        Borderline               709       10.82%
        Spam                   1,447       22.08%
        Cannot classify          350        5.34%
      Agreement:
        Category     Kappa   Interpretation
        normal        0.62   Substantial agreement
        spam          0.63   Substantial agreement
        borderline    0.11   Slight agreement
        global        0.56   Moderate agreement
      Reference collection: [Castillo et al., 2006]
  • 111-112. Topological spam: link farms. Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
  • 113-115. Handling large graphs
      • For large graphs, random access is not possible.
      • Large graphs do not fit in main memory
      • Streaming model of computation
  • 116. Semi-streaming model
      • Memory size enough to hold some data per-node
      • Disk size enough to hold some data per-edge
      • A small number of passes over the data
  • 117-119. Restriction. Semi-streaming model: graph on disk
        for node : 1 . . . N do
          INITIALIZE-MEM(node)
        end for
        for distance : 1 . . . d do {Iteration step}
          for src : 1 . . . N do {Follow links in the graph}
            for all links from src to dest do
              COMPUTE(src,dest)
            end for
          end for
          NORMALIZE
        end for
        POST-PROCESS
        return Something
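The semi-streaming template above can be instantiated as in the sketch below, which propagates a score along links for d passes. Everything concrete here is an assumption made for illustration: the name `semi_streaming_scores`, the `read_edges` callable (standing in for a sequential re-read of an edge file on disk, sorted by source), the extra pass to count out-links, and the choice of score propagation as the COMPUTE step.

```python
def semi_streaming_scores(num_nodes, read_edges, d):
    """Keep O(N) per-node state in memory; touch edges only by sequential passes.

    read_edges() must return a fresh iterator over (src, dest) pairs,
    with nodes numbered 0 .. num_nodes - 1.
    """
    # INITIALIZE-MEM: one value per node.
    score = [1.0 / num_nodes] * num_nodes
    out_degree = [0] * num_nodes
    for src, _dest in read_edges():      # one preliminary pass over the edges
        out_degree[src] += 1
    for _ in range(d):                   # iteration step
        incoming = [0.0] * num_nodes
        for src, dest in read_edges():   # follow links in the graph
            # COMPUTE(src, dest): pass an equal share of src's score to dest.
            incoming[dest] += score[src] / out_degree[src]
        # NORMALIZE: rescale so the scores sum to 1.
        total = sum(incoming)
        score = [x / total for x in incoming] if total > 0 else incoming
    # POST-PROCESS is left out of this sketch.
    return score
```

The point of the design is that no step needs random access to the edge list: memory holds only the per-node arrays, and the number of passes over the edges is d + 1.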
  • 120. Link-Based Features
      • Degree-related measures
      • PageRank
      • TrustRank [Gyöngyi et al., 2004]
      • Truncated PageRank [Becchetti et al., 2006]
      • Estimation of supporters [Becchetti et al., 2006]
      140 features per host (2 pages per host)
  • 121. Degree-Based [Figure: two histograms of degree-based measures, Normal vs. Spam]
  • 122-123. TrustRank [Gyöngyi et al., 2004]
      • A node with high PageRank, but far away from a core set of “trusted nodes”, is suspicious
      • Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step
      • Trusted nodes: data from http://www.dmoz.org/
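The random walk described on this slide can be sketched as a PageRank-style iteration whose jumps always land on the trusted set. This is an illustrative sketch only: the function name `trustrank`, the value α = 0.85, the uniform distribution over trusted nodes, and the handling of pages without out-links are assumptions, not details taken from the slide or the cited paper.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """Random walk that returns to the trusted set with probability 1 - alpha.

    out_links maps every node to the list of nodes it links to;
    trusted is a non-empty set of nodes present in out_links.
    """
    nodes = sorted(out_links)
    # Assumption: the walker restarts uniformly over the trusted nodes.
    seed = {u: (1.0 / len(trusted) if u in trusted else 0.0) for u in nodes}
    score = dict(seed)
    for _ in range(iterations):
        new_score = {u: (1.0 - alpha) * seed[u] for u in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = alpha * score[v] / len(targets)
                for u in targets:
                    new_score[u] += share
            else:
                # Assumption: a walker stuck on a page with no out-links
                # also returns to the trusted set.
                for u in trusted:
                    new_score[u] += alpha * score[v] / len(trusted)
        score = new_score
    return score
```

Nodes that cannot be reached from the trusted set keep a score of zero, which is what makes a high PageRank combined with a low TrustRank look suspicious.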
  • 124. TrustRank Idea
  • 125. TrustRank / PageRank [Figure: histogram of TrustRank / PageRank, Normal vs. Spam]
  • 126-127. High and low-ranked pages are different [Figure: number of nodes (×10^4) against distance (1 to 20) for pages in the top 0%-10%, top 40%-50%, and top 60%-70%]. Areas below the curves are equal if we are in the same strongly-connected component
  • 128-129. Probabilistic counting [Figure: bit vectors are propagated along links towards the target page using the “OR” operation; the bits set are counted to estimate supporters]. [Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
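The bit-propagation idea can be sketched as below. This is a simplified illustration and not the estimator of the cited papers: the names `random_bits`, `propagate_bits`, and `bits_to_count`, the parameters k and p, and the closed-form estimate (which assumes each node sets each of its k bits independently with probability p) are all assumptions of this example.

```python
import math
import random

def random_bits(nodes, k=64, p=0.05, seed=0):
    """Give each node a k-bit mask; each bit is set independently with probability p."""
    rng = random.Random(seed)
    return {u: sum(1 << i for i in range(k) if rng.random() < p) for u in nodes}

def propagate_bits(bits, out_links, d):
    """Run d rounds; in each round every page ORs its mask into the pages it links to."""
    for _ in range(d):
        new_bits = dict(bits)
        for src, targets in out_links.items():
            for dest in targets:
                new_bits[dest] |= bits[src]
        bits = new_bits
    return bits

def bits_to_count(ones, k, p):
    """Estimate how many masks were ORed together from the number of bits set.

    With n contributing masks a bit is set with probability 1 - (1 - p)**n,
    so n is estimated as log(1 - ones/k) / log(1 - p).
    """
    if ones >= k:
        return float("inf")  # saturated mask: p was too large for this count
    return math.log(1.0 - ones / k) / math.log(1.0 - p)
```

After d rounds the mask of a page is the OR of the masks of all pages that reach it in at most d steps (including itself), so `bits_to_count(bin(mask).count("1"), k, p)` gives a rough estimate of its number of supporters within distance d.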
  • 130. Bottleneck number: $b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$, the minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and that they have smaller bottleneck numbers than non-spam pages. [Figure: histogram of bottleneck numbers, Normal vs. Spam]
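The bottleneck number can be computed exactly on a small graph with a breadth-first search, as in the sketch below. The slide does not define N_j(x); this example assumes it is the set of nodes that can reach x in at most j steps (following in-links backwards), with N_0(x) = {x}. The function name `bottleneck_number` is hypothetical.

```python
def bottleneck_number(in_links, x, d):
    """Return min over j = 1..d of |N_j(x)| / |N_{j-1}(x)|.

    in_links maps a node to the list of nodes linking to it.
    Assumption: N_j(x) is the set of nodes within j backward steps of x.
    """
    seen = {x}
    frontier = {x}
    sizes = [1]  # |N_0(x)| = 1
    for _ in range(d):
        # Nodes one more backward step away that were not seen before.
        frontier = {v for u in frontier for v in in_links.get(u, ()) if v not in seen}
        seen |= frontier
        sizes.append(len(seen))
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))
```

For example, with `in_links = {"x": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}` the neighborhood sizes are 1, 3, 4, 4, so the ratios are 3, 4/3, and 1. On Web-scale graphs the neighborhood sizes would instead be estimated, for instance by probabilistic counting.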
  • 132. Content-Based Features
      Most of these reported in [Ntoulas et al., 2006]:
      • Number of words in the page and title
      • Average word length
      • Fraction of anchor text
      • Fraction of visible text
      • Compression rate
      From [Castillo et al., 2007]:
      • Corpus precision and corpus recall
      • Query precision and query recall
      • Independent trigram likelihood
      • Entropy of trigrams
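A few of the content-based features listed above can be sketched in a handful of lines. The definitions used here are assumptions chosen for illustration and may differ from those in the cited papers: words are whitespace-separated tokens, the compression feature is the ratio of raw size to zlib-compressed size, and trigram entropy is computed over word trigrams. The function name `content_features` is hypothetical.

```python
import math
import zlib
from collections import Counter

def content_features(text):
    """Compute a few simple content-based features for one page of plain text."""
    words = text.split()
    raw = text.encode("utf-8")
    # Word trigrams and the entropy (in bits) of their empirical distribution.
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = 0.0
    if total:
        entropy = -sum(c / total * math.log2(c / total) for c in trigrams.values())
    return {
        "num_words": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Highly repetitive pages compress well, giving a large ratio.
        "compression_ratio": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
        "trigram_entropy": entropy,
    }
```

A page that repeats the same phrase many times would show a high compression ratio and a low trigram entropy relative to ordinary prose, which is the kind of outlier such features are meant to expose.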
  • 133. Average word length. Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
  • 134. Corpus precision. Figure: Histogram of the corpus precision in non-spam vs. spam pages.
  • 135. Query precision. Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
  • 137. General hypothesis: Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages. Ideas for exploiting this: clustering, propagation, stacked learning
  • 138. [Figure from [Castillo et al., 2007]]
  • 139. Topological dependencies: in-links. Histogram of the fraction of spam hosts in the in-links (0 = no in-link comes from spam hosts; 1 = all of the in-links come from spam hosts). [Figure: series “In-links of non spam” and “In-links of spam”]
  • 140. Topological dependencies: out-links. Histogram of the fraction of spam hosts in the out-links (0 = none of the out-links points to spam hosts; 1 = all of the out-links point to spam hosts). [Figure: series “Out-links of non spam” and “Out-links of spam”]
  • 141. Idea 1: Clustering
      Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting.
  • 142. Idea 1: Clustering (cont.) Initial prediction:
  • 143. Idea 1: Clustering (cont.) Clustering:
  • 144. Idea 1: Clustering (cont.) Final prediction:
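The three steps of Idea 1 (initial prediction, clustering, final prediction) can be sketched as below. The slides do not name the classifier or the clustering algorithm, so both are taken as given inputs here. The function name, the toy data, and the tie-breaking rule are assumptions.

```python
from collections import Counter, defaultdict

def smooth_labels_by_cluster(predicted, cluster_of):
    """Relabel every host with the majority predicted label of its cluster.

    predicted: dict host -> bool (True = spam), from any base classifier.
    cluster_of: dict host -> cluster id, from any graph clustering.
    Ties are broken toward non-spam (an assumption of this sketch).
    """
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    # A cluster is spam only if spam votes strictly outnumber non-spam votes.
    majority = {c: v[True] > v[False] for c, v in votes.items()}
    return {host: majority[cluster_of[host]] for host in predicted}

# Hypothetical initial predictions and cluster assignment.
predicted = {"a": True, "b": True, "c": False, "d": False, "e": False, "f": True}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
final = smooth_labels_by_cluster(predicted, cluster_of)
```

In this toy input, host `c` is flipped to spam and host `f` to non-spam, because each disagrees with the majority of its cluster.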
  • 145. Idea 1: Clustering – Results
                               Baseline   Clustering
      Without bagging
        True positive rate      75.6%      74.5%
        False positive rate      8.5%       6.8%
        F-Measure               0.646      0.673
      With bagging
        True positive rate      78.7%      76.9%
        False positive rate      5.7%       5.0%
        F-Measure               0.723      0.728
      ✓ Reduces error rate
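For reference, the three measures reported in the results tables can be computed from confusion-matrix counts as sketched here. The slides do not define F-Measure, so this sketch assumes the usual harmonic mean of precision and recall for the spam class. The counts are hypothetical and are not taken from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate, F-measure).

    tp/fn: spam hosts detected / missed; fp/tn: normal hosts flagged / passed.
    """
    tpr = tp / (tp + fn)          # recall on the spam class
    fpr = fp / (fp + tn)          # fraction of normal hosts wrongly flagged
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, f_measure

# Hypothetical counts: 100 spam hosts and 200 normal hosts.
tpr, fpr, f1 = detection_metrics(tp=80, fp=20, tn=180, fn=20)
```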
  • 146. Idea 2: Propagate the label
      Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes.
  • 147. Idea 2: Propagate the label (cont.) Initial prediction:
  • 148. Idea 2: Propagate the label (cont.) Propagation:
  • 149. Idea 2: Propagate the label (cont.) Final prediction, applying a threshold:
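A random walk with restart whose restart distribution is proportional to the initial spamicity can be sketched with plain power iteration. The damping factor, iteration count, handling of dangling nodes, and the final threshold are all assumptions of this sketch, since the slides do not specify them.

```python
def propagate_spamicity(out_links, spamicity, damping=0.85, iterations=50):
    """Random walk with restart; restart distribution proportional to spamicity.

    out_links: dict node -> list of successor nodes (every node is a key).
    spamicity: dict node -> classifier score in [0, 1].
    Returns a dict node -> stationary score (scores sum to 1).
    """
    total = sum(spamicity.values())
    restart = {n: spamicity[n] / total for n in out_links}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * restart[n] for n in out_links}
        for n, succs in out_links.items():
            if succs:
                share = damping * score[n] / len(succs)
                for m in succs:
                    nxt[m] += share
            else:
                # Dangling node: return its mass to the restart distribution.
                for m in out_links:
                    nxt[m] += damping * score[n] * restart[m]
        score = nxt
    return score

# Toy graph: s2 got a low initial score but sits next to a likely spam host.
out_links = {"s1": ["s2"], "s2": ["s1"], "n1": ["n2"], "n2": ["n1"]}
spamicity = {"s1": 0.9, "s2": 0.2, "n1": 0.1, "n2": 0.1}
score = propagate_spamicity(out_links, spamicity)

# Hypothetical threshold: flag nodes scoring above the uniform share 1/N.
flagged = {n for n, s in score.items() if s > 1 / len(score)}
```

Running the same function on the reversed adjacency lists would propagate along in-links instead of out-links.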
  • 150. Idea 2: Propagate the label – Results
                               Baseline   Forward   Backward   Both
      Classifier without bagging
        True positive rate      75.6%      70.9%     69.4%     71.4%
        False positive rate      8.5%       6.1%      5.8%      5.8%
        F-Measure               0.646      0.665     0.664     0.676
      Classifier with bagging
        True positive rate      78.7%      76.5%     75.0%     75.2%
        False positive rate      5.7%       5.4%      4.3%      4.7%
        F-Measure               0.723      0.716     0.733     0.724
  • 151. Idea 3: Stacked graphical learning
      Meta-learning scheme [Cohen and Kou, 2006]:
      - Derive initial predictions
      - Generate an additional attribute for each object by combining predictions on neighbors in the graph
      - Append the additional attribute to the data and retrain
  • 152. Idea 3: Stacked graphical learning (cont.)
      Let $p(x) \in [0, 1]$ be the prediction of a classification algorithm for a host $x$ using $k$ features.
      Let $N(x)$ be the set of pages related to $x$ (in some way).
      Compute $f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$.
      Add $f(x)$ as an extra feature for instance $x$ and learn a new model with $k + 1$ features.
  • 153. Idea 3: Stacked graphical learning (cont.) Initial prediction:
  • 154. Idea 3: Stacked graphical learning (cont.) Computation of new feature:
  • 155. Idea 3: Stacked graphical learning (cont.) New prediction with k + 1 features:
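The feature-generation step of stacked graphical learning, averaging the base predictions over the neighbors of each host and appending the result as feature k + 1, can be sketched as follows. The function names and toy data are hypothetical. Retraining on the extended vectors is left to whichever learner produced the initial predictions.

```python
def neighbor_average(p, neighbors):
    """f(x) = sum of p(g) for g in N(x), divided by |N(x)|.

    Hosts with no neighbors fall back to their own prediction p(x)
    (an assumption; the slides do not say how an empty N(x) is handled).
    """
    f = {}
    for x, nbrs in neighbors.items():
        f[x] = sum(p[g] for g in nbrs) / len(nbrs) if nbrs else p[x]
    return f

def append_feature(features, f):
    """Return (k + 1)-dimensional feature vectors with f(x) appended."""
    return {x: vec + [f[x]] for x, vec in features.items()}

# Hypothetical base predictions, neighbor sets, and k = 2 original features.
p = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.3}
neighbors = {"a": ["b"], "b": ["a", "d"], "c": [], "d": ["a", "b", "c"]}
features = {"a": [1.0, 0.0], "b": [0.5, 0.2], "c": [0.0, 0.9], "d": [0.4, 0.4]}

f = neighbor_average(p, neighbors)
stacked = append_feature(features, f)   # retrain any classifier on these vectors
```

Repeating the procedure with the predictions of the retrained model in place of `p` would correspond to a second pass.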
  • 156. Idea 3: Stacked graphical learning - Results
                               Baseline   Avg. of in   Avg. of out   Avg. of both
        True positive rate      78.7%      84.4%        78.3%         85.2%
        False positive rate      5.7%       6.7%         4.8%          6.1%
        F-Measure               0.723      0.733        0.742         0.750
      ✓ Increases detection rate
  • 157. Idea 3: Stacked graphical learning x2
      And repeat ...
                               Baseline   First pass   Second pass
        True positive rate      78.7%      85.2%        88.4%
        False positive rate      5.7%       6.1%         6.3%
        F-Measure               0.723      0.750        0.763
      ✓ Significant improvement over the baseline
  • 158. 1 Hypothesis; 2 Levels of link analysis; 3 Ranking; 4 Web spam; 5 ... detection; 6 ... links; 7 ... contents; 8 ... both; 9 Summary
  • 159. Concluding remarks
      Hypothesis: topical locality + link endorsement
      Primitives: similarity, ranking, propagation, etc.
      Application to Web spam
  • 162. Thank you!
  • 163.
      Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
      Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007). Characterization of national web domains. ACM Transactions on Internet Technology, 7(2).
      Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks, 50(10):1464–1473.
      Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002). Web structure, dynamics and page quality. In Proceedings of String Processing and Information Retrieval (SPIRE), volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer.
      Barabási, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
      Barabási, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
  • 164.
      Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
      Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
      Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. (2006). A reference collection for web spam. SIGIR Forum, 40(2):11–24.
      Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of SIGIR, Amsterdam, Netherlands. ACM.
      Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.