SlideShare une entreprise Scribd logo
1  sur  48
Special Topics in
 Search Engines
     Result Summaries
      Anti-spamming
    Duplicate elimination
Results summaries
Summaries
   Having ranked the documents matching a
    query, we wish to present a results list
   Most commonly, the document title plus a
    short summary
   The title is typically automatically extracted
    from document metadata
   What about the summaries?
Summaries
   Two basic kinds:
       Static
       Dynamic

    A static summary of a document is always
    the same, regardless of the query that hit
    the doc
   Dynamic summaries are query-dependent
    attempt to explain why the document was
    retrieved for the query at hand
Static summaries
   In typical systems, the static summary is a
    subset of the document
   Simplest heuristic: the first 50 (or so – this
    can be varied) words of the document
       Summary cached at indexing time
   More sophisticated: extract from each
    document a set of “key” sentences
       Simple NLP heuristics to score each sentence
       Summary is made up of top-scoring
        sentences.
   Most sophisticated: NLP used to synthesize
    a summary
       Seldom used in IR; cf. text summarization
Dynamic summaries
   Present one or more “windows” within the
    document that contain several of the query
    terms
       “KWIC” snippets: Keyword in Context
        presentation
   Generated in conjunction with scoring
       If query found as a phrase, the/some
        occurrences of the phrase in the doc
       If not, windows within the doc that contain
        multiple query terms
   The summary itself gives the entire content
    of the window – all terms, not only the query
Generating dynamic summaries
   If we have only a positional index, we cannot
    (easily) reconstruct context surrounding hits
   If we cache the documents at index time, can
    run the window through it, cueing to hits
    found in the positional index
       E.g., positional index says “the query is a
        phrase in position 4378” so we go to this
        position in the cached document and stream
        out the content
   Most often, cache a fixed-size prefix of the
    doc
       Note: Cached copy can be outdated
Dynamic summaries
   Producing good dynamic summaries is a
    tricky optimization problem
       The real estate for the summary is normally
        small and fixed
       Want short item, so show as many KWIC
        matches as possible, and perhaps other
        things like title
       Want snippets to be long enough to be useful
       Want linguistically well-formed snippets:
        users prefer snippets that contain complete
        phrases
       Want snippets maximally informative about
        doc
   But users really like snippets, even if they
    complicate IR system design
Anti-spamming
Adversarial IR (Spam)
   Motives
       Commercial, political, religious, lobbies
       Promotion funded by advertising budget
   Operators
       Contractors (Search Engine Optimizers) for lobbies,
        companies
       Web masters
       Hosting services
   Forum
       Web master world ( www.webmasterworld.com )
            Search engine specific tricks
            Discussions about academic papers 
Search Engine Optimization II
Search Engine Optimization
       Adversarial IR
       Adversarial IR
  (“search engine wars”)
   (“search engine wars”)
Can you trust words on the page?
auctions.hitsoffice.com/




     Pornographic           www.ebay.com/
        Content




Examples from July 2002
Simplest forms
   Early engines relied on the density of terms
        The top-ranked pages for the query maui
         resort were the ones containing the most
         maui’s and resort’s
   SEOs responded with dense repetitions of
    chosen terms
        e.g., maui resort maui resort maui resort
        Often, the repetitions would be in the same
         color as the background of the web page
             Repeated terms got indexed by crawlers
             But not visible to humans on browsers

    Can’t trust the words on a web page, for ranking.
A few spam technologies
   Cloaking
       Serve fake content to search engine robot
       DNS cloaking: Switch IP address. Impersonate
   Doorway pages
       Pages optimized for a single keyword that re-
        direct to the real target page
   Keyword Spam
       Misleading meta-keywords, excessive
        repetition of a term, fake “anchor text”
       Hidden text with colors, CSS tricks, etc.
   Link spamming
       Mutual admiration societies, hidden links,
        awards
       Domain flooding: numerous domains that
        point or re-direct to a target page
   Robots
       Fake click stream
       Fake query stream
       Millions of submissions via Add-Url
More spam techniques
   Cloaking
 Serve fake content to search engine spider
 DNS cloaking: Switch IP address. Impersonate




                                          SPAM
                                      Y

                   Is this a Search
                   Engine spider?

                                      N   Real
               Cloaking                   Doc
Tutorial on
    Tutorial on
Cloaking & Stealth
Cloaking & Stealth
   Technology
    Technology
Variants of keyword stuffing
   Misleading meta-tags, excessive repetition
   Hidden text with colors, style sheet tricks,
    etc.

      Meta-Tags =
      “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation,
      sex, mp3, britney spears, viagra, …”
More spam techniques
   Doorway pages
       Pages optimized for a single keyword that re-
        direct to the real target page
   Link spamming
       Mutual admiration societies, hidden links,
        awards – more on these later
       Domain flooding: numerous domains that
        point or re-direct to a target page
   Robots
       Fake query stream – rank checking programs
            “Curve-fit” ranking programs of search engines
       Millions of submissions via Add-Url
The war against spam
Quality signals - Prefer authoritative
pages based on:
   Votes from authors (linkage signals)
   Votes from users (usage signals)
   Policing of URL submissions
   Anti robot test
 Limits on meta-keywords
Robust link analysis
   Ignore statistically implausible linkage (or text)
   Use link analysis to detect spammers (guilt by
    association)
The war against spam
   Spam recognition by machine learning
       Training set based on known spam
   Family friendly filters
       Linguistic analysis, general classification
        techniques, etc.
       For images: flesh tone detectors, source text
        analysis, etc.
   Editorial intervention
       Blacklists
       Top queries audited
       Complaints addressed
Acid test
   Which SEO’s rank highly on the query seo?
   Web search engines have policies on SEO
    practices they tolerate/block
       See pointers in Resources
   Adversarial IR: the unending (technical)
    battle between SEO’s and web search
    engines
   See for instance
    http://airweb.cse.lehigh.edu/
Duplicate detection
Duplicate/Near-Duplicate Detection

   Duplication: Exact match with fingerprints
   Near-Duplication: Approximate match
   Overview
       Compute syntactic similarity with an edit-
        distance measure
       Use similarity threshold to detect near-
        duplicates
         
             E.g., Similarity > 80% => Documents are “near
             duplicates”
         
             Not transitive though sometimes used
             transitively
Computing Similarity
   Segments of a document (natural or artificial
    breakpoints) [Brin95]
   Shingles (Word k-Grams) [Brin95, Brod98]
        “a rose is a rose is a rose” =>
           a_rose_is_a
              rose_is_a_rose
                    is_a_rose_is
   Similarity Measure between two docs (= sets
    of shingles)
       Set intersection [Brod98]
        (Specifically, Size_of_Intersection /
        Size_of_Union )
                   Jaccard measure
Shingles + Set Intersection
Computing exact set intersection of shingles
between all pairs of documents is expensive
   Approximate using a cleverly chosen subset of
    shingles from each (a sketch)
   Estimate Jaccard from a short sketch
Create a “sketch vector” (e.g., of size 200) for
each document
   Documents which share more than t (say 80%)
    corresponding vector elements are similar
    For doc d, sketchd[i] is computed as follows:
     
         Let f map all shingles in the universe to 0..2 m
        Let πi be a specific random permutation on 0..2 m
        Pick MIN πi (f(s)) over all shingles s in d
Shingling with sampling
minima
   Given two documents A1, A2.
   Let S1 and S2 be their shingle sets
   Resemblance = |Intersection of S1 and S2| / |
    Union of S1 and S2|.
   Let Alpha = min ( π (S1))
   Let Beta = min (π(S2))
       Probability (Alpha = Beta) = Resemblance
Computing Sketch[i] for Doc1
    Document 1


                 264   Start with 64 bit shingles

                 264
                       Permute on the number line
                 264   with   πi
                 264   Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]

       Document 1                 Document 2


                         264                      264
                         264                      264

                         264                      264
       A                              B
                         2   64                   264


              Are these equal?
Test for 200 random permutations: π1, π2,… π200
However…
            Document 1                 Document 2

                                 264                264
                                 264                264
            A
                                 264    B           264
                                 264                264

A = B iff the shingle with the MIN value in the union of
Doc1 and Doc2 is common to both (I.e., lies in the
intersection)

This happens with probability:
       Size_of_intersection / Size_of_union
                     Why?
Set Similarity
   Set Similarity (Jaccard measure)
                                    Ci  C j
                simJ(Ci , C j ) =
                                    Ci  C j
   View sets as columns of a matrix; one row for
    each element in the universe. aij = 1 indicates
    presence of item i in set j
    Example      C1 C2

                  0   1
                  1   0
                  1   1      simJ(C1,C2) = 2/5 = 0.4
                  0   0
                  1   1
                  0   1
Key Observation
   For columns Ci, Cj, four types of rows
             Ci     Cj
       A      1      1
       B      1      0
       C      0      1
       D      0      0
   Overload notation: A = # of rows of type A
   Claim                     A
              simJ(Ci , C j ) =
                                  A+B+C
Min Hashing
   Randomly permute rows
   h(Ci) = index of first row with 1 in column Ci
   Surprising Property

   Why? P [ h(Ci ) = h(C j ) ] = simJ ( Ci , C j )
        Both are A/(A+B+C)
        Look down columns Ci, Cj until first non-Type-
         D row
        h(Ci) = h(Cj)  type A row
Mirror Detection
   Mirroring is systematic replication of web pages
    across hosts.
      Single largest cause of duplication on the web

   Host1/α and Host2/β are mirrors iff
        For all (or most) paths p such that when
          http://Host1/ α / p exists
          http://Host2/ β / p exists as well
        with identical (or near identical) content, and
          vice versa.
Mirror Detection example
   http://www.elsevier.com/ and http://www.elsevier.nl/
   Structural Classification of Proteins
       http://scop.mrc-lmb.cam.ac.uk/scop
       http://scop.berkeley.edu/
       http://scop.wehi.edu.au/scop
       http://pdb.weizmann.ac.il/scop
       http://scop.protres.ru/
Repackaged Mirrors
Auctions.msn.com        Auctions.lycos.com




                                   Aug
Motivation
   Why detect mirrors?
       Smart crawling
         
             Fetch from the fastest or freshest server
         
             Avoid duplication
       Better connectivity analysis
         
             Combine inlinks
         
             Avoid double counting outlinks
       Redundancy in result listings
            “If that fails you can try: <mirror>/samepath”
       Proxy caching
Bottom Up Mirror Detection
[Cho00]
   Maintain clusters of subgraphs
   Initialize clusters of trivial subgraphs
       Group near-duplicate single documents into a cluster
   Subsequent passes
       Merge clusters of the same cardinality and corresponding linkage




       Avoid decreasing cluster cardinality
   To detect mirrors we need:
       Adequate path overlap
       Contents of corresponding pages within a small time range
Can we use URLs to find
    mirrors?
                 www.synthesis.org                               synthesis.stanford.edu


                 a              b                                    a             b
                                           d                                                  d
                         c                                                   c

www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html     synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…
www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
                                                         synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced…
www.synthesis.org/Docs/annual.report96.final.html        synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-…
www.synthesis.org/Docs/cicee-berlin-paper.html           synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-…
www.synthesis.org/Docs/myr5                              synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-…
www.synthesis.org/Docs/myr5/cicee/bridge-gap.html        synthesis.stanford.edu/Docs/annual.report96.final.html
www.synthesis.org/Docs/myr5/cs/cs-meta.html              synthesis.stanford.edu/Docs/annual.report96.final_fn.html
www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
                                                         synthesis.stanford.edu/Docs/myr5/assessment
www.synthesis.org/Docs/myr5/mech/mech-take-home.html synthesis.stanford.edu/Docs/myr5/assessment/assessment-…
www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
                                                         synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-…
www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html   synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
www.synthesis.org/Docs/yr5ar                             synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
www.synthesis.org/Docs/yr5ar/assess                      synthesis.stanford.edu/Docs/myr5/cicee
www.synthesis.org/Docs/yr5ar/cicee                       synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html       synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html
                                                         synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
Top Down Mirror Detection
    [Bhar99, Bhar00c]
   E.g.,
     www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
     synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
   What features could indicate mirroring?
        Hostname similarity:
             word unigrams and bigrams: { www, www.synthesis, synthesis, …}
        Directory similarity:
             Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }
        IP address similarity:
             3 or 4 octet overlap
             Many hosts sharing an IP address => virtual hosting by an ISP
        Host outlink overlap
        Path overlap
             Potentially, path + sketch overlap
Implementation
   Phase I - Candidate Pair Detection
        Find features that pairs of hosts have in common
        Compute a list of host pairs which might be mirrors
   Phase II - Host Pair Validation
         Test each host pair and determine extent of mirroring
          Check if 20 paths sampled from Host1 have near-

           duplicates on Host2 and vice versa
          Use transitive inferences:

               IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
               IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
   Evaluation
       140 million URLs on 230,000 hosts (1999)
       Best approach combined 5 sets of features
          Top 100,000 host pairs had precision = 0.57 and recall =

           0.86
WebIR Infrastructure
   Connectivity Server
       Fast access to links to support for link
        analysis
   Term Vector Database
       Fast access to document vectors to augment
        link analysis
Connectivity Server
[CS1: Bhar98b, CS2 & 3: Rand01]
   Fast web graph access to support connectivity
    analysis
   Stores mappings in memory from
       
           URL to outlinks, URL to inlinks
   Applications
       
           HITS, Pagerank computations
       
           Crawl simulation
       
           Graph algorithms: web connectivity, diameter etc.
               more on this later
       
           Visualizations
Usage
    Input
                      Execution             Output
   Graph
  algorithm            Graph                URLs
      +       URLs    algorithm   IDs         +
    URLs      to       runs in    to        Values
      +       FPs     memory      URLs
   Values     to
              IDs

    Translation Tables on Disk
    URL text: 9 bytes/URL (compressed from ~80 bytes )
    FP(64b) -> ID(32b): 5 bytes
    ID(32b) -> FP(64b): 8 bytes
    ID(32b) -> URLs: 0.5 bytes
E.g., HIGH IDs:
    ID assignment                        Max(indegree , outdegree) > 254

   Partition URLs into 3 sets, sorted
                                         ID                URL
    lexicographically
        High: Max degree > 254          …
        Medium: 254 > Max degree > 24   9891       www.amazon.com/
        Low: remaining (75%)            9912       www.amazon.com/jobs/
                                         …
   IDs assigned in sequence (densely)
                                         9821878    www.geocities.com/
                                         …
                                         40930030   www.google.com/

    Adjacency lists                      …

     In memory tables for Outlinks,
      Inlinks                            85903590   www.yahoo.com/

     List index maps from a Source
      ID to start of adjacency list
Adjacency List Compression - I

                        …
           …           98                             …
                       132               …            -6
    104                153                            34
    105                98         104                 21
    106                147        105                 -8
                       153        106                 49
                        …                              6
           …                                          …
                     Sequence            …
                         of                         Delta
           List      Adjacency                    Encoded
          Index        Lists             List     Adjacency
                                        Index       Lists
• Adjacency List:
     - Smaller delta values are exponentially more frequent (80% to same host)
     - Compress deltas with variable length encoding (e.g., Huffman)
• List Index pointers: 32b for high, Base+16b for med, Base+8b for low
     - Avg = 12b per pointer
Adjacency List Compression - II

   Inter List Compression
       Basis: Similar URLs may share links
            Close in ID space => adjacency lists may overlap
       Approach
            Define a representative adjacency list for a block of IDs
                  Adjacency list of a reference ID
                  Union of adjacency lists in the block
            Represent adjacency list in terms of deletions and additions
             when it is cheaper to do so
       Measurements
            Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
            Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
Term Vector Database
    [Stat00]
   Fast access to 50 word term vectors for web pages
        Term Selection:
             Restricted to middle 1/3rd of lexicon by document frequency
             Top 50 words in document by TF.IDF.
        Term Weighting:
             Deferred till run-time (can be based on term freq, doc freq, doc length)
   Applications
        Content + Connectivity analysis (e.g., Topic Distillation)
        Topic specific crawls
        Document classification
   Performance
        Storage: 33GB for 272M term vectors
        Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk
         block)
Architecture
URLid * 64 /480

                                       offset

                                         URL Info
            Base (4 bytes)
                                          LC:TID

                               Terms
                                          LC:TID
                                            …        128
              Bit vector                             Byte
                 For                      LC:TID      TV
             480 URLids                             Record
                                         FRQ:RL
                                         FRQ:RL
                               Freq
                                            …
                                         FRQ:RL
        URLid to Term Vector
               Lookup

Contenu connexe

Similaire à Special Topics in Search Engine Result Summaries and Anti-spamming Techniques

Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
LLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureAggregage
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalA. LE
 
Measuring Your Code
Measuring Your CodeMeasuring Your Code
Measuring Your CodeNate Abele
 
Measuring Your Code 2.0
Measuring Your Code 2.0Measuring Your Code 2.0
Measuring Your Code 2.0Nate Abele
 
Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Chris Fregly
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016MLconf
 
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
 New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
New Tools for an Old Art: Rhetorical Analysis Through Visualization and PlayShannan Butler
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - IndexingSean Golliher
 
PoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus ManagementPoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus ManagementAndreas Blumauer
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialLeeFeigenbaum
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...source{d}
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic
 
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and SemanticsQuality, Quantity, Web and Semantics
Quality, Quantity, Web and SemanticsZemanta
 
Quality, quantity, web and semantics
Quality, quantity, web and semanticsQuality, quantity, web and semantics
Quality, quantity, web and semanticsAndraz Tori
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Tobias Wunner
 
The search engine index
The search engine indexThe search engine index
The search engine indexCJ Jenkins
 

Similaire à Special Topics in Search Engine Result Summaries and Anti-spamming Techniques (20)

Anatomy of google
Anatomy of googleAnatomy of google
Anatomy of google
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
LLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team Structure
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Measuring Your Code
Measuring Your CodeMeasuring Your Code
Measuring Your Code
 
Measuring Your Code 2.0
Measuring Your Code 2.0Measuring Your Code 2.0
Measuring Your Code 2.0
 
Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
 New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
New Tools for an Old Art: Rhetorical Analysis Through Visualization and Play
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
 
PoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus ManagementPoolParty Advanced Thesaurus Management
PoolParty Advanced Thesaurus Management
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
 
Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...Code as Data workshop: Using source{d} Engine to extract insights from git re...
Code as Data workshop: Using source{d} Engine to extract insights from git re...
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation Defense
 
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and SemanticsQuality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
 
Quality, quantity, web and semantics
Quality, quantity, web and semanticsQuality, quantity, web and semantics
Quality, quantity, web and semantics
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
 
The search engine index
The search engine indexThe search engine index
The search engine index
 

Dernier

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 

Dernier (20)

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of Powders
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 

Special Topics in Search Engine Result Summaries and Anti-spamming Techniques

  • 1. Special Topics in Search Engines Result Summaries Anti-spamming Duplicate elimination
  • 3. Summaries  Having ranked the documents matching a query, we wish to present a results list  Most commonly, the document title plus a short summary  The title is typically automatically extracted from document metadata  What about the summaries?
  • 4. Summaries  Two basic kinds:  Static  Dynamic  A static summary of a document is always the same, regardless of the query that hit the doc  Dynamic summaries are query-dependent attempt to explain why the document was retrieved for the query at hand
  • 5. Static summaries  In typical systems, the static summary is a subset of the document  Simplest heuristic: the first 50 (or so – this can be varied) words of the document  Summary cached at indexing time  More sophisticated: extract from each document a set of “key” sentences  Simple NLP heuristics to score each sentence  Summary is made up of top-scoring sentences.  Most sophisticated: NLP used to synthesize a summary  Seldom used in IR; cf. text summarization
  • 6. Dynamic summaries  Present one or more “windows” within the document that contain several of the query terms  “KWIC” snippets: Keyword in Context presentation  Generated in conjunction with scoring  If query found as a phrase, the/some occurrences of the phrase in the doc  If not, windows within the doc that contain multiple query terms  The summary itself gives the entire content of the window – all terms, not only the query
  • 7. Generating dynamic summaries  If we have only a positional index, we cannot (easily) reconstruct context surrounding hits  If we cache the documents at index time, can run the window through it, cueing to hits found in the positional index  E.g., positional index says “the query is a phrase in position 4378” so we go to this position in the cached document and stream out the content  Most often, cache a fixed-size prefix of the doc  Note: Cached copy can be outdated
  • 8. Dynamic summaries  Producing good dynamic summaries is a tricky optimization problem  The real estate for the summary is normally small and fixed  Want short item, so show as many KWIC matches as possible, and perhaps other things like title  Want snippets to be long enough to be useful  Want linguistically well-formed snippets: users prefer snippets that contain complete phrases  Want snippets maximally informative about doc  But users really like snippets, even if they complicate IR system design
  • 10. Adversarial IR (Spam)  Motives  Commercial, political, religious, lobbies  Promotion funded by advertising budget  Operators  Contractors (Search Engine Optimizers) for lobbies, companies  Web masters  Hosting services  Forum  Web master world ( www.webmasterworld.com )  Search engine specific tricks  Discussions about academic papers 
  • 11. Search Engine Optimization II Search Engine Optimization Adversarial IR Adversarial IR (“search engine wars”) (“search engine wars”)
  • 12. Can you trust words on the page? auctions.hitsoffice.com/ Pornographic www.ebay.com/ Content Examples from July 2002
  • 13. Simplest forms  Early engines relied on the density of terms  The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s  SEOs responded with dense repetitions of chosen terms  e.g., maui resort maui resort maui resort  Often, the repetitions would be in the same color as the background of the web page  Repeated terms got indexed by crawlers  But not visible to humans on browsers Can’t trust the words on a web page, for ranking.
  • 14. A few spam technologies  Cloaking  Serve fake content to search engine robot  DNS cloaking: Switch IP address. Impersonate  Doorway pages  Pages optimized for a single keyword that re- direct to the real target page  Keyword Spam  Misleading meta-keywords, excessive repetition of a term, fake “anchor text”  Hidden text with colors, CSS tricks, etc.  Link spamming  Mutual admiration societies, hidden links, awards  Domain flooding: numerous domains that point or re-direct to a target page  Robots  Fake click stream  Fake query stream  Millions of submissions via Add-Url
  • 15. More spam techniques  Cloaking  Serve fake content to search engine spider  DNS cloaking: Switch IP address. Impersonate SPAM Y Is this a Search Engine spider? N Real Cloaking Doc
  • 16. Tutorial on Tutorial on Cloaking & Stealth Cloaking & Stealth Technology Technology
  • 17. Variants of keyword stuffing  Misleading meta-tags, excessive repetition  Hidden text with colors, style sheet tricks, etc. Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
  • 18. More spam techniques  Doorway pages  Pages optimized for a single keyword that re- direct to the real target page  Link spamming  Mutual admiration societies, hidden links, awards – more on these later  Domain flooding: numerous domains that point or re-direct to a target page  Robots  Fake query stream – rank checking programs  “Curve-fit” ranking programs of search engines  Millions of submissions via Add-Url
  • 19. The war against spam Quality signals - Prefer authoritative pages based on:  Votes from authors (linkage signals)  Votes from users (usage signals)  Policing of URL submissions  Anti robot test  Limits on meta-keywords Robust link analysis  Ignore statistically implausible linkage (or text)  Use link analysis to detect spammers (guilt by association)
  • 20. The war against spam  Spam recognition by machine learning  Training set based on known spam  Family friendly filters  Linguistic analysis, general classification techniques, etc.  For images: flesh tone detectors, source text analysis, etc.  Editorial intervention  Blacklists  Top queries audited  Complaints addressed
  • 21. Acid test  Which SEO’s rank highly on the query seo?  Web search engines have policies on SEO practices they tolerate/block  See pointers in Resources  Adversarial IR: the unending (technical) battle between SEO’s and web search engines  See for instance http://airweb.cse.lehigh.edu/
  • 23. Duplicate/Near-Duplicate Detection  Duplication: Exact match with fingerprints  Near-Duplication: Approximate match  Overview  Compute syntactic similarity with an edit- distance measure  Use similarity threshold to detect near- duplicates  E.g., Similarity > 80% => Documents are “near duplicates”  Not transitive though sometimes used transitively
  • 24. Computing Similarity  Segments of a document (natural or artificial breakpoints) [Brin95]  Shingles (Word k-Grams) [Brin95, Brod98] “a rose is a rose is a rose” => a_rose_is_a rose_is_a_rose is_a_rose_is  Similarity Measure between two docs (= sets of shingles)  Set intersection [Brod98] (Specifically, Size_of_Intersection / Size_of_Union ) Jaccard measure
  • 25. Shingles + Set Intersection Computing exact set intersection of shingles between all pairs of documents is expensive  Approximate using a cleverly chosen subset of shingles from each (a sketch)  Estimate Jaccard from a short sketch Create a “sketch vector” (e.g., of size 200) for each document  Documents which share more than t (say 80%) corresponding vector elements are similar  For doc d, sketchd[i] is computed as follows:  Let f map all shingles in the universe to 0..2 m  Let πi be a specific random permutation on 0..2 m  Pick MIN πi (f(s)) over all shingles s in d
  • 26. Shingling with sampling minima  Given two documents A1, A2.  Let S1 and S2 be their shingle sets  Resemblance = |Intersection of S1 and S2| / | Union of S1 and S2|.  Let Alpha = min ( π (S1))  Let Beta = min (π(S2))  Probability (Alpha = Beta) = Resemblance
  • 27. Computing Sketch[i] for Doc1 Document 1 264 Start with 64 bit shingles 264 Permute on the number line 264 with πi 264 Pick the min value
  • 28. Test if Doc1.Sketch[i] = Doc2.Sketch[i] Document 1 Document 2 264 264 264 264 264 264 A B 2 64 264 Are these equal? Test for 200 random permutations: π1, π2,… π200
  • 29. However… Document 1 Document 2 264 264 264 264 A 264 B 264 264 264 A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (I.e., lies in the intersection) This happens with probability: Size_of_intersection / Size_of_union Why?
  • 30. Set Similarity  Set Similarity (Jaccard measure) Ci  C j simJ(Ci , C j ) = Ci  C j  View sets as columns of a matrix; one row for each element in the universe. aij = 1 indicates presence of item i in set j  Example C1 C2 0 1 1 0 1 1 simJ(C1,C2) = 2/5 = 0.4 0 0 1 1 0 1
  • 31. Key Observation  For columns Ci, Cj, four types of rows Ci Cj A 1 1 B 1 0 C 0 1 D 0 0  Overload notation: A = # of rows of type A  Claim A simJ(Ci , C j ) = A+B+C
  • 32. Min Hashing  Randomly permute rows  h(Ci) = index of first row with 1 in column Ci  Surprising Property  Why? P [ h(Ci ) = h(C j ) ] = simJ ( Ci , C j )  Both are A/(A+B+C)  Look down columns Ci, Cj until first non-Type- D row  h(Ci) = h(Cj)  type A row
  • 33. Mirror Detection  Mirroring is systematic replication of web pages across hosts.  Single largest cause of duplication on the web  Host1/α and Host2/β are mirrors iff For all (or most) paths p such that when http://Host1/ α / p exists http://Host2/ β / p exists as well with identical (or near identical) content, and vice versa.
  • 34. Mirror Detection example  http://www.elsevier.com/ and http://www.elsevier.nl/  Structural Classification of Proteins  http://scop.mrc-lmb.cam.ac.uk/scop  http://scop.berkeley.edu/  http://scop.wehi.edu.au/scop  http://pdb.weizmann.ac.il/scop  http://scop.protres.ru/
  • 35. Repackaged Mirrors Auctions.msn.com Auctions.lycos.com Aug
  • 36. Motivation  Why detect mirrors?  Smart crawling  Fetch from the fastest or freshest server  Avoid duplication  Better connectivity analysis  Combine inlinks  Avoid double counting outlinks  Redundancy in result listings  “If that fails you can try: <mirror>/samepath”  Proxy caching
  • 37. Bottom Up Mirror Detection [Cho00]  Maintain clusters of subgraphs  Initialize clusters of trivial subgraphs  Group near-duplicate single documents into a cluster  Subsequent passes  Merge clusters of the same cardinality and corresponding linkage  Avoid decreasing cluster cardinality  To detect mirrors we need:  Adequate path overlap  Contents of corresponding pages within a small time range
  • 38. Can we use URLs to find mirrors? www.synthesis.org synthesis.stanford.edu a b a b d d c c www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-… www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced… www.synthesis.org/Docs/annual.report96.final.html synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-… www.synthesis.org/Docs/cicee-berlin-paper.html synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-… www.synthesis.org/Docs/myr5 synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-… www.synthesis.org/Docs/myr5/cicee/bridge-gap.html synthesis.stanford.edu/Docs/annual.report96.final.html www.synthesis.org/Docs/myr5/cs/cs-meta.html synthesis.stanford.edu/Docs/annual.report96.final_fn.html www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html synthesis.stanford.edu/Docs/myr5/assessment www.synthesis.org/Docs/myr5/mech/mech-take-home.html synthesis.stanford.edu/Docs/myr5/assessment/assessment-… www.synthesis.org/Docs/myr5/synsys/experiential-learning.html synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-… www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html www.synthesis.org/Docs/yr5ar synthesis.stanford.edu/Docs/myr5/assessment/not-available.html www.synthesis.org/Docs/yr5ar/assess synthesis.stanford.edu/Docs/myr5/cicee www.synthesis.org/Docs/yr5ar/cicee synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html
  • 39. Top Down Mirror Detection [Bhar99, Bhar00c]  E.g., www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html  What features could indicate mirroring?  Hostname similarity:  word unigrams and bigrams: { www, www.synthesis, synthesis, …}  Directory similarity:  Positional path bigrams { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }  IP address similarity:  3 or 4 octet overlap  Many hosts sharing an IP address => virtual hosting by an ISP  Host outlink overlap  Path overlap  Potentially, path + sketch overlap
  • 40. Implementation  Phase I - Candidate Pair Detection  Find features that pairs of hosts have in common  Compute a list of host pairs which might be mirrors  Phase II - Host Pair Validation  Test each host pair and determine extent of mirroring  Check if 20 paths sampled from Host1 have near- duplicates on Host2 and vice versa  Use transitive inferences: IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B) IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)  Evaluation  140 million URLs on 230,000 hosts (1999)  Best approach combined 5 sets of features  Top 100,000 host pairs had precision = 0.57 and recall = 0.86
  • 41. WebIR Infrastructure  Connectivity Server  Fast access to links to support for link analysis  Term Vector Database  Fast access to document vectors to augment link analysis
  • 42. Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]  Fast web graph access to support connectivity analysis  Stores mappings in memory from  URL to outlinks, URL to inlinks  Applications  HITS, Pagerank computations  Crawl simulation  Graph algorithms: web connectivity, diameter etc.  more on this later  Visualizations
  • 43. Usage Input Execution Output Graph algorithm Graph URLs + URLs algorithm IDs + URLs to runs in to Values + FPs memory URLs Values to IDs Translation Tables on Disk URL text: 9 bytes/URL (compressed from ~80 bytes ) FP(64b) -> ID(32b): 5 bytes ID(32b) -> FP(64b): 8 bytes ID(32b) -> URLs: 0.5 bytes
  • 44. E.g., HIGH IDs: ID assignment Max(indegree , outdegree) > 254  Partition URLs into 3 sets, sorted ID URL lexicographically  High: Max degree > 254 …  Medium: 254 > Max degree > 24 9891 www.amazon.com/  Low: remaining (75%) 9912 www.amazon.com/jobs/ …  IDs assigned in sequence (densely) 9821878 www.geocities.com/ … 40930030 www.google.com/ Adjacency lists …  In memory tables for Outlinks, Inlinks 85903590 www.yahoo.com/  List index maps from a Source ID to start of adjacency list
  • 45. Adjacency List Compression - I … … 98 … 132 … -6 104 153 34 105 98 104 21 106 147 105 -8 153 106 49 … 6 … … Sequence … of Delta List Adjacency Encoded Index Lists List Adjacency Index Lists • Adjacency List: - Smaller delta values are exponentially more frequent (80% to same host) - Compress deltas with variable length encoding (e.g., Huffman) • List Index pointers: 32b for high, Base+16b for med, Base+8b for low - Avg = 12b per pointer
  • 46. Adjacency List Compression - II  Inter List Compression  Basis: Similar URLs may share links  Close in ID space => adjacency lists may overlap  Approach  Define a representative adjacency list for a block of IDs  Adjacency list of a reference ID  Union of adjacency lists in the block  Represent adjacency list in terms of deletions and additions when it is cheaper to do so  Measurements  Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)  Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
  • 47. Term Vector Database [Stat00]  Fast access to 50 word term vectors for web pages  Term Selection:  Restricted to middle 1/3rd of lexicon by document frequency  Top 50 words in document by TF.IDF.  Term Weighting:  Deferred till run-time (can be based on term freq, doc freq, doc length)  Applications  Content + Connectivity analysis (e.g., Topic Distillation)  Topic specific crawls  Document classification  Performance  Storage: 33GB for 272M term vectors  Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
  • 48. Architecture URLid * 64 /480 offset URL Info Base (4 bytes) LC:TID Terms LC:TID … 128 Bit vector Byte For LC:TID TV 480 URLids Record FRQ:RL FRQ:RL Freq … FRQ:RL URLid to Term Vector Lookup

Notes de l'éditeur

  1. Arms race
  2. Small biotech firm ; query example from last time ; infoseek exapmle
  3. Talk about expert witness; george w bush example
  4. More complex problem of finding the “original” site