SlideShare a Scribd company logo
1 of 38
Download to read offline
Link Analysis

  Rajendra Akerkar
Vestlandsforsking, Norway
Objectives
       To review common approaches to l k analysis
                                h      link l

       To calculate the popularity of a site based on link
        analysis

       To model human judgments indirectly




2            R. Akerkar
Outline
        1.
        1    Early Approaches to Link Analysis
        2.   Hubs and Authorities: HITS
        3.   Page Rank
        4.   Stability
        5.   Probabilistic Link Analysis
        6.   Limitation of Link Analysis




3            R. Akerkar
Early Approaches
    Basic Assumptions
       • Hyperlinks contain information about the human
         judgment of a site
       • The more incoming links to a site the more it is
                                      site,
         judged important
    Bray 1996
       y
       • The visibility of a site is measured by the number
         of other sites pointing to it
       • The luminosity of a site is measured by the number
         of other sites to which it points
        Limitation: failure to capture the relative
         importance of different parents (children) sites
4            R. Akerkar
Early Approaches
    Mark 1988
    • To calculate the score S of a document at vertex v
                               1
            S(v) = s(v) +             Σ           S(w)
                            | ch[v] | w Є |ch(v)|
                 v: a vertex i the h
                             in h hypertext graph G = (V E)
                                                  h      (V,
                 S(v): the global score
                 s(v): the score if the document is isolated
                 ch(v): children of the document at vertex v
    • Limitation:
       - Require G to be a directed acyclic g p (
             q                            y   graph (DAG)
                                                        )
       - If v has a single link to w, S(v) > S(w)
       - If v has a long path to w and s(v) < s(w),
            then S(v) > S (w)
        unreasonable
5              R. Akerkar
Early Approaches
    Marchiori (1997)
        • Hyper information should complement textual
          information to obtain the overall information
           S(v) = s(v) + h(v)
            ( )    ( )    ( )
              - S(v): overall information
              - s(v): textual information
              - h(v): hyper information
                                        r(v, w)
        • h(v) =      Σ             F             S(w)
                      w Є |ch[v]|
              - F a fading constant F Є (0 1)
                F:           constant,      (0,
              - r(v, w): the rank of w after sorting the children
                of v by S(w)
        a remedy of the previous approach (Mark 1988)
6              R. Akerkar
HITS - Kleinberg’s Algorithm
     • HITS – Hypertext Induced Topic Selection
               yp                 p

     • For each vertex v Є V in a subgraph of interest:
                 a(v) - the authority of v
                 h(v) - the hubness of v

     • A site is very a thoritati e if it recei es man
                  er authoritative        receives many
       citations. Citation from important sites weight
       more than citations from less-important sites

     • Hubness shows the importance of a site. A good
       hub is a site that links to many authoritative sites
                                      y

7             R. Akerkar
Authority and Hubness

    2                                                       5


    3                            1     1                    6


4
                                                            7

        a(1) = h(2) + h(3) + h(4)    h(1) = a(5) + a(6) + a(7)

8                   R. Akerkar
Authority d H b
    A th it and Hubness Convergence
                        C     g

       • R
         Recursive d
               i dependency:
                      d

              a(v)  Σ               h(w)
                         w Є pa[v]

              h(v)  Σ w Є ch[v] a(w)

        • Using Linear Algebra, we can prove:

               a(v) and h(v) converge



9          R. Akerkar
HITS Example
  Find a base subgraph:
• Start with a root set R {1, 2, 3, 4}

• {1, 2, 3, 4} - nodes relevant to
               the topic

• Expand the root set R to include
all the children and a fixed
number of parents of nodes in R

 A new set S (base subgraph) 


10                  R. Akerkar
HITS Example
     BaseSubgraph( R d)
                   R,
     1. S  r
     2.   for each v in R
     3.   do S  S U ch[v]
     4.       P  pa[v]
     5.       if |P| > d
               f
     6.       then P  arbitrary subset of P having size d
     7.            SSUP
     8.
     8    return S




11                R. Akerkar
HITS Example
      Hubs and authorities: two n-dimensional a and h
              HubsAuthorities(G)
                                              |V|
               1     1  [1,…,1] Є R
               2     a0  h 0  1
               3     t 1
               4     repeat
               5            for each v in V
               6            do at (v)  Σ                      h      (w)
                                                    w Є pa[v] t -1
               7                    ht (v)  Σ w Є pa[v] a              (w)
               8              a t  at / || at ||                  t -1
               9              h t  ht / || ht ||
               10             t  t+1
               11     until || a t – at -1 || + || h t – ht -1 || < ε
                                          1                  1
               12     return (a t , h t )
12            R. Akerkar
HITS Example Results
             y
     Authority
     Hubness




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Authority and hubness weights
13               R. Akerkar
HITS Improvements
     Brarat and Henzinger (1998)
       • HITS problems
          1) The document can contain many identical links to
             the same document in another host
          2) Links are generated automatically (e.g. messages
             posted on newsgroups)

       • Solutions
          1) Assign weight to identical multiple edges, which
             are inversely proportional to their multiplicity
          2) Prune irrelevant nodes or regulating the influence
             of a node with a relevance weight

14             R. Akerkar
PageRank
 Introduced by Page et al (
                y g            (1998))
   The weight is assigned by the rank of parents




 Difference with HITS
   HITS takes Hubness & Authority weights
   The page rank is proportional to its parents’ rank, but inversely
    proportional to its parents’ outdegree


                                                                15
Matrix Notation




                                     Adjacent M i
                                     Adj      Matrix



                                     A=


  * http://www.kusatro.kyoto-u.com

                                                       16
Matrix Notation
 Matrix Notation
  r=αBr=Mr
    α : eigenvalue
    r : eigenvector of B


       Ax=λx
                              B=
       | A - λI | x = 0

 Finding Pagerank
  to find eigenvector of B with an associated eigenvalue α

                                                  17
Matrix Notation
     PageRank: eigenvector of P relative to max eigenvalue

       B = P D P-1
           D: diagonal matrix of eigenvalues {λ1, … λn}
           P: regular matrix that consists of eigenvectors




        PageRank   r1 =
                                      normalized


18                 R. Akerkar
Matrix Notation




                        • Confirm the result
                         # of inlinks from high ranked page
                         hard to explain about 5&2 6&7
                                               5&2,

                        • Interesting Topic
                          How do you create your
                                homepage highly ranked?
19         R. Akerkar
Markov Chain Notation
      Random surfer model
        Description of a random walk through the Web graph
        Interpreted as a transition matrix with asymptotic probability that a
         surfer is currently browsing that page




                 rt = M rtt-1
                            1
                 M: transition matrix for a first-order Markov chain (stochastic)

        Does it converge to some sensible solution ( too)
                       g                           (as   )
        regardless of the initial ranks ?
20                    R. Akerkar
Problem
      “Rank Sink” Problem
        In general, many Web pages have no inlinks/outlinks
        It results in dangling edges in the graph

       E.g.
         no parent  rank 0
             MT converges to a matrix
              whose last column is all zero

         no children  no solution
             MT converges to zero matrix




21                      R. Akerkar
Modification
      Surfer will restart browsing by p
                                  g y picking a new Web p g at
                                            g           page
       random
         M=(B+E)
           E : escape matrix
           M : stochastic matrix

      S ll problem?
       Still bl
        It is not guaranteed that M is primitive

        If M is stochastic and primitive, PageRank converges to
         corresponding stationary distribution of M

22                      R. Akerkar
PageRank Algorithm




                        * Page et al 1998
                                  al,
23         R. Akerkar
Distribution of the Mixture Model
      The probability distribution that results from combining the
           p         y                                        g
       Markovian random walk distribution & the static rank source
       distribution
             r = εe + (1- ε)x
                       1
                     ε: probability of selecting non-linked page

             PageRank


 Now, transition matrix [εH + (1- ε)M] is primitive and stochastic
 rt converges to the dominant eigenvector
24                    R. Akerkar
Stability
      Whether the link analysis algorithms based on eigenvectors
       are stable in the sense that results don’t change significantly?

      Th connectivity of a portion of the graph is changed
       The              f            f h       h h        d
       arbitrary
        How will it affect the results of algorithms?




25                 R. Akerkar
Stability of HITS
     Ng et al (2001)
     • A bound on the number of hyperlinks k that can added or deleted
     from one page without affecting the authority or hubness weights
     • It is possible to perturb a symmetric matrix by a quantity that grows
     as δ that produces a constant perturbation of the dominant
     eigenvector




                                   δ: eigengap λ1 – λ2
                                        g g p
                                   d: maximum outdegree of G
26                    R. Akerkar
Stability of PageRank

                                         Ng et al (2001)

               V: the set of vertices touched by the perturbation
 The parameter ε of the mixture model has a stabilization role
 If the set of pages affected by the p
                pg              y     perturbation have a small rank, the overall
                                                                    ,
     change will also be small


                                         tighter bound by
                                         Bianchini et al (2001)

              δ(j) >= 2 depends on the edges incident on j
27                    R. Akerkar
SALSA
      SALSA (Lempel, Moran 2001)
        Probabilistic extension of the HITS algorithm
        Random walk is carried out by following hyperlinks both in the
         forward and in the backward direction
      Two separate random walks
        Hub walk
        Authority walk




28               R. Akerkar
Forming a Bipartite Graph in SALSA




29         R. Akerkar
Random Walks
      Hub walk
        Follow a Web link from a page uh to a page wa (a forward link)
         and then
        Immediately traverse a backlink going from wa to vh, where (u w)
                                                                     (u,w)
         Є E and (v,w) Є E
      Authority Walk
                 y
        Follow a Web link from a page w(a) to a page u(h) (a backward
         link) and then
        Immediately traverse a f
                d l             forward link going b k f
                                      dl k         back from vh to wa
         where (u,w) Є E and (v,w) Є E


30                R. Akerkar
Computing Weights
      Hub weight computed from the sum of the product of the
       inverse degree of the in-links and the out-links




31               R. Akerkar
Why We Care
      Lempel and Moran (2001) showed theoretically that SALSA
       weights are more robust that HITS weights in the presence of the
       Tightly Knit Community (TKC) Effect.
        This effect occurs when a small collection of pages (related to a given topic)
         is connected so that every hub links to every authority and includes as a special
         case the mutual reinforcement effect
      The pages in a community connected in this way can be ranked
       highly by HITS, higher than pages in a much larger collection
                 HITS
       where only some hubs link to some authorities
      TKC could be exploited by spammers hoping to increase their
       page weight ( li k f
               i h (e.g. link farms)
                                   )



32                 R. Akerkar
A Similar Approach
      Rafiei and Mendelzon (2000) and Ng et al. (2001)
       propose similar approaches using reset as in
       PageRank
      Unlike PageRank, in this model the surfer will
       follow a forward link on odd steps but a
       backward li k
       b k d link on even steps t
      The stability properties of these ranking
       distributions are similar to those of PageRank (Ng
       et al. 2001)

33             R. Akerkar
Overcoming TKC
      Similarity downweight sequencing and sequential
                y        g            g
      clustering (Roberts and Rosenthal 2003)
       Consider the underlying structure of clusters
                             y g
       Suggest downweight sequencing to avoid the
        Tight Knit Community problem
          g                     yp
       Results indicate approach is effective for few
        tested queries, but still untested on a large scale


34            R. Akerkar
PHITS and More
      PHITS: Cohn and Chang (2000)
        Only the principal eigenvector is extracted using SALSA, so the
         authority along the remaining eigenvectors is completely
         neglected
           Account for more eigenvectors of the co-citation matrix

      See also Lempel, Moran (2003)




35                 R. Akerkar
Limits of Link Analysis
      META tags/ invisible text
        Search engines relying on meta tags in documents are often
         misled (intentionally) by web developers
     P f
      Pay-for-place
                 l
       Search engine bias : organizations pay search engines and page
        rank
       Advertisements: organizations pay high ranking pages for
        advertising space
           W h a primary effect of increased visibility to end users and a secondary
            With           ff     f         d bl              d         d        d
            effect of increased respectability due to relevance to high ranking page



36                 R. Akerkar
Limits of Link Analysis
      Stability
        Adding even a small number of nodes/edges to the graph has a
         significant impact
      T i drift – similar t TKC
       Topic d ift i il to
        A top authority may be a hub of pages on a different topic
         resulting in increased rank of the authority p g
                 g                                  y page
      Content evolution
        Adding/removing links/content can affect the intuitive
         authority rank of a page requiring recalculation of page ranks



37                R. Akerkar
Further Reading
        R. Akerkar d P. Lingras, Building an I ll
         R Ak k and P L             B ld      Intelligent W b Th
                                                          Web: Theory
         and Practice, Jones & Bartlett, 2008
        http://www.jbpub.com/catalog/9780763741372/
            p          j p                g




38               R. Akerkar

More Related Content

What's hot

Text Extraction from Product Images Using State-of-the-Art Deep Learning Tech...
Text Extraction from Product Images Using State-of-the-Art Deep Learning Tech...Text Extraction from Product Images Using State-of-the-Art Deep Learning Tech...
Text Extraction from Product Images Using State-of-the-Art Deep Learning Tech...Databricks
 
Sentiment Analysis using Twitter Data
Sentiment Analysis using Twitter DataSentiment Analysis using Twitter Data
Sentiment Analysis using Twitter DataHari Prasad
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment AnalysisAditya Nag
 
Network centrality measures and their effectiveness
Network centrality measures and their effectivenessNetwork centrality measures and their effectiveness
Network centrality measures and their effectivenessemapesce
 
Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep LearningBelatrix Software
 
Low-cost convolutional neural network for tomato plant diseases classification
Low-cost convolutional neural network for tomato plant diseases classificationLow-cost convolutional neural network for tomato plant diseases classification
Low-cost convolutional neural network for tomato plant diseases classificationIAESIJAI
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentationBushra Jbawi
 
Online social network analysis with machine learning techniques
Online social network analysis with machine learning techniquesOnline social network analysis with machine learning techniques
Online social network analysis with machine learning techniquesHari KC
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 wordsananth
 
Petri Nets: Properties, Analysis and Applications
Petri Nets: Properties, Analysis and ApplicationsPetri Nets: Properties, Analysis and Applications
Petri Nets: Properties, Analysis and ApplicationsDr. Mohamed Torky
 
4.5 mining the worldwideweb
4.5 mining the worldwideweb4.5 mining the worldwideweb
4.5 mining the worldwidewebKrish_ver2
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series dataKrish_ver2
 
Corporate Digital Responsibility Playbook XL
Corporate Digital Responsibility Playbook XLCorporate Digital Responsibility Playbook XL
Corporate Digital Responsibility Playbook XLOliver Merx
 
Beyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific CommunicationBeyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific CommunicationIsabelle Augenstein
 
Summary of Multilingual Natural Language Processing Applications: From Theory...
Summary of Multilingual Natural Language Processing Applications: From Theory...Summary of Multilingual Natural Language Processing Applications: From Theory...
Summary of Multilingual Natural Language Processing Applications: From Theory...iwan_rg
 
How Does Customer Feedback Sentiment Analysis Work in Search Marketing?
How Does Customer Feedback Sentiment Analysis Work in Search Marketing?How Does Customer Feedback Sentiment Analysis Work in Search Marketing?
How Does Customer Feedback Sentiment Analysis Work in Search Marketing?Countants
 

What's hot (20)

Text Extraction from Product Images Using State-of-the-Art Deep Learning Tech...
Text Extraction from Product Images Using State-of-the-Art Deep Learning Tech...Text Extraction from Product Images Using State-of-the-Art Deep Learning Tech...
Text Extraction from Product Images Using State-of-the-Art Deep Learning Tech...
 
Mycin
MycinMycin
Mycin
 
Sentiment Analysis using Twitter Data
Sentiment Analysis using Twitter DataSentiment Analysis using Twitter Data
Sentiment Analysis using Twitter Data
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
AI: Planning and AI
AI: Planning and AIAI: Planning and AI
AI: Planning and AI
 
Network centrality measures and their effectiveness
Network centrality measures and their effectivenessNetwork centrality measures and their effectiveness
Network centrality measures and their effectiveness
 
Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep Learning
 
Low-cost convolutional neural network for tomato plant diseases classification
Low-cost convolutional neural network for tomato plant diseases classificationLow-cost convolutional neural network for tomato plant diseases classification
Low-cost convolutional neural network for tomato plant diseases classification
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
 
Online social network analysis with machine learning techniques
Online social network analysis with machine learning techniquesOnline social network analysis with machine learning techniques
Online social network analysis with machine learning techniques
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
 
Petri Nets: Properties, Analysis and Applications
Petri Nets: Properties, Analysis and ApplicationsPetri Nets: Properties, Analysis and Applications
Petri Nets: Properties, Analysis and Applications
 
4.5 mining the worldwideweb
4.5 mining the worldwideweb4.5 mining the worldwideweb
4.5 mining the worldwideweb
 
Web Usage Pattern
Web Usage PatternWeb Usage Pattern
Web Usage Pattern
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series data
 
CRO
CRO CRO
CRO
 
Corporate Digital Responsibility Playbook XL
Corporate Digital Responsibility Playbook XLCorporate Digital Responsibility Playbook XL
Corporate Digital Responsibility Playbook XL
 
Beyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific CommunicationBeyond Fact Checking — Modelling Information Change in Scientific Communication
Beyond Fact Checking — Modelling Information Change in Scientific Communication
 
Summary of Multilingual Natural Language Processing Applications: From Theory...
Summary of Multilingual Natural Language Processing Applications: From Theory...Summary of Multilingual Natural Language Processing Applications: From Theory...
Summary of Multilingual Natural Language Processing Applications: From Theory...
 
How Does Customer Feedback Sentiment Analysis Work in Search Marketing?
How Does Customer Feedback Sentiment Analysis Work in Search Marketing?How Does Customer Feedback Sentiment Analysis Work in Search Marketing?
How Does Customer Feedback Sentiment Analysis Work in Search Marketing?
 

More from R A Akerkar

Rajendraakerkar lemoproject
Rajendraakerkar lemoprojectRajendraakerkar lemoproject
Rajendraakerkar lemoprojectR A Akerkar
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaR A Akerkar
 
Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?R A Akerkar
 
Big data in Business Innovation
Big data in Business Innovation   Big data in Business Innovation
Big data in Business Innovation R A Akerkar
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?R A Akerkar
 
Connecting and Exploiting Big Data
Connecting and Exploiting Big DataConnecting and Exploiting Big Data
Connecting and Exploiting Big DataR A Akerkar
 
Linked open data
Linked open dataLinked open data
Linked open dataR A Akerkar
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extractionR A Akerkar
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data setsR A Akerkar
 
Description logics
Description logicsDescription logics
Description logicsR A Akerkar
 
artificial intelligence
artificial intelligenceartificial intelligence
artificial intelligenceR A Akerkar
 
Case Based Reasoning
Case Based ReasoningCase Based Reasoning
Case Based ReasoningR A Akerkar
 
Semantic Markup
Semantic Markup Semantic Markup
Semantic Markup R A Akerkar
 
Intelligent natural language system
Intelligent natural language systemIntelligent natural language system
Intelligent natural language systemR A Akerkar
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization SystemsR A Akerkar
 
Rational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignRational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignR A Akerkar
 
Unified Modelling Language
Unified Modelling LanguageUnified Modelling Language
Unified Modelling LanguageR A Akerkar
 
Statistical Preliminaries
Statistical PreliminariesStatistical Preliminaries
Statistical PreliminariesR A Akerkar
 

More from R A Akerkar (20)

Rajendraakerkar lemoproject
Rajendraakerkar lemoprojectRajendraakerkar lemoproject
Rajendraakerkar lemoproject
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
 
Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?
 
Big data in Business Innovation
Big data in Business Innovation   Big data in Business Innovation
Big data in Business Innovation
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?
 
Connecting and Exploiting Big Data
Connecting and Exploiting Big DataConnecting and Exploiting Big Data
Connecting and Exploiting Big Data
 
Linked open data
Linked open dataLinked open data
Linked open data
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extraction
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data sets
 
Description logics
Description logicsDescription logics
Description logics
 
Data Mining
Data MiningData Mining
Data Mining
 
artificial intelligence
artificial intelligenceartificial intelligence
artificial intelligence
 
Case Based Reasoning
Case Based ReasoningCase Based Reasoning
Case Based Reasoning
 
Semantic Markup
Semantic Markup Semantic Markup
Semantic Markup
 
Intelligent natural language system
Intelligent natural language systemIntelligent natural language system
Intelligent natural language system
 
Data mining
Data miningData mining
Data mining
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization Systems
 
Rational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignRational Unified Process for User Interface Design
Rational Unified Process for User Interface Design
 
Unified Modelling Language
Unified Modelling LanguageUnified Modelling Language
Unified Modelling Language
 
Statistical Preliminaries
Statistical PreliminariesStatistical Preliminaries
Statistical Preliminaries
 

Recently uploaded

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Recently uploaded (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Link analysis

  • 1. Link Analysis Rajendra Akerkar Vestlandsforsking, Norway
  • 2. Objectives  To review common approaches to l k analysis h link l  To calculate the popularity of a site based on link analysis  To model human judgments indirectly 2 R. Akerkar
  • 3. Outline 1. 1 Early Approaches to Link Analysis 2. Hubs and Authorities: HITS 3. Page Rank 4. Stability 5. Probabilistic Link Analysis 6. Limitation of Link Analysis 3 R. Akerkar
  • 4. Early Approaches Basic Assumptions • Hyperlinks contain information about the human judgment of a site • The more incoming links to a site the more it is site, judged important Bray 1996 y • The visibility of a site is measured by the number of other sites pointing to it • The luminosity of a site is measured by the number of other sites to which it points  Limitation: failure to capture the relative importance of different parents (children) sites 4 R. Akerkar
  • 5. Early Approaches Mark 1988 • To calculate the score S of a document at vertex v 1 S(v) = s(v) + Σ S(w) | ch[v] | w Є |ch(v)| v: a vertex i the h in h hypertext graph G = (V E) h (V, S(v): the global score s(v): the score if the document is isolated ch(v): children of the document at vertex v • Limitation: - Require G to be a directed acyclic g p ( q y graph (DAG) ) - If v has a single link to w, S(v) > S(w) - If v has a long path to w and s(v) < s(w), then S(v) > S (w)  unreasonable 5 R. Akerkar
  • 6. Early Approaches Marchiori (1997) • Hyper information should complement textual information to obtain the overall information S(v) = s(v) + h(v) ( ) ( ) ( ) - S(v): overall information - s(v): textual information - h(v): hyper information r(v, w) • h(v) = Σ F S(w) w Є |ch[v]| - F a fading constant F Є (0 1) F: constant, (0, - r(v, w): the rank of w after sorting the children of v by S(w)  a remedy of the previous approach (Mark 1988) 6 R. Akerkar
  • 7. HITS - Kleinberg’s Algorithm • HITS – Hypertext Induced Topic Selection yp p • For each vertex v Є V in a subgraph of interest: a(v) - the authority of v h(v) - the hubness of v • A site is very a thoritati e if it recei es man er authoritative receives many citations. Citation from important sites weight more than citations from less-important sites • Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites y 7 R. Akerkar
  • 8. Authority and Hubness 2 5 3 1 1 6 4 7 a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7) 8 R. Akerkar
  • 9. Authority d H b A th it and Hubness Convergence C g • R Recursive d i dependency: d a(v)  Σ h(w) w Є pa[v] h(v)  Σ w Є ch[v] a(w) • Using Linear Algebra, we can prove: a(v) and h(v) converge 9 R. Akerkar
  • 10. HITS Example Find a base subgraph: • Start with a root set R {1, 2, 3, 4} • {1, 2, 3, 4} - nodes relevant to the topic • Expand the root set R to include all the children and a fixed number of parents of nodes in R  A new set S (base subgraph)  10 R. Akerkar
  • 11. HITS Example BaseSubgraph( R d) R, 1. S  r 2. for each v in R 3. do S  S U ch[v] 4. P  pa[v] 5. if |P| > d f 6. then P  arbitrary subset of P having size d 7. SSUP 8. 8 return S 11 R. Akerkar
  • 12. HITS Example Hubs and authorities: two n-dimensional a and h HubsAuthorities(G) |V| 1 1  [1,…,1] Є R 2 a0  h 0  1 3 t 1 4 repeat 5 for each v in V 6 do at (v)  Σ h (w) w Є pa[v] t -1 7 ht (v)  Σ w Є pa[v] a (w) 8 a t  at / || at || t -1 9 h t  ht / || ht || 10 t  t+1 11 until || a t – at -1 || + || h t – ht -1 || < ε 1 1 12 return (a t , h t ) 12 R. Akerkar
  • 13. HITS Example Results y Authority Hubness 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Authority and hubness weights 13 R. Akerkar
  • 14. HITS Improvements Brarat and Henzinger (1998) • HITS problems 1) The document can contain many identical links to the same document in another host 2) Links are generated automatically (e.g. messages posted on newsgroups) • Solutions 1) Assign weight to identical multiple edges, which are inversely proportional to their multiplicity 2) Prune irrelevant nodes or regulating the influence of a node with a relevance weight 14 R. Akerkar
  • 15. PageRank  Introduced by Page et al ( y g (1998))  The weight is assigned by the rank of parents  Difference with HITS  HITS takes Hubness & Authority weights  The page rank is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree 15
  • 16. Matrix Notation Adjacent M i Adj Matrix A= * http://www.kusatro.kyoto-u.com 16
  • 17. Matrix Notation  Matrix Notation r=αBr=Mr α : eigenvalue r : eigenvector of B Ax=λx B= | A - λI | x = 0 Finding Pagerank  to find eigenvector of B with an associated eigenvalue α 17
  • 18. Matrix Notation PageRank: eigenvector of P relative to max eigenvalue B = P D P-1 D: diagonal matrix of eigenvalues {λ1, … λn} P: regular matrix that consists of eigenvectors PageRank r1 = normalized 18 R. Akerkar
  • 19. Matrix Notation • Confirm the result # of inlinks from high ranked page hard to explain about 5&2 6&7 5&2, • Interesting Topic How do you create your homepage highly ranked? 19 R. Akerkar
  • 20. Markov Chain Notation  Random surfer model  Description of a random walk through the Web graph  Interpreted as a transition matrix with asymptotic probability that a surfer is currently browsing that page rt = M rtt-1 1 M: transition matrix for a first-order Markov chain (stochastic) Does it converge to some sensible solution ( too) g (as ) regardless of the initial ranks ? 20 R. Akerkar
  • 21. Problem  “Rank Sink” Problem  In general, many Web pages have no inlinks/outlinks  It results in dangling edges in the graph E.g. no parent  rank 0 MT converges to a matrix whose last column is all zero no children  no solution MT converges to zero matrix 21 R. Akerkar
  • 22. Modification  Surfer will restart browsing by p g y picking a new Web p g at g page random M=(B+E) E : escape matrix M : stochastic matrix  S ll problem? Still bl  It is not guaranteed that M is primitive  If M is stochastic and primitive, PageRank converges to corresponding stationary distribution of M 22 R. Akerkar
  • 23. PageRank Algorithm * Page et al 1998 al, 23 R. Akerkar
  • 24. Distribution of the Mixture Model  The probability distribution that results from combining the p y g Markovian random walk distribution & the static rank source distribution r = εe + (1- ε)x 1 ε: probability of selecting non-linked page PageRank Now, transition matrix [εH + (1- ε)M] is primitive and stochastic rt converges to the dominant eigenvector 24 R. Akerkar
  • 25. Stability  Whether the link analysis algorithms based on eigenvectors are stable in the sense that results don’t change significantly?  Th connectivity of a portion of the graph is changed The f f h h h d arbitrary  How will it affect the results of algorithms? 25 R. Akerkar
  • 26. Stability of HITS Ng et al (2001) • A bound on the number of hyperlinks k that can added or deleted from one page without affecting the authority or hubness weights • It is possible to perturb a symmetric matrix by a quantity that grows as δ that produces a constant perturbation of the dominant eigenvector δ: eigengap λ1 – λ2 g g p d: maximum outdegree of G 26 R. Akerkar
  • 27. Stability of PageRank Ng et al (2001) V: the set of vertices touched by the perturbation  The parameter ε of the mixture model has a stabilization role  If the set of pages affected by the p pg y perturbation have a small rank, the overall , change will also be small tighter bound by Bianchini et al (2001) δ(j) >= 2 depends on the edges incident on j 27 R. Akerkar
  • 28. SALSA  SALSA (Lempel, Moran 2001)  Probabilistic extension of the HITS algorithm  Random walk is carried out by following hyperlinks both in the forward and in the backward direction  Two separate random walks  Hub walk  Authority walk 28 R. Akerkar
  • 29. Forming a Bipartite Graph in SALSA 29 R. Akerkar
  • 30. Random Walks  Hub walk  Follow a Web link from a page uh to a page wa (a forward link) and then  Immediately traverse a backlink going from wa to vh, where (u w) (u,w) Є E and (v,w) Є E  Authority Walk y  Follow a Web link from a page w(a) to a page u(h) (a backward link) and then  Immediately traverse a f d l forward link going b k f dl k back from vh to wa where (u,w) Є E and (v,w) Є E 30 R. Akerkar
  • 31. Computing Weights  Hub weight computed from the sum of the product of the inverse degree of the in-links and the out-links 31 R. Akerkar
  • 32. Why We Care  Lempel and Moran (2001) showed theoretically that SALSA weights are more robust that HITS weights in the presence of the Tightly Knit Community (TKC) Effect.  This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority and includes as a special case the mutual reinforcement effect  The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection HITS where only some hubs link to some authorities  TKC could be exploited by spammers hoping to increase their page weight ( li k f i h (e.g. link farms) ) 32 R. Akerkar
  • 33. A Similar Approach  Rafiei and Mendelzon (2000) and Ng et al. (2001) propose similar approaches using reset as in PageRank  Unlike PageRank, in this model the surfer will follow a forward link on odd steps but a backward li k b k d link on even steps t  The stability properties of these ranking distributions are similar to those of PageRank (Ng et al. 2001) 33 R. Akerkar
  • 34. Overcoming TKC  Similarity downweight sequencing and sequential y g g clustering (Roberts and Rosenthal 2003)  Consider the underlying structure of clusters y g  Suggest downweight sequencing to avoid the Tight Knit Community problem g yp  Results indicate approach is effective for few tested queries, but still untested on a large scale 34 R. Akerkar
  • 35. PHITS and More  PHITS: Cohn and Chang (2000)  Only the principal eigenvector is extracted using SALSA, so the authority along the remaining eigenvectors is completely neglected  Account for more eigenvectors of the co-citation matrix  See also Lempel, Moran (2003) 35 R. Akerkar
  • 36. Limits of Link Analysis  META tags/ invisible text  Search engines relying on meta tags in documents are often misled (intentionally) by web developers P f Pay-for-place l  Search engine bias : organizations pay search engines and page rank  Advertisements: organizations pay high ranking pages for advertising space  W h a primary effect of increased visibility to end users and a secondary With ff f d bl d d d effect of increased respectability due to relevance to high ranking page 36 R. Akerkar
  • 37. Limits of Link Analysis  Stability  Adding even a small number of nodes/edges to the graph has a significant impact  T i drift – similar t TKC Topic d ift i il to  A top authority may be a hub of pages on a different topic resulting in increased rank of the authority p g g y page  Content evolution  Adding/removing links/content can affect the intuitive authority rank of a page requiring recalculation of page ranks 37 R. Akerkar
  • 38. Further Reading  R. Akerkar d P. Lingras, Building an I ll R Ak k and P L B ld Intelligent W b Th Web: Theory and Practice, Jones & Bartlett, 2008  http://www.jbpub.com/catalog/9780763741372/ p j p g 38 R. Akerkar