Co-occurrence and Meaning
                               Co-occurrence graphs
                       Interpretation of Co-citations
                                     Topical Anchors
                                          References




    Semantics hidden within co-occurrence patterns
                 A bottom-up approach to the Semantic Web?


                                         Srinath Srinivasa

                                              IIIT Bangalore
                                               sri@iiitb.ac.in




IEEE Computer Society talk. Nov 20 2009. © Srinath Srinivasa, IIIT-Bangalore


Outline



  1   Co-occurrence and Meaning


  2   Co-occurrence graphs


  3   Interpretation of Co-citations


  4   Topical Anchors






Conventional WebIR and co-occurrence



         Lexical feature extraction: bag-of-words model
         Document vectorization
         Implicit assumption that dimensions are independent
         Vector space reduction and spectral analyses for identifying
         hidden semantics (e.g., LSA, SVD, clustering)
  In human languages, lexical terms are not independent of one
  another; moreover, important semantic structures are inherent in
  the way terms co-occur.
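
  As a minimal sketch of the bag-of-words pipeline listed above (the tokenizer, the sample sentences, and the function names are illustrative assumptions, not taken from the talk), each document becomes a term-count vector in which every distinct term is its own dimension:

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on non-letters, and count each term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample documents (not from the talk).
doc1 = bag_of_words("the car runs on diesel")
doc2 = bag_of_words("the automobile runs on diesel")
doc3 = bag_of_words("the car")

sim_12 = cosine(doc1, doc2)  # shares every term except car/automobile
sim_13 = cosine(doc1, doc3)  # shares only "the" and "car"

# "car" and "automobile" are orthogonal dimensions in this model,
# so their relatedness contributes nothing to the similarity.
sim_synonyms = cosine(bag_of_words("car"), bag_of_words("automobile"))
```

  Because the model treats dimensions as independent, the last similarity is exactly zero even though the two terms are near-synonyms; this is the limitation that the co-occurrence view is meant to address.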





Motivational Problems


  Some motivational problems to show limitations of purely lexical
  approaches to IR:
  The topical anchor problem
  “If ever a player has overshadowed Sachin Tendulkar for sheer class of
  batsmanship, it is V V S Laxman. After a record 353-run fourth-wicket
  partnership in the 2004 Sydney Test when Laxman hit 30 fours in his 178
  to Tendulkar’s 33 in his unbeaten 241, the master put the artistry of V V
  S in perspective.”

  What is the best topic of this paragraph: Sachin Tendulkar, V V S
  Laxman, Sydney, Australia, Cricket, or Test Match?




Motivational Problems

  The semantic attributes problem
  Given that a user has searched for the term “Malmö”, which of the following
  keywords can be termed “attributes” that enhance the meaning represented
  by Malmö?

         Driving
         History
         Mileage
         Weather
         Symptoms
         Elephant
         LaTeX beamer
         Infringement




Motivational Problems


  The topical marker problem

  The US Federal Aviation Regulations Sec 380.12 states that:
         The charter operator may not cancel a charter for any reason (including insufficient participation), except
         for circumstances that make it physically impossible to perform the charter trip, less than 10 days before
         the scheduled date of departure of the outbound trip.
         If the charter operator cancels 10 or more days before the scheduled date of departure, the operator must
         so notify each participant in writing within 7 days after the cancellation but in any event not less than 10
         days before the scheduled departure date of the outbound trip. If a charter is canceled less than 10 days
         before scheduled departure (i.e., for circumstances that make it physically impossible to perform the
         charter trip), the operator must get the message to each participant as soon as possible.


  If a user who has booked a ticket with a charter operator finds that her
  flight has been cancelled suddenly without notice and wants to confront the
  operator, what should she search for: charter operator, FAR, cancellation,
  scheduled trip, Sec 380, operator, notification, ...?





Motivational Problems
  The theme problem:
  Article 1

  A US Airways Airbus A320 suffered a bird hit immediately after take-off from New York’s La Guardia airport and
  was forced to land on the Hudson river. All 150 passengers and 5 crew members are reported to be safe with only
  minor injuries.


  Article 2

  La Guardia International Airport is conveniently located to serve the citizens of both New York and New Jersey.
  Airlines that operate from here include: US Airways, Delta, Continental and Virgin. Its MRO routinely serves a
  number of aircraft types including Boeing 73x series and the Airbus A320 and A330 series. La Guardia is easily
  reachable from New Jersey through the Lincoln tunnel that runs under the Hudson river.


  Article 3

  Pilot Steve Bolle of a light aircraft in northern Australia was forced to make an emergency landing in water after
  suffering engine trouble on take-off. He landed the Piper Chieftain plane in shallow waters after realising he would
  not make it back to the airport. Mr Bolle and his five passengers were able to wade safely to shore.


  Which of the articles above are similar to one another? (Sources: Wikipedia and BBC)




Co-occurrence and Meaning


  Hebbian learning

         Co-occurrence plays a central role in the Hebbian theory of the semantic organization of the human mind,
         which states that synaptic plasticity between neurons is determined by repeated and persistent
         stimulation of the pre- and post-synaptic cells [2].
         This is often summarized as: “Cells that fire together, wire together.”


  Co-occurrence and the language instinct

         Language structures such as pluralization are often learnt by analyzing co-occurrence patterns. An
         interesting example is the “wug” test (cf. [5]):
         That is a pig; these are pigs. That is a dog; these are dogs. That is a cat; these are cats. That is a wug;
         these are ____.
         The use of co-occurrence is even more apparent in this example, which leads to confusion (even if only for a
         moment):
         The plural of radius is radii; the plural of thesis is theses; the plural of bus is buses. The plural of lotus is
         lotii? lotes? lotuses?






Co-occurrence and meaning
  Meaning is usage
  The analytic-philosophy view that “meaning is usage” [1] can be explained by
  modelling usage through co-occurrence analysis.

  Consider the following paragraphs:

  Every day, I go to work in my pqer. My pqer runs on diesel and gives one of the
  best mileage for pqers in its category. My pqer can seat five people and is a
  good candidate for pqer-pooling.


  On December 26 2004, a massive earthquake measuring 9.1 jolted Java. This
  earthquake triggered a huge tsunami that has been the deadliest in history. We
  have developed an applet to simulate the path taken by the tsunami. You can
  run this applet in any browser that has Java enabled.

  In the first paragraph, the meaning of the word “pqer” and, in the second paragraph, the word sense of the term
  “Java” are both resolved by looking at the other terms that co-occur with them.
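
  The “Java” example can be sketched as a simple overlap count between the terms in a context and a hand-written list of terms assumed to accompany each sense. This is a simplified heuristic in the spirit of the Lesk algorithm, not a method described in the talk; the sense labels and signature terms are hypothetical.

```python
import re

# Hypothetical sense signatures: terms assumed to co-occur with each sense of "Java".
SENSES = {
    "island": {"earthquake", "tsunami", "jolted", "indonesia", "volcano"},
    "programming language": {"applet", "browser", "enabled", "run", "code"},
}


def tokenize(text):
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def resolve_sense(context, senses=SENSES):
    """Return the sense whose signature shares the most terms with the context."""
    terms = tokenize(context)
    return max(senses, key=lambda s: len(senses[s] & terms))


s1 = "On December 26 2004, a massive earthquake measuring 9.1 jolted Java."
s2 = "You can run this applet in any browser that has Java enabled."

sense1 = resolve_sense(s1)  # overlaps with "earthquake" and "jolted"
sense2 = resolve_sense(s2)  # overlaps with "run", "applet", "browser", "enabled"
```

  In practice the signatures would be derived from co-occurrence statistics over a corpus rather than written by hand.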


Capturing co-occurrence
         We are given a document corpus that is represented as a set
         of “contexts”:
                            $C = \{C_1, C_2, \ldots, C_n\}$
         Depending on the specific problem, a context may take
         various forms, such as a sentence, paragraph, or document.
         Two entities $e_i$ and $e_j$ are said to co-occur (denoted as
         $e_i \leftrightarrow e_j$) if there is some context $C_k \in C$ such that $e_i, e_j \in C_k$.
         The support for a co-occurring pair $e_i \leftrightarrow e_j$ is the probability
         of finding this co-occurrence in any given context in the
         corpus. In other words, the support is the joint probability
         $P(e_i, e_j)$.
  Note that co-occurrence is an n-ary relation. For simplicity, however, we
  focus on pairwise co-occurrences and derive higher-order semantics when
  required.
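
  A minimal sketch of how pairwise support might be estimated, assuming each context is already reduced to a list of entities and that the joint probability is approximated by relative frequency (the fraction of contexts containing both entities). The function name and the sample contexts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_support(contexts):
    """Map each unordered entity pair to the fraction of contexts containing both."""
    counts = Counter()
    for context in contexts:
        # Each pair is counted once per context, regardless of repetitions.
        for pair in combinations(sorted(set(context)), 2):
            counts[pair] += 1
    n = len(contexts)
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical contexts (e.g. sentences reduced to their entities).
contexts = [
    ["tendulkar", "laxman", "sydney"],
    ["tendulkar", "laxman"],
    ["sydney", "australia"],
    ["tendulkar", "cricket"],
]
support = pairwise_support(contexts)
```

  Pairs are stored as alphabetically sorted tuples, so ("laxman", "tendulkar") and ("tendulkar", "laxman") refer to the same co-occurrence.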


Co-occurrence graphs

  Co-occurrence graph
  A co-occurrence graph is a weighted, undirected graph $G = (E, \leftrightarrow, w)$, where
  $E$ is a set of “entities”, ${\leftrightarrow} \subseteq E \times E$ is a set of co-occurrences, and
  $w : {\leftrightarrow} \to \mathbb{R}$ indicates the support for each co-occurrence.
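
  One possible in-memory representation of such a graph is a symmetric adjacency map keyed by entity. The sketch below assumes the support values have already been computed; the entity names and weights are hypothetical.

```python
from collections import defaultdict


def build_graph(support):
    """Build a symmetric adjacency map {entity: {neighbour: weight}}."""
    graph = defaultdict(dict)
    for (a, b), w in support.items():
        graph[a][b] = w
        graph[b][a] = w  # undirected: store the edge in both directions
    return dict(graph)


# Hypothetical support values for three co-occurring pairs.
support = {
    ("laxman", "tendulkar"): 0.5,
    ("sydney", "tendulkar"): 0.25,
    ("australia", "sydney"): 0.25,
}
graph = build_graph(support)
neighbours = sorted(graph["tendulkar"])
```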


         Co-occurrence versus n-partite graphs

  Semantic co-occurrence graphs
  A semantic co-occurrence graph is a co-occurrence graph that is augmented
  with a concept hierarchy. A concept hierarchy is defined by one or more partial
  orders of the form ${\preceq} \subseteq E \times E$, representing relationships like is-a and is-in,
  that are reflexive, anti-symmetric and transitive.
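
  The three properties can be checked mechanically. The sketch below assumes a relation is stored as a set of (a, b) pairs meaning “a is-a b”; the helper names and the sample entities are hypothetical.

```python
from itertools import product


def is_partial_order(entities, relation):
    """Check that `relation` is reflexive, anti-symmetric and transitive on `entities`."""
    reflexive = all((e, e) in relation for e in entities)
    antisymmetric = all(a == b or (b, a) not in relation for (a, b) in relation)
    transitive = all(
        (a, d) in relation
        for (a, b), (c, d) in product(relation, repeat=2)
        if b == c
    )
    return reflexive and antisymmetric and transitive


def reflexive_transitive_closure(entities, pairs):
    """Close a set of direct links under reflexivity and transitivity."""
    closure = set(pairs) | {(e, e) for e in entities}
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


# Hypothetical is-a links: only the direct links are written down.
entities = {"tendulkar", "cricketer", "person"}
direct = {("tendulkar", "cricketer"), ("cricketer", "person")}
is_a = reflexive_transitive_closure(entities, direct)
```

  The direct links alone are not a partial order (they are neither reflexive nor transitive); the closure is.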





Co-occurrence graph
  Example: [figure: co-occurrence graph]

  Concept hierarchy construction
         1   Start with a base ontology
         2   Use co-occurrence patterns to guess conceptual relationships
             across terms
         3   Use the concept hierarchy to identify deeper co-occurrence patterns
         4   Repeat from step 2 in a semi-automated fashion until the
             algorithm stabilizes


Co-occurrence graphs


  Characteristics of co-occurrence graphs
         Triadic closure (highly clustered)
         Either disconnected components or a single component of very small diameter
         The co-occurrence graph of all noun phrases in Wikipedia has a diameter of 4
         Co-occurrence support for entity pairs follows a power law
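A minimal sketch of how these characteristics could be measured on a toy corpus. The function names and the three sample contexts are illustrative assumptions, not part of the talk. Support is taken here as the number of contexts in which a pair of terms co-occurs, and triadic closure is measured with the local clustering coefficient.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_support(contexts):
    """Count, for each unordered term pair, the contexts that contain both terms."""
    support = Counter()
    for terms in contexts:
        for a, b in combinations(sorted(set(terms)), 2):
            support[(a, b)] += 1
    return support


def adjacency(support):
    """Build an undirected adjacency map from the supported pairs."""
    adj = {}
    for a, b in support:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical contexts (for example, sentences or documents).
contexts = [
    ["tennis", "federer", "nadal"],
    ["tennis", "federer", "wimbledon"],
    ["football", "fifa"],
]
support = cooccurrence_support(contexts)
adj = adjacency(support)
print(support[("federer", "tennis")], local_clustering(adj, "tennis"))
```

On real data, a histogram of the values in `support` is what one would inspect for the power-law behaviour.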






Co-citation




         Co-citation and bibliographic coupling are important metrics in several datasets such as scientific literature, web pages, wikis, and tagging systems like Delicious
         Co-citation of a pair of documents corresponds to the co-occurrence of references to them (e.g., URLs) in a context
         Pair-wise co-citation graphs have the same properties as co-occurrence graphs
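As an illustration, co-citation counts and bibliographic coupling can be computed from a map of citing documents to the documents they cite. This is a sketch under assumptions: the `outlinks` structure, the function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(outlinks):
    """For each unordered pair of cited documents, count the documents citing both."""
    counts = Counter()
    for cited in outlinks.values():
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


def bibliographic_coupling(outlinks, doc_a, doc_b):
    """Number of references shared by two citing documents."""
    return len(set(outlinks.get(doc_a, ())) & set(outlinks.get(doc_b, ())))


# Hypothetical citing pages p1..p3 and cited pages A..C.
outlinks = {"p1": ["A", "B", "C"], "p2": ["A", "B"], "p3": ["B", "C"]}
counts = cocitation_counts(outlinks)
print(counts[("A", "B")], bibliographic_coupling(outlinks, "p1", "p2"))
```

Each citing document plays the role of a context, so the co-citation counts are co-occurrence supports over cited references.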



  Co-citation Patterns
  Hyperlink distance across pairs of highly co-cited pages [8]




  Figure (left): Hyperlink distance across pairs of highly co-cited Web pages. Plot of F against hyperlink distance k (bins 1 to 7, kmax, >kmax; F axis from 0 to 300).
  Figure (right): Hyperlink distance across pairs of highly co-cited Wikipedia pages. Plot of F against hyperlink distance k (bins 1 to 7, kmax, >kmax; F axis from 0 to 12000).


Co-citation Patterns
Hyperlink distance across pairs of highly co-cited pages

   Endorsement of a citation
           Page A endorses the content of page B
           Users reading page A traverse this link and find page B useful too
           Users create their own pages citing both A and B
           If A has several outgoing links, and only some pairs of outlinks are co-cited, then co-citation can be seen as an endorsement of the citation

   Topical aggregation
           Document A represents content about a “higher-level” topic in terms of is-a or is-in relationships, and links to (hence co-cites) several pages on “lower-level” topics
           Pages on the “lower-level” topics usually cite back the page on the “higher-level” topic, hence giving a citation distance of 2 among themselves

    Nepotistic co-citations
    Another major source of co-citation (primarily on web pages) is “nepotistic links” in the form of navigational tabs
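The hyperlink distance between two co-cited pages can be measured with a breadth-first search over the link graph. The sketch below assumes directed links stored as a hypothetical `outlinks` map; the page names are made up to mimic the topical aggregation case.

```python
from collections import deque


def hyperlink_distance(outlinks, source, target):
    """Length of the shortest directed link path from source to target, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in outlinks.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# A "higher-level" hub page links to three "lower-level" pages;
# two of them link back to the hub.
outlinks = {
    "hub": ["leaf1", "leaf2", "leaf3"],
    "leaf1": ["hub"],
    "leaf2": ["hub"],
    "leaf3": [],
}
print(hyperlink_distance(outlinks, "leaf1", "leaf2"))
```

Pages that cite back the hub reach their co-cited siblings in two hops, which is the distance-2 pattern described for topical aggregation.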


Co-citation graph of a web crawl
Pairs of pages with at least 100 non-nepotistic co-citations






Co-citation graph of a web crawl


         The co-citation graph depicts pairs of pages with at least 100 non-nepotistic co-citations
         In addition to being made of disconnected components, the graph shows various recurring structural motifs such as:
                 Star
                 Clique
                 Clique chain
                 Dumb-bell
         Interpretations of these motifs, along with examples, are given in Mutalikdesai and Srinivasa (2009) [4]
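A rough sketch of how such a graph could be thresholded and split into components. The star and clique tests are simple degree-based checks written for this example; they are assumptions, not the motif definitions used in [4]. The counts are made-up data.

```python
def thresholded_components(counts, min_support):
    """Keep pairs with at least `min_support` co-citations; return components and adjacency."""
    adj = {}
    for (a, b), c in counts.items():
        if c >= min_support:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(comp)
    return components, adj


def is_clique(component, adj):
    """Every node is adjacent to every other node of the component."""
    return all(len(adj[n] & component) == len(component) - 1 for n in component)


def is_star(component, adj):
    """One centre adjacent to all other nodes, which have no other edges."""
    k = len(component)
    degrees = sorted(len(adj[n]) for n in component)
    return k >= 3 and degrees == [1] * (k - 1) + [k - 1]


# Hypothetical co-citation counts between pages.
counts = {
    ("a", "b"): 150, ("a", "c"): 120, ("a", "d"): 101,
    ("x", "y"): 300, ("y", "z"): 250, ("x", "z"): 100,
    ("p", "q"): 5,
}
components, adj = thresholded_components(counts, min_support=100)
for comp in components:
    print(sorted(comp), "star" if is_star(comp, adj) else "", "clique" if is_clique(comp, adj) else "")
```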





Endorsed hyperlink graph (EHG)

  On the web, a co-citation usually implies a citation. Hence the EHG
  is essentially a directed version of the co-citation graph. Some
  EHG components are depicted below:

  Figure: EHG clique chain
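One possible reading of this construction, as a sketch: a hyperlink from A to B is kept in the EHG when the pair (A, B) is also co-cited by other pages at least `min_support` times. The threshold parameter, the function name and the sample link data are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def endorsed_hyperlink_graph(outlinks, min_support=1):
    """Keep the link src -> dst only if src and dst are co-cited often enough."""
    counts = Counter()
    for cited in outlinks.values():
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    ehg = {}
    for src, targets in outlinks.items():
        for dst in set(targets):
            pair = tuple(sorted((src, dst)))
            if src != dst and counts[pair] >= min_support:
                ehg.setdefault(src, set()).add(dst)
    return ehg


# Page A links to B and C; user pages u1..u3 cite A together with B or C.
outlinks = {
    "A": ["B", "C"],
    "u1": ["A", "B"],
    "u2": ["A", "B"],
    "u3": ["A", "C"],
}
ehg = endorsed_hyperlink_graph(outlinks, min_support=2)
print(ehg)
```

With a threshold of 2, only the link from A to B survives, because A and C are co-cited just once.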




Endorsed citation graph (ECG) for scientific literature
ECG of citation info obtained from CiteSeer






Endorsed citation graph

         The ECG over scientific literature data (from CiteSeer) shows a similar componentization of the graph, except that the ECG has one giant component
         Citation in scientific literature has some subtle differences from hyperlink citation
         Scientific literature citations always point into the past
         Very rarely (if at all) do scientific literature citations form cyclic structures
         The ECG consists mostly of weakly connected directed components, while the EHG may contain strongly connected components




ERank
Importance of a page within an EHG




          ERank is an authority score of a page within an EHG (ECG) component
          It depicts the reachability of the page within the component
          ERank scores in a component were shown to be uncorrelated with the PageRank scores of the pages of that component






EndorSeer




         A Firefox plugin for augmented browsing of CiteSeer
         Currently shows endorsed citations from among the list of citations from any paper
         Currently underway: showing the ECG component and ECG neighbourhood of a paper






Topical Anchors [6, 7]
Motivation




   Example: “Will my oral insulin drugs, along with my hypertension
   and high blood glucose, have any side effects on the health of my
   pancreas?”
          Can a machine detect diabetes as the context?
          Another example: a document containing the words Andy Roddick, Roger Federer and Rafael Nadal
          How likely is it that the word Tennis will be mentioned (semantically) when discussing these players?






Co-occurrence context


         Given a set of query terms, the co-occurrence context is defined as the subgraph formed by the query terms and the set of terms that co-occur with at least one of the query terms

  Conjecture: The topical anchor of a set of terms is a highly authoritative term
  that lies within the co-occurrence context of the query terms
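The definition above can be sketched as a subgraph extraction. The weighted adjacency map `graph` (term to neighbour to co-occurrence weight), the function name and the sample terms are assumptions for illustration.

```python
def cooccurrence_context(graph, query_terms):
    """Subgraph induced by the query terms and every term co-occurring with one of them."""
    nodes = set(query_terms)
    for q in query_terms:
        nodes.update(graph.get(q, {}))
    return {
        u: {v: w for v, w in graph.get(u, {}).items() if v in nodes}
        for u in nodes
    }


# Hypothetical symmetric co-occurrence graph with edge weights.
graph = {
    "federer": {"tennis": 5, "switzerland": 2},
    "nadal": {"tennis": 4},
    "tennis": {"federer": 5, "nadal": 4},
    "switzerland": {"federer": 2, "alps": 3},
    "alps": {"switzerland": 3},
}
context = cooccurrence_context(graph, ["federer", "nadal"])
print(sorted(context))
```

Here "alps" is two hops away from every query term, so it falls outside the context.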





Online Page Importance Computation



         Each node i in the context is initialised with a cash c_i.
         A node a is picked at random and the cash c_a is added to its history h_a.
         Then c_a is distributed amongst all its neighbours in proportion to the edge weights.
         This process is iterated until the ratios between the h_i values become nearly constant.
         The node with the largest h_i is chosen as the most central node.

  Unfortunately, OPIC was seen to be unsuitable for determining topical anchors
  since it tends to find central nodes for the entire graph
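The steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it runs a fixed number of iterations instead of testing the history ratios for convergence, uses a seeded random generator, and assumes a symmetric weighted adjacency map in which every neighbour is also a key. The function name and the star-shaped sample graph are hypothetical.

```python
import random


def opic(graph, iterations=20000, seed=0):
    """Return the cash history of each node after random cash redistribution."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    cash = {n: 1.0 / len(nodes) for n in nodes}  # equal initial cash c_i
    history = {n: 0.0 for n in nodes}            # accumulated history h_i
    for _ in range(iterations):
        a = rng.choice(nodes)
        total = sum(graph[a].values())
        if total == 0:
            continue  # isolated node: nothing to distribute
        c = cash[a]
        history[a] += c
        cash[a] = 0.0
        for nbr, w in graph[a].items():
            cash[nbr] += c * w / total  # share proportional to edge weight
    return history


# Hypothetical star graph: one hub connected to three leaves.
graph = {
    "hub": {"a": 1, "b": 1, "c": 1},
    "a": {"hub": 1},
    "b": {"hub": 1},
    "c": {"hub": 1},
}
history = opic(graph)
print(max(history, key=history.get))
```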






Cash Leaking Random Walk


         Co-occurrence graphs have extremely small diameters (4-5).
         For example, Roger Federer reaches feral child in two hops.
         Football, instead of Tennis, becomes the most central term for Roger Federer and Rafael Nadal.
         Solution: Cash Leakage
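The slide names cash leakage without giving its details, so the following is only a sketch under stated assumptions: cash is injected at the query terms alone, a fixed fraction `leak` is lost each time a node passes its cash on, and cash is propagated synchronously (expected flow) instead of by random node picks. The function name, parameters and sample graph are hypothetical.

```python
def cash_leaking_walk(graph, query_terms, leak=0.5, rounds=10):
    """Accumulate cash history while a fraction `leak` of the cash is lost per hop."""
    cash = {n: 0.0 for n in graph}
    for q in query_terms:
        cash[q] = 1.0
    history = {n: 0.0 for n in graph}
    for _ in range(rounds):
        nxt = {n: 0.0 for n in graph}
        for node, c in cash.items():
            if c == 0.0:
                continue
            history[node] += c
            total = sum(graph[node].values())
            if total == 0:
                continue
            kept = c * (1.0 - leak)  # the remainder leaks out of the graph
            for nbr, w in graph[node].items():
                nxt[nbr] += kept * w / total
        cash = nxt
    return history


# Hypothetical symmetric co-occurrence graph with edge weights.
graph = {
    "federer": {"tennis": 3, "sport": 1},
    "nadal": {"tennis": 3, "sport": 1},
    "tennis": {"federer": 3, "nadal": 3, "sport": 1},
    "sport": {"federer": 1, "nadal": 1, "tennis": 1, "football": 5},
    "football": {"sport": 5},
}
query = ["federer", "nadal"]
history = cash_leaking_walk(graph, query)
candidates = {n: h for n, h in history.items() if n not in query}
print(max(candidates, key=candidates.get))
```

Because cash decays with every hop, terms far from the query terms accumulate little history even when the graph diameter is small.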






Bias and History Vectors



         The way centrality is computed introduces a hidden bias between query words.
         Example: Jim Carrey, Hugh Grant, Rajkumar
         Bias due to difference in neighbourhood sizes
         Bias due to polysemy
         Example: Java, Beans, Kaffe






Bias examples
    Query Terms                                           Topical Anchors
    Java, Beans, Kaffe                                    Programming language, Indonesia, Food
    United States Dollar, Euro, West African CFA franc    French language, Guinea, Guinea-Bissau
    Bayes, Euclid, Ramanujan, Bernoulli                   Probability, Mathematics, Number
    MIT, Stanford, IIT                                    University, Indian Institute of Technology, Bombay
    Leaf, Fruit, Stem, Photosynthesis                     Linguistics, Plant, Tree
    Bernoulli, Poisson, Weibull, Binomial                 Godwin, Norway, Harold Godwinson

    Table: Examples with irrelevant topical anchors





Solution to the topic bias problem


  Labelled cash.
  Vector models of CLRW
  Cash from each query term q_i is given a “colour” c_i. The cash history at
  any node is hence a vector of the form (v_1, v_2, ..., v_n) showing the cash flow history
  for each of the colours. The vector is then normalized as:

      \[ v_i = \frac{v_i}{\hat{v}}, \quad \text{where } \hat{v} = \max_i v_i \text{ and } v_i \in [0, 1] \]
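A small sketch of this normalization. It assumes the maximum is taken over the components of the vector being normalized, which the slide does not spell out; the function name and sample values are illustrative.

```python
def normalize_history(vector):
    """Divide every colour's cash history by the largest component of the vector."""
    peak = max(vector)
    if peak == 0:
        return [0.0] * len(vector)
    return [v / peak for v in vector]


# Hypothetical cash history of one node for three query-term colours.
print(normalize_history([2.0, 4.0, 1.0]))
```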






Projection


  Projection
         The line joining 0^n to 1^n (the all-zeros and all-ones points in the n-dimensional space of cash history vectors) represents points where all query terms have contributed equally to the cash history. This is called the baseline
         Hence, for any given node, the projection of its cash history vector onto the baseline represents the importance of the node as a topical anchor
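As a sketch, the length of the projection of a vector v onto the baseline is the dot product of v with the unit vector along (1, ..., 1), that is, the sum of the components divided by the square root of n. The function name and sample vectors are assumptions.

```python
import math


def projection_score(vector):
    """Length of the projection of `vector` onto the line from 0^n to 1^n."""
    return sum(vector) / math.sqrt(len(vector))


# Hypothetical normalized cash history vectors for four query terms.
print(projection_score([1.0, 1.0, 1.0, 1.0]))
print(projection_score([1.0, 0.0, 0.0, 0.0]))
```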






Euclidean Distance



  Euclidean distance
         The Euclidean metric computes the L2 distance between the normalized cash history vector of a candidate node and 1^n (the all-ones point)
         It favours uniformity in the cash history distribution over the overall magnitude of the cash history
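A minimal sketch of this metric, where a smaller distance means a better candidate. The function name and the two sample vectors are assumptions chosen to show the preference for uniformity.

```python
import math


def euclidean_to_ones(vector):
    """L2 distance between `vector` and the all-ones point."""
    return math.sqrt(sum((1.0 - v) ** 2 for v in vector))


uniform = [0.5, 0.5]  # every query term contributes equally
skewed = [1.0, 0.1]   # larger total cash, but dominated by one query term
print(euclidean_to_ones(uniform), euclidean_to_ones(skewed))
```

The skewed vector has the larger component sum, yet the uniform vector lies closer to the all-ones point.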






Cosine Similarity




  Cosine similarity
         Computes the cosine between a given node’s normalized cash history vector and 1^n (the all-ones point)
         Another metric for factoring both uniformity in cash distribution and magnitude
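A minimal sketch of the cosine computation against the all-ones vector. The function name and sample vectors are assumptions.

```python
import math


def cosine_to_ones(vector):
    """Cosine of the angle between `vector` and the all-ones vector."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return 0.0
    return sum(vector) / (norm * math.sqrt(len(vector)))


# Hypothetical normalized cash history vectors for four query terms.
print(cosine_to_ones([1.0, 1.0, 1.0, 1.0]))
print(cosine_to_ones([1.0, 0.0, 0.0, 0.0]))
```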






Example results
    Query Terms               Projection                  Euclidean                   Cosine

    United States Dollar,     French language,            Currency, Bank,             Currency, Bank,
    Euro, West African        Guinea, Guinea-Bissau       France                      France
    CFA franc

    Bayes, Euclid,            Probability,                Mathematics,                Mathematics,
    Ramanujan, Bernoulli      Mathematics, Number         Mathematician, Euler        Mathematician,
                                                                                      Probability distribution

    MIT, Stanford, IIT        University, Indian          University, College,        University, College,
                              Institute of Technology,    Technology                  Science
                              Bombay

    Leaf, Fruit, Stem,        Linguistics, Plant,         Plant, Tree,                Plant, Tree,
    Photosynthesis            Tree                        Species                     Species

    Bernoulli, Poisson,       Godwin, Norway,             Mathematics,                Mathematics,
    Weibull, Binomial         Harold Godwinson            Probability,                Probability,
                                                          Expected Value              Statistics



User evaluation
  Experimental Setup:
         86 volunteer users were given a set of queries and asked to provide topical labels for these queries, ranked according to their perceived importance
         66 volunteers answered 100 questions, while the rest answered 30 random questions chosen from the 100 questions
         User responses were charted for consistency in results (chart shown below)






User evaluation
CLRW against tf-idf and OPIC






Comparison
Comparison with Automatic Topic Labeling algorithm [3]




   Caveats: The comparison uses the Euclidean algorithm. ATL requires document
   contexts where the topical anchor is present (unlike CLRW, which searches on
   the co-occurrence graph built over a corpus)



Future Work
Several open questions...




          Topical markers, semantic siblings
          Co-occurrence semantics when coupled with concept hierarchies
          Automatic detection of semantic relations based on co-occurrence
          Automatic attribute identification
                                                Thank You!




                                            References

[1] A. Biletzki and A. Matar. Ludwig Wittgenstein (second revision). Stanford Encyclopedia of Philosophy, May 2009.
[2] W. Gerstner and W. M. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[3] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499, New York, NY, USA, 2007. ACM.
[4] M. R. Mutalikdesai and S. Srinivasa. Co-citations as endorsements of citations. Submitted for publication, 2009.
[5] S. Pinker. The Language Instinct. Harper Perennial Modern Classics, 2007.
[6] A. R. Rachakonda and S. Srinivasa. Finding the topical anchors of a context using lexical cooccurrence data. In Proceedings of ACM Conference on Information and Knowledge Management (CIKM), 2009.
[7] A. R. Rachakonda and S. Srinivasa. Vector-based ranking techniques for identifying the topical anchors of a context. In Proceedings of the 15th International Conference on Management of Data (COMAD), 2009.
[8] S. Reddy, S. Srinivasa, and M. R. Mutalikdesai. Measures of "ignorance" on the web. In Proceedings of the International Conference on Management of Data (COMAD), Dec 2006.




Semantics hidden within co-occurrence patterns

  • 1. Semantics hidden within co-occurrence patterns: A bottom-up approach to the Semantic Web? Srinath Srinivasa, IIIT Bangalore, sri@iiitb.ac.in. IEEE Computer Society talk, Nov 20 2009. © Srinath Srinivasa, IIIT-Bangalore
  • 2. Outline: 1. Co-occurrence and Meaning; 2. Co-occurrence graphs; 3. Interpretation of Co-citations; 4. Topical Anchors
  • 4. Conventional WebIR and co-occurrence
       Lexical feature extraction: bag-of-words model
       Document vectorization
       Implicit assumption of independence of dimensions
       Vector space reduction and spectral analyses for identifying hidden semantics (e.g., LSA, SVD, clustering)
       In human languages, lexical terms are not independent of one another; moreover, important semantic structures are inherent in the way terms co-occur.
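The independence assumption can be made concrete with a small sketch. It is not taken from the talk: the three toy sentences and the helper names `bag_of_words` and `cosine` are assumptions chosen for illustration. Each distinct term is its own dimension, so two texts that share no terms come out as orthogonal even when they are about the same thing.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map a text to term counts, ignoring word order."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents (illustrative only).
d1 = bag_of_words("the car runs on diesel")
d2 = bag_of_words("my automobile needs petrol")
d3 = bag_of_words("the car needs diesel")

sim_12 = cosine(d1, d2)  # no shared terms, so the vectors are orthogonal
sim_13 = cosine(d1, d3)  # shared terms: the, car, diesel
```

Here "car" and "automobile" contribute nothing to each other's similarity, which is the limitation that reduction techniques such as LSA try to repair after the fact.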
  • 5. Motivational Problems
       Some motivational problems that show the limitations of purely lexical approaches to IR:
       The topical anchor problem
       “If ever a player has overshadowed Sachin Tendulkar for sheer class of batsmanship, it is V V S Laxman. After a record 353-run fourth-wicket partnership in the 2004 Sydney Test when Laxman hit 30 fours in his 178 to Tendulkar’s 33 in his unbeaten 241, the master put the artistry of V V S in perspective.”
       What is the best topic for this paragraph: Sachin Tendulkar, V V S Laxman, Sydney, Australia, Cricket, or Test Match?
  • 6. Motivational Problems
       The semantic attributes problem
       Given that a user has searched for the term “Malmö”, which of the following keywords can be termed “attributes” that enhance the meaning represented by Malmö:
       Driving, History, Mileage, Weather, Symptoms, Elephant, LaTeX beamer, Infringement
  • 7. Motivational Problems
       The topical marker problem
       The US Federal Aviation Regulations Sec 380.12 states that:
       The charter operator may not cancel a charter for any reason (including insufficient participation), except for circumstances that make it physically impossible to perform the charter trip, less than 10 days before the scheduled date of departure of the outbound trip. If the charter operator cancels 10 or more days before the scheduled date of departure, the operator must so notify each participant in writing within 7 days after the cancellation but in any event not less than 10 days before the scheduled departure date of the outbound trip. If a charter is canceled less than 10 days before scheduled departure (i.e., for circumstances that make it physically impossible to perform the charter trip), the operator must get the message to each participant as soon as possible.
       If a user who has booked a ticket with a charter operator finds out that her flight has been cancelled suddenly without notice and wants to confront the operator, what should she search for: charter operator, FAR, cancellation, scheduled trip, Sec 380, operator, notification, ...?
  • 9. Motivational Problems
       The theme problem:
       Article 1: A US Airways Airbus A320 suffered a bird hit immediately after take-off from New York’s La Guardia airport and was forced to land on the Hudson river. All 150 passengers and 5 crew members are reported to be safe with only minor injuries.
       Article 2: La Guardia International Airport is conveniently located to serve the citizens of both New York and New Jersey. Airlines that operate from here include: US Airways, Delta, Continental and Virgin. Its MRO routinely serves a number of aircraft types including Boeing 73x series and the Airbus A320 and A330 series. La Guardia is easily reachable from New Jersey through the Lincoln tunnel that runs under the Hudson river.
       Article 3: Pilot Steve Bolle of a light aircraft in northern Australia was forced to make an emergency landing in water after suffering engine trouble on take-off. He landed the Piper Chieftain plane in shallow waters after realising he would not make it back to the airport. Mr Bolle and his five passengers were able to wade safely to shore.
       Which of the articles above are similar to one another? (Acknowledgements: Wikipedia and BBC)
  • 13. Co-occurrence and Meaning
        Hebbian learning
        Co-occurrence plays a central role in the Hebbian theory of the semantic organization of the human mind, which states that synaptic plasticity between neurons is determined by repeated and persistent stimulation of the pre- and post-synaptic cells [2]. This is also summarized as: cells that fire together, wire together.
        Co-occurrence and the language instinct
        Language structures such as pluralization are often learnt by analyzing co-occurrence patterns. An interesting example is the “wug” test (cf. [5]):
        That is a pig; these are pigs. That is a dog; these are dogs. That is a cat; these are cats. That is a wug; these are ___.
        The use of co-occurrence is even more apparent in this example, which leads to confusion (even if only for a moment):
        The plural of radius is radii; the plural of thesis is theses; the plural of bus is buses. The plural of lotus is lotii? lotes? lotuses?
  • 15. Co-occurrence and meaning
        Meaning is usage
        The analytic philosophy worldview that “meaning is usage” [1] can be explained by representing usage through co-occurrence analysis. Consider the following paragraphs:
        Every day, I go to work in my pqer. My pqer runs on diesel and gives one of the best mileages for pqers in its category. My pqer can seat five people and is a good candidate for pqer-pooling.
        On December 26 2004, a massive earthquake measuring 9.1 jolted Java. This earthquake triggered a huge tsunami that has been the deadliest in history. We have developed an applet to simulate the path taken by the tsunami. You can run this applet in any browser that has Java enabled.
        In the first paragraph, the meaning of the word “pqer” and, in the second paragraph, the word sense of the term “Java” are both resolved by looking at the other terms that co-occur with them.
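The “Java” example can be sketched as a simple overlap test between the words around an occurrence and the words assumed to co-occur with each sense. This is only an illustration, not the method used in the talk: the `SENSE_SIGNATURES` table is hypothetical and hand-written here, whereas in practice such signatures would be derived from corpus co-occurrence data.

```python
# Hypothetical sense signatures: terms assumed to co-occur with each sense of "Java".
SENSE_SIGNATURES = {
    "Java (island)": {"earthquake", "tsunami", "indonesia", "volcano", "island"},
    "Java (programming language)": {"applet", "browser", "class", "compiler", "run"},
}

def resolve_sense(context, signatures):
    """Pick the sense whose signature overlaps most with the context terms."""
    terms = {w.strip(".,").lower() for w in context.split()}
    return max(signatures, key=lambda sense: len(signatures[sense] & terms))

s1 = resolve_sense("A massive earthquake measuring 9.1 jolted Java.", SENSE_SIGNATURES)
s2 = resolve_sense(
    "You can run this applet in any browser that has Java enabled.", SENSE_SIGNATURES
)
```

The same surface term is assigned to different senses purely on the basis of its neighbours, which is the point the two sentences in the slide are making.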
  • 20. Capturing co-occurrence
        We are given a document corpus that is represented as a set of “contexts”: $\mathcal{C} = \{C_1, C_2, \ldots, C_n\}$
        Depending on the specific problem, a context may take various forms, such as a sentence, a paragraph, or a document.
        Two entities $e_i$ and $e_j$ are said to co-occur (denoted $e_i \sim e_j$) if there is some context $C \in \mathcal{C}$ such that $e_i, e_j \in C$.
        The support for a co-occurring pair $e_i \sim e_j$ is the probability of finding this co-occurrence in any given context $C$ in the corpus. In other words, the support is the joint probability $P(e_i, e_j)$.
        Note that co-occurrence is an n-ary relation. For simplicity, we focus on pairwise co-occurrences and derive higher-order semantics when required.
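A minimal sketch of the support computation, assuming each context has already been reduced to a set of entity strings (entity extraction itself is outside the sketch). The function name `pairwise_support` and the toy contexts are assumptions for illustration; support is estimated as the fraction of contexts that contain both entities.

```python
from collections import Counter
from itertools import combinations

def pairwise_support(contexts):
    """Return P(e_i, e_j): the fraction of contexts containing both entities."""
    counts = Counter()
    for context in contexts:
        # Each unordered pair is counted at most once per context.
        for pair in combinations(sorted(set(context)), 2):
            counts[pair] += 1
    n = len(contexts)
    return {pair: c / n for pair, c in counts.items()}

# Toy corpus of four contexts (illustrative only).
contexts = [
    {"tendulkar", "laxman", "cricket"},
    {"tendulkar", "cricket", "sydney"},
    {"laxman", "sydney"},
    {"cricket", "test match"},
]
support = pairwise_support(contexts)
```

Pairs are stored in sorted order, so the support of “cricket” and “tendulkar” is looked up as `support[("cricket", "tendulkar")]`. Changing what counts as a context (sentence, paragraph, document) changes the estimates without changing the code.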
  • 23. Co-occurrence graphs
        Co-occurrence graph
        A co-occurrence graph is a weighted, undirected graph $G = (E, \sim, w)$, where $E$ is a set of “entities”, $\sim \subseteq E \times E$ is a set of co-occurrences, and $w : \sim \to \mathbb{R}$ indicates the support for each co-occurrence.
        Co-occurrence versus n-partite graphs
        Semantic co-occurrence graphs
        A semantic co-occurrence graph is a co-occurrence graph that is augmented with a concept hierarchy. A concept hierarchy is defined by one or more partial orders of the form $\preceq \subseteq E \times E$, representing relationships like is-a and is-in, that are reflexive, anti-symmetric and transitive.
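One possible in-memory shape for a semantic co-occurrence graph is sketched below, under the assumption that undirected edges are keyed by `frozenset` and that the concept hierarchy is given as direct is-a pairs which are then closed into a partial order. The entity names, the weights and both helper functions are illustrative and not from the talk.

```python
from itertools import product

def is_partial_order(elements, relation):
    """Check that `relation` (a set of (a, b) pairs) is reflexive, antisymmetric and transitive."""
    reflexive = all((e, e) in relation for e in elements)
    antisymmetric = all(a == b or (b, a) not in relation for a, b in relation)
    transitive = all(
        (a, d) in relation
        for (a, b), (c, d) in product(relation, repeat=2)
        if b == c
    )
    return reflexive and antisymmetric and transitive

def reflexive_transitive_closure(elements, pairs):
    """Close a set of direct is-a / is-in pairs under reflexivity and transitivity."""
    closure = set(pairs) | {(e, e) for e in elements}
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

entities = {"A320", "Airbus aircraft", "aircraft"}
# Co-occurrence part: undirected edges keyed by frozenset, weighted by support.
weights = {
    frozenset({"A320", "aircraft"}): 0.4,
    frozenset({"A320", "Airbus aircraft"}): 0.3,
}
# Concept-hierarchy part: direct is-a pairs, closed into a partial order.
is_a = reflexive_transitive_closure(
    entities, {("A320", "Airbus aircraft"), ("Airbus aircraft", "aircraft")}
)
```

Keying edges by `frozenset` makes the lookup order-independent, which matches the undirected definition. The closure step shows why the hierarchy is required to be a partial order: “A320 is-a aircraft” then follows from the two direct pairs.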
  • 26. Co-occurrence graph
        Example: Concept hierarchy construction
        1. Start with a base ontology
        2. Use co-occurrence patterns to guess conceptual relationships across terms
        3. Use the concept hierarchy to identify deeper co-occurrence patterns
        4. Repeat from step 2 in a semi-automated fashion until the algorithm stabilizes
  • 28. Co-occurrence graphs
        Characteristics of co-occurrence graphs
        Triadic closure (highly clustered)
        Disconnected components or a single component of very small diameter
        The co-occurrence graph of all noun phrases in Wikipedia has a diameter of 4
        Co-occurrence support for entity pairs follows a power law
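Two of these characteristics, clustering and diameter, can be measured directly on an adjacency-set representation. The sketch below uses a toy four-node graph that is an assumption for illustration, and the all-pairs breadth-first search is only practical for small graphs; a Wikipedia-scale graph would need sampling or a dedicated graph library.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = graph[node]
    if len(neighbours) < 2:
        return 0.0
    pairs = list(combinations(neighbours, 2))
    closed = sum(1 for a, b in pairs if b in graph[a])
    return closed / len(pairs)

def eccentricity(graph, source):
    """Longest shortest-path distance from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(graph):
    """Diameter of a connected graph: the largest eccentricity over all nodes."""
    return max(eccentricity(graph, node) for node in graph)

# Toy undirected graph as adjacency sets: a triangle a-b-c plus a pendant node d.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

A high clustering coefficient is what triadic closure looks like numerically: if two entities each co-occur with a third, they tend to co-occur with each other.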
  • 30. Co-citation
        Co-citation and bibliographic coupling are important metrics in several datasets such as scientific literature, web pages, wikis, and tagging systems like delicious.
        Co-citation of a pair of documents corresponds to the co-occurrence of their references (e.g., URLs) in a context.
        Pair-wise co-citation graphs have the same properties as co-occurrence graphs.
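Treating each citing document as a context and its outgoing references as entities, co-citation counts follow from the same pairwise counting used for co-occurrence. The `outlinks` mapping and the page names below are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(outlinks):
    """Count, for each pair of cited documents, how many citing documents reference both."""
    counts = Counter()
    for cited in outlinks.values():
        # Deduplicate so a page linking twice to the same target counts once.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts

# Toy data: each citing page maps to the list of documents it references.
outlinks = {
    "page1": ["A", "B", "C"],
    "page2": ["A", "B"],
    "page3": ["B", "C"],
}
counts = cocitation_counts(outlinks)
```

Dividing each count by the number of citing pages gives the support of the co-citation, so the result can be used as the edge weights of a pair-wise co-citation graph.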
  • 31. Co-citation patterns. Hyperlink distance across pairs of highly co-cited pages [8]. Figure: Hyperlink distance across pairs of highly co-cited Web pages. Figure: Hyperlink distance across pairs of highly co-cited Wikipedia pages. (Both plots show frequency F against hyperlink distance k, for k = 1 to 7, kmax, and >kmax.)
  • 32. Co-citation patterns. Hyperlink distance across pairs of highly co-cited pages. Endorsement of a citation: Page A endorses the content of page B. Users reading page A traverse this link and find page B useful too. Users create their own pages citing both A and B. If A has several outgoing links, and only some pairs of outlinks are co-cited, then co-citation can be seen as an endorsement of the citation. Topical aggregation: Document A represents content about a “higher-level” topic in terms of is-a or is-in relationships, and links to (hence co-cites) several pages on “lower-level” topics. Pages on the “lower-level” topics usually cite back the page on the “higher-level” topic, hence giving a citation distance of 2 among themselves. Nepotistic co-citations: Another major source of co-citation (primarily on web pages) is “nepotistic links” in the form of navigational tabs.
  • 35. Co-citation graph of a web crawl. Pairs of pages with at least 100 non-nepotistic co-citations.
  • 36. Co-citation graph of a web crawl. The co-citation graph depicts pairs of pages with at least 100 non-nepotistic co-citations. In addition to being made of disconnected components, the graph also shows various recurring structural motifs: star, clique, clique chain, and dumb-bell. Interpretations of these motifs, along with examples, are given in Mutalikdesai and Srinivasa (2009) [4].
  • 37. Endorsed hyperlink graph (EHG). On the web, a co-citation usually implies a citation. Hence the EHG is essentially a directed version of the co-citation graph. Some EHG components are depicted below: EHG clique chain.
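One possible reading of the EHG, sketched in Python: keep a hyperlink from A to B only when the pair (A, B) is itself co-cited by enough other pages. This construction, the function name `endorsed_hyperlink_graph` and the `min_cocitations` threshold are assumptions made for illustration; the precise definition is in [4].

```python
from collections import Counter
from itertools import combinations


def endorsed_hyperlink_graph(links, min_cocitations=1):
    """Return the set of directed links (src, dst) whose endpoints are co-cited.

    links: dict mapping a page to the pages it links to.
    """
    # Co-citation support for every unordered pair of link targets.
    support = Counter()
    for targets in links.values():
        for a, b in combinations(sorted(set(targets)), 2):
            support[(a, b)] += 1

    endorsed = set()
    for src, targets in links.items():
        for dst in set(targets):
            if src == dst:
                continue
            pair = tuple(sorted((src, dst)))
            if support[pair] >= min_cocitations:
                endorsed.add((src, dst))
    return endorsed


# Hypothetical crawl: A links to B and C; two user pages cite both A and B.
links = {
    "A": ["B", "C"],
    "B": [],
    "U1": ["A", "B"],
    "U2": ["A", "B"],
}
ehg = endorsed_hyperlink_graph(links, min_cocitations=2)
```

In this toy crawl only the link from A to B survives, because the pair (A, C) is never co-cited.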
  • 38. Endorsed citation graph (ECG) for scientific literature. ECG of citation information obtained from CiteSeer.
  • 39. Endorsed citation graph. The ECG over scientific literature data (from CiteSeer) shows a similar componentization of the graph, except that the ECG has one giant component. Citation in scientific literature differs subtly from hyperlink citation: scientific literature citations always point into the past, and very rarely (if at all) form cyclic structures. The ECG consists mostly of weakly connected directed graph components, while the EHG may contain strongly connected components.
  • 40. ERank: importance of a page within an EHG. ERank is an authority score of a page within an EHG (or ECG) component. It reflects the reachability of the page within the component. ERank scores in a component were shown to be uncorrelated with the PageRank scores of the pages of that component.
  • 41. EndorSeer. A Firefox plugin for augmented browsing of CiteSeer. It currently shows the endorsed citations from among the list of citations of any paper. Currently underway: showing the ECG component and ECG neighbourhood of a paper.
  • 43. Topical Anchors [6, 7]: motivation. Example: “Will my oral insulin drugs, along with my hypertension and high blood glucose, have any side effects on the health of my pancreas?” Can a machine detect diabetes as the context? Another example: a document containing the terms Andy Roddick, Roger Federer and Rafael Nadal. How likely is it that the word Tennis will be (semantically) mentioned when discussing these players?
  • 45. Co-occurrence context. Given a set of query terms, the co-occurrence context is defined as the subgraph formed by the query terms and the set of terms that co-occur with at least one of them. Conjecture: the topical anchor of a set of terms is a highly authoritative term that lies within the co-occurrence context of the query terms.
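The definition of the co-occurrence context can be sketched as an induced subgraph. The graph representation (a dict of weighted neighbour dicts), the function name `cooccurrence_context` and the sample weights are all hypothetical.

```python
def cooccurrence_context(graph, query_terms):
    """Subgraph induced by the query terms and their direct neighbours.

    graph: dict mapping a term to {neighbour: co-occurrence weight}.
    """
    nodes = set(query_terms)
    for q in query_terms:
        nodes.update(graph.get(q, {}))
    # Keep only edges whose two endpoints are both inside the context.
    return {
        u: {v: w for v, w in graph.get(u, {}).items() if v in nodes}
        for u in nodes
    }


# Toy symmetric co-occurrence graph with made-up weights.
graph = {
    "Roger Federer": {"Tennis": 5, "Switzerland": 2},
    "Rafael Nadal": {"Tennis": 4, "Spain": 3},
    "Tennis": {"Roger Federer": 5, "Rafael Nadal": 4, "Racket": 1},
    "Switzerland": {"Roger Federer": 2},
    "Spain": {"Rafael Nadal": 3},
    "Racket": {"Tennis": 1},
}
context = cooccurrence_context(graph, ["Roger Federer", "Rafael Nadal"])
```

Here `Racket` falls outside the context because it co-occurs with neither query term, even though it is a neighbour of `Tennis`.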
  • 46. Online Page Importance Computation (OPIC). Each node $i$ in the context is initialised with a cash $c_i$. A node $a$ is picked at random and its cash $c_a$ is added to its history $h_a$. Then $c_a$ is distributed amongst all its neighbours in proportion to the edge weights. This process is iterated until the ratios of the $h_i$ values become nearly constant. The node with the largest $h_i$ is chosen as the most central node. Unfortunately, OPIC was found to be unsuitable for determining topical anchors, since it tends to find nodes that are central to the entire graph.
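A minimal Python sketch of the OPIC iteration described above. It assumes a weighted graph without self-loops, resets a node's cash to zero after distributing it (as in the original OPIC formulation), and replaces the convergence check with a fixed number of steps; the function name `opic`, the `steps` and `seed` parameters and the toy graph are hypothetical.

```python
import random


def opic(graph, steps=20000, seed=0):
    """Return (history, cash) after a fixed number of OPIC steps.

    graph: dict mapping a node to {neighbour: edge weight}, no self-loops.
    """
    rng = random.Random(seed)
    nodes = list(graph)
    cash = {u: 1.0 for u in nodes}      # initial cash c_i
    history = {u: 0.0 for u in nodes}   # accumulated history h_i
    for _ in range(steps):
        a = rng.choice(nodes)
        total = sum(graph[a].values())
        if total == 0:
            continue  # isolated node: nothing to distribute
        amount = cash[a]
        history[a] += amount
        cash[a] = 0.0
        for nbr, weight in graph[a].items():
            cash[nbr] += amount * weight / total
    return history, cash


# Toy star graph: the hub should accumulate the largest history.
star = {
    "hub": {"x": 1, "y": 1, "z": 1},
    "x": {"hub": 1},
    "y": {"hub": 1},
    "z": {"hub": 1},
}
history, cash = opic(star)
most_central = max(history, key=history.get)
```

Cash is only moved between nodes, never created or destroyed, so the total stays at its initial value; this is the property that the cash-leaking variant deliberately gives up.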
  • 47. Cash Leaking Random Walk (CLRW). Co-occurrence graphs have extremely small diameters (4 to 5): Roger Federer reaches feral child in two hops. As a result, Football becomes the most central term for Roger Federer and Rafael Nadal, instead of Tennis. Solution: cash leakage.
  • 48. Bias and history vectors. There is a hidden bias between query words in the way centrality is computed. Bias due to difference in neighbourhood sizes; example: Jim Carrey, Hugh Grant, Rajkumar. Bias due to polysemy; example: Java, Beans, Kaffe.
  • 49. Bias examples. Table: Examples with irrelevant topical anchors.
      Query terms | Topical anchors
      Java, Beans, Kaffe | Programming language, Indonesia, Food
      United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau
      Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number
      MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay
      Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree
      Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson
  • 50. Solution to the topic bias problem: labelled cash (vector models of CLRW). Cash from each query term $q_i$ is given a “colour” $c_i$. The cash history at any node is hence a vector of the form $(v_1, v_2, \ldots, v_n)$, showing the cash flow history for each of the colours. The vector is then normalized as $\hat{v}_i = v_i / v$, where $v = \max_i v_i$ and $\hat{v}_i \in [0, 1]$.
  • 51. Projection. The line joining $0^n$ to $1^n$ represents the points where all query terms have contributed equally to the cash history. This line is called the baseline. Hence, for any given node, the projection of its cash history vector onto the baseline represents the importance of the node as a topical anchor.
  • 52. Euclidean distance. The Euclidean metric computes the L2 distance between the normalized cash history vector of a candidate node and $1^n$. It favours uniformity in the cash history distribution over the overall magnitude of the cash history.
  • 53. Cosine similarity. Computes the cosine of the angle between a given node’s normalized cash history vector and $1^n$. Another metric for factoring in both uniformity in cash distribution and magnitude.
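The three ranking measures can be sketched in a few lines of Python. The sketch assumes that the normalizing constant is the largest component of the node's own cash history vector (one literal reading of the normalization slide) and that "projection" means the scalar projection onto the unit vector along the baseline; all function names and sample vectors are hypothetical.

```python
import math


def normalize(history):
    """Divide every component by the largest one, giving values in [0, 1]."""
    peak = max(history)
    return [v / peak for v in history] if peak > 0 else list(history)


def projection_score(v_hat):
    """Length of the projection onto the baseline from 0^n to 1^n."""
    return sum(v_hat) / math.sqrt(len(v_hat))


def euclidean_distance(v_hat):
    """L2 distance to the all-ones point 1^n (smaller is better)."""
    return math.sqrt(sum((1.0 - v) ** 2 for v in v_hat))


def cosine_similarity(v_hat):
    """Cosine of the angle between the vector and 1^n (larger is better)."""
    norm = math.sqrt(sum(v * v for v in v_hat))
    if norm == 0:
        return 0.0
    return sum(v_hat) / (norm * math.sqrt(len(v_hat)))


# Made-up cash histories for a query with two terms (two colours).
balanced = normalize([8.0, 8.0])   # both query terms contribute equally
skewed = normalize([10.0, 0.0])    # only one query term contributes
```

A node fed equally by every query term sits on the baseline, so it has zero Euclidean distance to $1^n$ and cosine 1; a node fed by a single term scores worse on all three measures.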
  • 54. Example results.
      Query terms | Projection | Euclidean | Cosine
      United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau | Currency, Bank, France | Currency, Bank, France
      Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number | Mathematics, Mathematician, Euler | Mathematics, Mathematician, Probability distribution
      MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay | University, College, Technology | University, College, Science
      Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree | Plant, Tree, Species | Plant, Tree, Species
      Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson | Mathematics, Probability, Expected Value | Mathematics, Probability, Statistics
  • 55. User evaluation. Experimental setup: 86 volunteer users were given a set of queries and asked to provide topical labels for these queries, ranked according to their perceived importance. 66 volunteers answered 100 questions, while the rest answered 30 questions chosen at random from the 100. User responses were charted for consistency in results (chart shown below).
  • 56. User evaluation. CLRW against tf-idf and OPIC.
  • 57. Comparison with the Automatic Topic Labeling (ATL) algorithm [3]. Caveats: the comparison is with the Euclidean algorithm. ATL requires document contexts in which the topical anchor is present (unlike CLRW, which searches the co-occurrence graph built over a corpus).
  • 58. Future work. Several open questions: topical markers and semantic siblings; co-occurrence semantics when coupled with concept hierarchies; automatic detection of semantic relations based on co-occurrence; automatic attribute identification. Thank you!
  • 59. References
      [1] A. Biletzki and A. Matar. Ludwig Wittgenstein (second revision). Stanford Encyclopedia of Philosophy, May 2009.
      [2] Gerstner and Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
      [3] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499, New York, NY, USA, 2007. ACM.
      [4] M. R. Mutalikdesai and S. Srinivasa. Co-citations as endorsements of citations. Submitted for publication, 2009.
      [5] S. Pinker. The Language Instinct. Harper Perennial Modern Classics, 2007.
      [6] A. R. Rachakonda and S. Srinivasa. Finding the topical anchors of a context using lexical cooccurrence data. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2009.
      [7] A. R. Rachakonda and S. Srinivasa. Vector-based ranking techniques for identifying the topical anchors of a context. In Proceedings of the 15th International Conference on Management of Data (COMAD), 2009.
      [8] S. Reddy, S. Srinivasa, and M. R. Mutalikdesai. Measures of “ignorance” on the web. In Proceedings of the International Conference on Management of Data (COMAD), Dec 2006.