Semantics hidden within co-occurrence patterns
A bottom-up approach to the Semantic Web?
Srinath Srinivasa
IIIT Bangalore
sri@iiitb.ac.in
IEEE Computer Society talk. Nov 20 2009. © Srinath Srinivasa, IIIT-Bangalore
Outline
1 Co-occurrence and Meaning
2 Co-occurrence graphs
3 Interpretation of Co-citations
4 Topical Anchors
Conventional WebIR and co-occurrence
Lexical feature extraction: Bag-of-words model
Document vectorization
Implicit assumption of independence of dimensions
Vector space reduction and spectral analyses for identifying
hidden semantics (Ex: LSA, SVD, Clustering, etc.)
In human languages, lexical terms are not independent of one another;
moreover, important semantic structures are inherent in the way
terms co-occur.
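The independence assumption can be illustrated with a minimal sketch. It assumes whitespace tokenisation and raw term frequencies; the helper names `bag_of_words` and `cosine` are hypothetical and not from the talk. Because every term is its own dimension, two synonyms are treated as completely unrelated.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector for a whitespace-tokenised, lower-cased text."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_a = bag_of_words("my car runs on diesel")
doc_b = bag_of_words("my automobile runs on diesel")

# The documents overlap on four of five terms ...
similarity_docs = cosine(doc_a, doc_b)
# ... but the two synonyms themselves are orthogonal dimensions.
similarity_synonyms = cosine(bag_of_words("car"), bag_of_words("automobile"))
```

In this toy case the only evidence that "car" and "automobile" are related is that they appear alongside the same other terms, which is exactly the co-occurrence signal the bag-of-words model ignores.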
Motivational Problems
Some motivational problems to show limitations of purely lexical
approaches to IR:
The topical anchor problem
“If ever a player has overshadowed Sachin Tendulkar for sheer class of
batsmanship, it is V V S Laxman. After a record 353-run fourth-wicket
partnership in the 2004 Sydney Test when Laxman hit 30 fours in his 178
to Tendulkar’s 33 in his unbeaten 241, the master put the artistry of V V
S in perspective.”
What is the best topic of this paragraph: Sachin Tendulkar, V V S
Laxman, Sydney, Australia, Cricket, or Test Match?
Motivational Problems
The semantic attributes problem
Given that a user has searched for the term “Malmö”, which of the following
keywords can be termed as “attributes” that enhance the meaning represented
by Malmö:
Driving
History
Mileage
Weather
Symptoms
Elephant
LaTeX beamer
Infringement
Motivational Problems
The topical marker problem
The US Federal Aviation Regulations Sec 380.12 states that:
The charter operator may not cancel a charter for any reason (including insufficient participation), except
for circumstances that make it physically impossible to perform the charter trip, less than 10 days before
the scheduled date of departure of the outbound trip.
If the charter operator cancels 10 or more days before the scheduled date of departure, the operator must
so notify each participant in writing within 7 days after the cancellation but in any event not less than 10
days before the scheduled departure date of the outbound trip. If a charter is canceled less than 10 days
before scheduled departure (i.e., for circumstances that make it physically impossible to perform the
charter trip), the operator must get the message to each participant as soon as possible.
If a user who has booked a ticket with a charter operator finds out that her
flight has been cancelled suddenly without notice and wants to confront the
operator, what should she search for: charter operator, FAR, cancellation,
scheduled trip, Sec 380, operator, notification, . . . ?
Motivational Problems
The theme problem:
Article 1
A US Airways Airbus A320 suffered a bird hit immediately after take-off from New York’s La Guardia airport and
was forced to land on the Hudson river. All 150 passengers and 5 crew members are reported to be safe with only
minor injuries.
Article 2
La Guardia International Airport is conveniently located to serve the citizens of both New York and New Jersey.
Airlines that operate from here include: US Airways, Delta, Continental and Virgin. Its MRO routinely serves a
number of aircraft types including Boeing 73x series and the Airbus A320 and A330 series. La Guardia is easily
reachable from New Jersey through the Lincoln tunnel that runs under the Hudson river.
Article 3
Pilot Steve Bolle of a light aircraft in northern Australia was forced to make an emergency landing in water after
suffering engine trouble on take-off. He landed the Piper Chieftain plane in shallow waters after realising he would
not make it back to the airport. Mr Bolle and his five passengers were able to wade safely to shore.
Which of the articles above are similar to one another? (Ack to sources: Wikipedia and BBC)
Co-occurrence and Meaning
Hebbian learning
Co-occurrence plays a central role in the Hebbian theory of the semantic organization of the human mind,
which states that synaptic plasticity between neurons is determined by repeated and persistent
stimulation of the pre- and post-synaptic cells [2].
This is also summarized as: Cells that fire together, wire together
Co-occurrence and the language instinct
Language structures such as pluralization are often learnt by analyzing co-occurrence patterns. An
interesting example is the “wug” test (cf. [5]):
That is a pig; these are pigs. That is a dog; these are dogs. That is a cat; these are cats. That is a wug;
these are ____.
The use of co-occurrence is even more apparent in this example, which leads to confusion (even if only for a
moment):
The plural of radius is radii; the plural of thesis is theses; the plural of bus is buses. The plural of lotus is
lotii? lotes? lotuses?
Co-occurrence and meaning
Meaning is usage
The analytic philosophy worldview that “meaning is usage” [1] can be explained by
representing usage through co-occurrence analysis.
Consider the following paragraphs:
Every day, I go to work in my pqer. My pqer runs on diesel and gives one of the
best mileages for pqers in its category. My pqer can seat five people and is a
good candidate for pqer-pooling.
On December 26 2004, a massive earthquake measuring 9.1 jolted Java. This
earthquake triggered a huge tsunami that has been the deadliest in history. We
have developed an applet to simulate the path taken by the tsunami. You can
run this applet in any browser that has Java enabled.
In the first paragraph, the meaning of the word “pqer” and in the second paragraph, the word-sense of the term
“Java” are both resolved by looking at other terms that co-occur with them.
Capturing co-occurrence
We are given a document corpus that is represented as a set
of “contexts”:
$\mathcal{C} = \{C_1, C_2, \ldots, C_n\}$
Depending on the specific problem, a context may take
various forms like: sentence, paragraph, document, etc.
Two entities $e_i$ and $e_j$ are said to co-occur (denoted as
$e_i \leftrightarrow e_j$) if there is some context $C \in \mathcal{C}$ such that $e_i, e_j \in C$
The support for a co-occurring pair $e_i \leftrightarrow e_j$ is the probability
of finding this co-occurrence in any given context $C$ in the
corpus. In other words, the support is the joint probability
$P(e_i, e_j)$
Note that co-occurrence is an n-ary relation. But for purposes of simplicity, we
focus on pairwise co-occurrences and derive higher order semantics when
required.
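The support definition above can be sketched as follows. This is a minimal illustration that assumes each context is already given as a set of entity strings and estimates the joint probability as a relative frequency over contexts; the function name `pairwise_support` and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_support(contexts):
    """Estimate P(e_i, e_j) as the fraction of contexts containing both entities."""
    n = len(contexts)
    if n == 0:
        return {}
    counts = Counter()
    for context in contexts:
        # Sorting gives every unordered pair one canonical (a, b) key with a < b.
        for pair in combinations(sorted(set(context)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items()}


# Toy corpus: each context could be a sentence, paragraph or document.
contexts = [
    {"tendulkar", "laxman", "cricket"},
    {"tendulkar", "cricket"},
    {"federer", "nadal", "tennis"},
    {"laxman", "sydney"},
]
support = pairwise_support(contexts)
```

Here the pair ("cricket", "tendulkar") appears in two of four contexts, so its estimated support is 0.5. Pairs that never share a context are simply absent from the dictionary.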
Co-occurrence graphs
Co-occurrence graph
A co-occurrence graph is a weighted, undirected graph $G = (E, \leftrightarrow, w)$, where
$E$ is a set of “entities”, $\leftrightarrow \subseteq E \times E$ is a set of co-occurrences, and $w : \leftrightarrow \to \mathbb{R}$
indicates support for the co-occurrence
Co-occurrence versus n-partite graphs
Semantic co-occurrence graphs
A semantic co-occurrence graph is a co-occurrence graph that is augmented
with a concept hierarchy. A concept hierarchy is defined by one or more partial
orders of the form $\preceq \subseteq E \times E$, representing relationships like is-a and is-in,
that are reflexive, anti-symmetric and transitive.
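The three partial-order properties required of a concept hierarchy can be checked mechanically. The sketch below assumes a relation is stored as a set of (a, b) pairs read as "a is-a b"; the helper names and the toy entities are hypothetical. It also shows that a hand-written list of is-a pairs usually needs a reflexive-transitive closure before it qualifies.

```python
def is_partial_order(entities, relation):
    """True if `relation` (pairs (a, b) meaning a precedes-or-equals b) is a partial order."""
    reflexive = all((e, e) in relation for e in entities)
    antisymmetric = all(a == b or (b, a) not in relation for (a, b) in relation)
    transitive = all(
        (a, d) in relation
        for (a, b) in relation
        for (c, d) in relation
        if b == c
    )
    return reflexive and antisymmetric and transitive


def reflexive_transitive_closure(entities, base_pairs):
    """Add (e, e) for every entity and all pairs implied by transitivity."""
    closure = set(base_pairs) | {(e, e) for e in entities}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


entities = {"Tendulkar", "cricketer", "sportsperson"}
base = {("Tendulkar", "cricketer"), ("cricketer", "sportsperson")}
is_a = reflexive_transitive_closure(entities, base)
```

The raw `base` pairs fail the check because they are neither reflexive nor transitively closed, while `is_a` passes and contains the implied pair ("Tendulkar", "sportsperson").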
Co-occurrence graph
Example:
Concept hierarchy construction
1 Start with a base ontology
2 Use co-occurrence patterns to guess conceptual relationships across terms
3 Use concept hierarchy to identify deeper co-occurrence patterns
4 Repeat from step 2 in a semi-automated fashion until the algorithm stabilizes
Co-occurrence graphs
Characteristics of co-occurrence graphs
Triadic closure (highly clustered)
Disconnected components or a single component of very small
diameter
Co-occurrence graph of all noun phrases in Wikipedia has a
diameter of 4
Co-occurrence support for entity pairs follows a power law
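Two of these characteristics, clustering and diameter, can be measured on any co-occurrence graph. The following is a minimal sketch on a hypothetical four-node graph stored as an adjacency dictionary; the function names are assumptions, and the diameter is computed per component using breadth-first search, which is only practical for small graphs.

```python
from collections import deque


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def eccentricity(adj, source):
    """Longest shortest-path length from `source` within its own component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Largest eccentricity over all nodes (largest component diameter if disconnected)."""
    return max(eccentricity(adj, n) for n in adj)


# Toy undirected graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

Node "a" closes its only triad (clustering 1.0), node "c" closes one of its three possible triads, and the longest shortest path in the graph has length 2.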
Co-citation
Co-citation and bibliographic coupling are important metrics in several
datasets like scientific literature, web pages, wikis, tagging systems like
delicious, etc.
Co-citation of a pair of documents corresponds to the co-occurrence of
these references (Ex. URLs) in a context
Pair-wise co-citation graphs have the same properties as co-occurrence
graphs
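Treating each citing page as a context makes co-citation counting the same computation as co-occurrence counting. A minimal sketch, assuming the link data is available as a mapping from each citing page to the list of references (for example URLs) it contains; the name `cocitation_counts` and the toy pages are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(outlinks):
    """For each unordered pair of targets, count the citing pages that reference both."""
    counts = Counter()
    for targets in outlinks.values():
        # Each citing page is one context; duplicates within a page count once.
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts


outlinks = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B"],
    "p3": ["B", "C", "C"],
}
cocited = cocitation_counts(outlinks)
```

Dividing each count by the number of citing pages would turn these counts into the support values used for co-occurrence graphs.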
Co-citation Patterns
Hyperlink distance across pairs of highly co-cited pages [8]
Figure: Hyperlink distance across pairs of highly co-cited Web pages (x-axis: hyperlink distance k, from 1 to 7, kmax and >kmax; y-axis: F)
Figure: Hyperlink distance across pairs of highly co-cited Wikipedia pages (x-axis: hyperlink distance k, from 1 to 7, kmax and >kmax; y-axis: F)
Co-citation Patterns
Hyperlink distance across pairs of highly co-cited pages
Endorsement of a citation
Page A endorses the content of page B
Users reading page A traverse this link and find page B useful too
Users create their own pages citing both A and B
If A has several outgoing links, and only some pairs of outlinks are co-cited, then co-citation can be seen as an endorsement of the citation
Topical aggregation
Document A represents content about a “higher-level” topic in terms of is-a or is-in relationships, and links to (hence co-cites) several pages on “lower-level” topics
Pages on the “lower-level” topics usually cite back the page on the “higher-level” topic, hence giving a citation distance of 2 among themselves
Nepotistic co-citations
Another major source of co-citation (primarily on web pages) is “nepotistic links” in the form of navigational tabs
Co-citation graph of a web crawl
Pairs of pages with at least 100 non-nepotistic co-citations
Co-citation graph of a web crawl
The co-citation graph depicts pairs of pages that have at least
100 non-nepotistic co-citations
In addition to being made of disconnected components, the
graph also shows various recurring structural motifs like:
Star
Clique
Clique chain
Dumb-bell
Interpretations for the above motifs along with examples are
explained in Mutalikdesai and Srinivasa (2009) [4]
Endorsed hyperlink graph (EHG)
On the web, co-citation usually implies a citation. Hence the EHG
is essentially a directed version of the co-citation graph. Some
EHG components are depicted below:
EHG clique chain
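One way to read "a directed version of the co-citation graph" is sketched below. It is an assumption, not the construction given in the talk: a hyperlink A -> B is kept only when A and B are co-cited by at least a threshold number of other pages. The function name, the `min_cocitations` parameter and the toy data are hypothetical, and the filtering of nepotistic links is not modelled.

```python
def endorsed_hyperlink_graph(outlinks, min_cocitations=100):
    """Keep hyperlink a -> b only if a and b are co-cited by enough third pages."""
    # citers[x] is the set of pages that link to x.
    citers = {}
    for source, targets in outlinks.items():
        for target in targets:
            citers.setdefault(target, set()).add(source)

    edges = set()
    for a, targets in outlinks.items():
        for b in targets:
            if a == b:
                continue
            # Pages citing both endpoints, excluding the endpoints themselves.
            common = (citers.get(a, set()) & citers.get(b, set())) - {a, b}
            if len(common) >= min_cocitations:
                edges.add((a, b))
    return edges


outlinks = {
    "A": ["B", "C"],   # A links to B and C
    "u1": ["A", "B"],  # two users co-cite A and B
    "u2": ["A", "B"],
    "u3": ["A"],
}
ehg = endorsed_hyperlink_graph(outlinks, min_cocitations=2)
```

With a threshold of 2, only the link from A to B survives: it is endorsed by the two pages that cite both, whereas the link from A to C has no co-citations at all.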
Endorsed citation graph (ECG) for scientific literature
ECG of citation info obtained from CiteSeer
Endorsed citation graph
The ECG over scientific literature data (using CiteSeer) shows
a similar componentization of the graph, except that the ECG has
one giant component
Citation in scientific literature has some subtle differences
from hyperlink citations
Scientific literature citations are always into the past
Very rarely (if at all) do scientific literature citations form
cyclic structures
The ECG consists mostly of weakly connected directed graph
components, while the EHG may contain strongly connected
components
ERank
Importance of a page within an EHG
ERank is an authority score of a page within an EHG (ECG)
component
Depicts reachability of the page within the component
ERank scores in a component are shown to be uncorrelated with the
PageRank scores of pages of that component
EndorSeer
A Firefox plugin for augmented browsing of CiteSeer
Currently shows endorsed citations from among the list of
citations from any paper
Currently underway: Show the ECG component and ECG
neighbourhood of a paper
Topical Anchors [6, 7]
Motivation
Example: “Will my oral insulin drugs, along with my hypertension
and high blood glucose, have any side effects on the health of my
pancreas?”
Can a machine detect diabetes as the context?
Another example: A document containing the names Andy
Roddick, Roger Federer and Rafael Nadal.
How likely is it that the word Tennis will be mentioned
(semantically) when discussing these players?
Co-occurrence context
Given a set of query terms, the co-occurrence context is
defined as the subgraph formed by the query terms and the
set of terms that co-occur with at least one of the query terms
Conjecture: The topical anchor of a set of terms is a highly authoritative term
that lies within the co-occurrence context of the query terms
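The definition of the co-occurrence context can be sketched in a few lines of Python. This is only an illustration: the dict-of-dicts graph representation, the function name cooccurrence_context and the toy edge weights are assumptions, not taken from the talk.

```python
def cooccurrence_context(graph, query_terms):
    """Return the subgraph induced by the query terms and their neighbours.

    graph maps each term to a dict of {co-occurring term: edge weight}.
    Edges are assumed to be stored in both directions (hypothetical layout).
    """
    nodes = set()
    for q in query_terms:
        if q in graph:
            nodes.add(q)
            nodes.update(graph[q])  # every term that co-occurs with q
    # Keep only the edges whose two endpoints are both inside the context.
    return {
        u: {v: w for v, w in graph.get(u, {}).items() if v in nodes}
        for u in nodes
    }


# Toy co-occurrence graph (made-up weights, for illustration only).
toy = {
    "Federer": {"Tennis": 5, "Wimbledon": 3},
    "Nadal": {"Tennis": 4, "Clay": 2},
    "Tennis": {"Federer": 5, "Nadal": 4, "Racket": 1},
    "Wimbledon": {"Federer": 3},
    "Clay": {"Nadal": 2},
    "Racket": {"Tennis": 1},
}
context = cooccurrence_context(toy, ["Federer", "Nadal"])
```

In this toy graph, Racket co-occurs only with Tennis and not with any query term, so it falls outside the context.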
Online Page Importance Computation (OPIC)
Each node i in the context is initialised with a cash c_i.
A node a is picked at random and its cash c_a is added to its history h_a.
Then c_a is distributed amongst all its neighbours in proportion to the edge
weights.
This process is iterated until the ratios among the h_i become nearly constant.
The node with the largest h_i is chosen as the most central node.
Unfortunately, OPIC was seen to be unsuitable for determining topical anchors,
since it tends to find central nodes for the entire graph
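The OPIC steps listed above can be sketched as follows. This is a minimal sketch under stated assumptions: the graph is a symmetric dict of dicts, the initial cash is uniform, and a fixed number of steps stands in for the convergence check on the history ratios. The function name opic and the toy weights are hypothetical.

```python
import random


def opic(graph, steps=20000, seed=0):
    """Run a simple OPIC-style cash redistribution and return the histories."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    cash = {n: 1.0 / len(nodes) for n in nodes}   # initial cash c_i (assumed uniform)
    history = {n: 0.0 for n in nodes}             # cash history h_i
    for _ in range(steps):
        a = rng.choice(nodes)                     # pick a node at random
        total = sum(graph[a].values())
        if total == 0:
            continue                              # isolated node: nothing to distribute
        amount, cash[a] = cash[a], 0.0
        history[a] += amount                      # add c_a to h_a
        for b, w in graph[a].items():
            cash[b] += amount * w / total         # split c_a by edge weight
    return history


# Toy star-shaped graph (made-up weights, for illustration only).
toy = {
    "Tennis": {"Federer": 5, "Nadal": 4, "Racket": 1},
    "Federer": {"Tennis": 5},
    "Nadal": {"Tennis": 4},
    "Racket": {"Tennis": 1},
}
history = opic(toy)
most_central = max(history, key=history.get)
```

On this toy graph the hub node accumulates the largest history, which mirrors the observation that OPIC gravitates towards nodes that are central to the whole graph.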
Cash Leaking Random Walk
Co-occurrence graphs have extremely small diameters (4-5).
Roger Federer to feral child in two hops.
Football becomes most central to Roger Federer and Rafael
Nadal instead of Tennis.
Solution: Cash Leakage
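The slide does not spell out the leakage rule, so the following is only one plausible reading, not the authors' algorithm: cash starts at the query terms, and at every hop a fixed fraction of it leaks away, so that far-away hubs receive very little. The leak fraction, the synchronous rounds (used here in place of random node picks), the function name and the toy graph are all assumptions.

```python
def cash_leaking_walk(graph, query_terms, leak=0.5, rounds=20):
    """Spread cash from the query terms, losing a fraction `leak` at every hop."""
    cash = {n: 0.0 for n in graph}
    for q in query_terms:
        cash[q] = 1.0                               # cash is injected at query terms
    history = {n: 0.0 for n in graph}
    for _ in range(rounds):
        nxt = {n: 0.0 for n in graph}
        for a, amount in cash.items():
            if amount == 0.0:
                continue
            history[a] += amount                    # record the cash seen at a
            total = sum(graph[a].values())
            if total == 0:
                continue
            kept = amount * (1.0 - leak)            # the rest leaks out of the graph
            for b, w in graph[a].items():
                nxt[b] += kept * w / total          # split by edge weight
        cash = nxt
    return history


# Toy chain: the query terms are one hop from Tennis, three hops from Football.
toy = {
    "Federer": {"Tennis": 1},
    "Nadal": {"Tennis": 1},
    "Tennis": {"Federer": 1, "Nadal": 1, "Sport": 1},
    "Sport": {"Tennis": 1, "Football": 1},
    "Football": {"Sport": 1},
}
history = cash_leaking_walk(toy, ["Federer", "Nadal"])
```

Because cash decays with every hop, Tennis collects far more history than Football in this toy example, even though both are reachable from the query terms.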
Bias and History Vectors
The way centrality is computed introduces a hidden bias
among the query words.
Example: Jim Carrey, Hugh Grant, Rajkumar
Bias due to difference in neighbourhood sizes
Bias due to polysemy
Example: Java, Beans, Kaffe
Bias examples
Query Terms | Topical Anchors
Java, Beans, Kaffe | Programming language, Indonesia, Food
United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau
Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number
MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay
Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree
Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson
Table: Examples with irrelevant topical anchors
Solution to the topic bias problem
Labelled cash.
Vector models of CLRW
Cash from each query term q_i is given a “colour” c_i. The cash history at
any node is hence a vector of the form (v_1, v_2, ..., v_n) showing the cash flow history
for each of the colours. The vector is then normalized as:
$$v_i = \frac{v_i}{\hat{v}}, \qquad \text{where } \hat{v} = \max_i v_i \text{ and } v_i \in [0, 1]$$
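A small sketch of this normalization, assuming the maximum is taken over the colours of a single node's history vector (the slide does not say so explicitly). The function name normalize_history is hypothetical.

```python
def normalize_history(vector):
    """Divide every colour's cash history by the largest entry of the vector."""
    peak = max(vector)
    if peak == 0:
        return [0.0 for _ in vector]   # no cash reached this node at all
    return [v / peak for v in vector]


# A node that received cash 2, 4 and 1 from three query terms.
normalized = normalize_history([2.0, 4.0, 1.0])
```

After this step every entry lies in [0, 1], and the colour that contributed most to the node maps to 1.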
Projection
The line joining 0^n to 1^n represents points where all
query terms have contributed equally to the cash history.
This is called the baseline.
Hence, for any given node, the projection of its normalized
cash history vector onto the baseline represents the
importance of the node as a topical anchor
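As a rough illustration, the projection score can be computed as the scalar projection of a normalized cash history vector onto the direction of 1^n. The function name baseline_projection is an assumption; the talk does not give an explicit formula.

```python
import math


def baseline_projection(vector):
    """Scalar projection of `vector` onto the all-ones (baseline) direction."""
    # (v . 1) / ||1|| = sum(v) / sqrt(n)
    return sum(vector) / math.sqrt(len(vector))


balanced = baseline_projection([1.0, 1.0, 1.0, 1.0])   # all colours contribute fully
lopsided = baseline_projection([1.0, 0.0, 0.0, 0.0])   # only one colour contributes
```

A node that received cash from every query term projects further along the baseline than a node fed by a single term.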
Euclidean Distance
The Euclidean metric computes the L2 distance between the
normalized cash history vector of a candidate node and 1^n
It favours uniformity in the cash history distribution over the
overall magnitude of the cash history
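A minimal sketch of this metric, assuming smaller distances rank higher. The function name distance_to_ones is hypothetical.

```python
import math


def distance_to_ones(vector):
    """L2 distance between a normalized cash history vector and 1^n."""
    return math.sqrt(sum((1.0 - v) ** 2 for v in vector))


uniform = distance_to_ones([1.0, 1.0])   # every query term contributed equally
skewed = distance_to_ones([1.0, 0.0])    # one query term contributed nothing
```

The uniform vector sits exactly on 1^n, while the skewed one is penalized for the colour it is missing.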
Cosine Similarity
Computes the cosine of the angle between a given node’s
normalized cash history vector and 1^n
Another metric for factoring in both uniformity in cash
distribution and magnitude
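A short sketch of the cosine score, with larger values assumed to rank higher. The function name cosine_to_ones is hypothetical.

```python
import math


def cosine_to_ones(vector):
    """Cosine of the angle between `vector` and the all-ones vector 1^n."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return 0.0                       # no cash history: treat as dissimilar
    return sum(vector) / (norm * math.sqrt(len(vector)))


aligned = cosine_to_ones([1.0, 1.0])     # points along the baseline
skewed = cosine_to_ones([1.0, 0.0])      # dominated by a single query term
```

A vector lying along the baseline scores close to 1, and the score drops as the cash history becomes more lopsided.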
Example results
Query Terms | Projection | Euclidean | Cosine
United States Dollar, Euro, West African CFA franc | French language, Guinea, Guinea-Bissau | Currency, Bank, France | Currency, Bank, France
Bayes, Euclid, Ramanujan, Bernoulli | Probability, Mathematics, Number | Mathematics, Mathematician, Euler | Mathematics, Mathematician, Probability distribution
MIT, Stanford, IIT | University, Indian Institute of Technology, Bombay | University, College, Technology | University, College, Science
Leaf, Fruit, Stem, Photosynthesis | Linguistics, Plant, Tree | Plant, Tree, Species | Plant, Tree, Species
Bernoulli, Poisson, Weibull, Binomial | Godwin, Norway, Harold Godwinson | Mathematics, Probability, Expected Value | Mathematics, Probability, Statistics
User evaluation
Experimental Setup:
86 volunteer users were given a set of queries and asked to provide topical
labels for these queries ranked according to their perceived importance
66 volunteers answered 100 questions, while the rest answered 30 random
questions chosen from the 100 questions
User responses were charted for consistency in results (chart shown below)
User evaluation
CLRW against tf-idf and OPIC
Comparison
Comparison with the Automatic Topic Labeling (ATL) algorithm [3]
Caveats: The comparison is with the Euclidean algorithm. ATL requires document
contexts where the topical anchor is present (unlike CLRW, which searches on
the co-occurrence graph built over a corpus)
Future Work
Several open questions:
Topical markers, semantic siblings
Co-occurrence semantics when coupled with concept
hierarchies
Automatic detection of semantic relations based on
co-occurrence
Automatic attribute identification
Thank You!
References
[1] A. Biletzki and A. Matar. Ludwig Wittgenstein (second revision). Stanford Encyclopedia of Philosophy, May 2009.
[2] W. Gerstner and W. M. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[3] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 490–499, New York, NY, USA, 2007. ACM.
[4] M. R. Mutalikdesai and S. Srinivasa. Co-citations as endorsements of citations. Submitted for publication, 2009.
[5] S. Pinker. The Language Instinct. Harper Perennial Modern Classics, 2007.
[6] A. R. Rachakonda and S. Srinivasa. Finding the topical anchors of a context using lexical cooccurrence data. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2009.
[7] A. R. Rachakonda and S. Srinivasa. Vector-based ranking techniques for identifying the topical anchors of a context. In Proceedings of the 15th International Conference on Management of Data (COMAD), 2009.
[8] S. Reddy, S. Srinivasa, and M. R. Mutalikdesai. Measures of “ignorance” on the web. In Proceedings of the International Conference on Management of Data (COMAD), Dec 2006.