Web Dynamics
Analysis and Implications from
Search Perspective
1
Lecturers
Dr. Ismail Sengör Altingövde
– Senior researcher
– L3S Research Center
– Hannover, Germany
Dr. Nattiya Kanhabua
– Postdoctoral researcher
– L3S Research Center
– Hannover, Germany
2
L3S Research Center
 Computer science and interdisciplinary research
on all aspects related to the Web
3
People
 80 researchers: 53% post-docs & PhD students,
38% scientific research assistants & int’l interns
 Researchers from 15 countries across the US, Latin America, Asia, Europe, and Africa
 Equal opportunity: 27% females
4
All project inventory!
www.l3s.de
5
L3S: Some Projects
Cubrik: Human-enhanced
time-aware multimedia search
ARCOMEM: From collect-all
archives to community memories
Terence: Adaptive learning
for poor comprehenders
M-Eco: The medical ecosystem
detecting disease outbreaks
6
Wanted!
7
Outline
Day 1: Introduction
• Evolution of the Web
• Overview of research topics
• Content and query analysis
Day 2: Evolution of Web Search Results
• Short-term impacts
• Longitudinal analysis
Day 3: Indexing the Past
• Indexing and searching versioned documents
Day 4: Retrieval and Ranking
• Searching the past
• Searching the future
8
Evolution of the Web
• The Web is changing over time in many aspects:
– Size: web pages are added/deleted all the time
– Content: web pages are edited/modified
– Query: users’ information needs change, and entity relationships change over time
– Usage: users’ behaviors change over time
[Ke et al., CN 2006; Risvik et al., CN 2002]
[Dumais, SIAM-SDM 2012; WebDyn 2010]
Size dynamics
• Challenge
– Crawling, indexing, and caching
10
Web and index sizes (1995–2012)
• 2000: first billion-URL index — the world’s largest, served from ≈5,000 PCs in clusters
• 2004: index grows to 4.2 billion pages
• 2008: Google counts 1 trillion unique URLs
Web and index sizes
http://www.worldwidewebsize.com/
Content dynamics
• Challenge
– Document representation and retrieval
Wayback Machine: a web archive search tool by the Internet Archive
16
Query dynamics
• Challenge
– Time-sensitive queries
– Query understanding and processing
Google Insights for Search: http://www.google.com/insights/search/
Query: Halloween
18
Usage dynamics
• Challenge
– Browsing and search behavior
– User preference (e.g., “likes”)
19
Example application
• Searching documents created/edited over time
– E.g., web archives, news archives, blogs, or emails — collectively, “temporal document collections”
• Example: retrieve documents about Pope Benedict XVI written before 2005
– Term-based IR approaches may give unsatisfactory results
Wayback Machine
• A web archive search tool by the Internet Archive
– Query by a URL, e.g., http://www.ntnu.no
– No keyword query
– No relevance ranking
(Screenshot retrieved on 15 January 2011)
25
Google News Archive Search
• A news archive search tool by Google
– Query by keywords
– Rank results by relevance or date
– Does not consider terminology changes over time
Research topics
• Content Analysis
– Determining timestamps of documents
– Temporal information extraction
• Query Analysis
– Determining time of queries
– Named entity evolution
– Query performance prediction
• Evolution of Search Results
– Short-term impacts on result caches
– Longitudinal analysis of search results
28
Research topics (cont’d)
• Indexing
– Indexing and query processing techniques for the
versioned document collections
• Retrieval and Ranking
– Searching the past
– Searching the future
29
Content Analysis
(1) Determining timestamps of documents
(2) Temporal information extraction
Motivation
• Incorporating the time dimension into
search can increase retrieval effectiveness
– Only if temporal information is available
• Research question
– How to determine the temporal information of
documents?
[Kanhabua, SIGIR Forum 2012]
31
Two time aspects
1. Publication or modified time
• Problem: determining timestamps of documents
• Method: rule-based techniques, or temporal language models
2. Content or event time
• Problem: temporal information extraction
• Method: natural language processing, or time and event recognition algorithms
(Figure: an example document annotated with its publication time and content time)
Determining time of documents
• Problem statements: it is difficult to find a trustworthy time for web documents
– Time gap between crawling and indexing
– Decentralization and relocation of web documents
– No standard metadata for time/date
• Research question: “For a given document with an uncertain timestamp, can the contents be used to determine the timestamp with sufficiently high confidence?”
• Example: given a bible-like document of unknown creation date, a dating method might answer that it was probably written in 850 A.D. with 95% confidence
Current approaches
1. Content-based
2. Link-based
3. Hybrid
37
Content-based approach
Temporal Language Models [de Jong et al., AHC 2005]
• Based on the statistical usage of words over time
• Compare each word of a non-timestamped document with a reference corpus
• Tentative timestamp: the time partition that most overlaps in word usage
Reference corpus (partition: word, frequency):
• 1999: tsunami (1), Japan (1), tidal wave (1)
• 2004: tsunami (1), Thailand (1), earthquake (1)
Example: a non-timestamped document contains “tsunami” and “Thailand”
• Score(1999) = 1; Score(2004) = 1 + 1 = 2, so the most likely timestamp is 2004
Normalized log-likelihood ratio
• A variant of the Kullback-Leibler divergence
• Measures the similarity of a document and the time partitions
• C is the background model estimated on the corpus
• Linear interpolation smoothing avoids zero probabilities for unseen words
[Kraaij, SIGIR Forum 2005]
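The dating scheme above can be sketched in a few lines of Python. This is a minimal illustration of NLLR-style dating in the spirit of de Jong et al. and Kraaij, not the authors' implementation; the smoothing weight `lam = 0.1` and the toy two-partition corpus are assumptions.

```python
# Temporal language model dating: score each time partition by
#   NLLR(d, p) = sum over words w in d of P(w|d) * log(P_smoothed(w|p) / P(w|C)),
# where C is the background (whole-corpus) model and the partition model is
# smoothed by linear interpolation with C to avoid zero probabilities.
from collections import Counter
import math

def nllr(doc_words, partition_counts, corpus_counts, lam=0.1):
    doc_tf = Counter(doc_words)
    doc_len = len(doc_words)
    part_len = sum(partition_counts.values())
    corpus_len = sum(corpus_counts.values())
    score = 0.0
    for w, tf in doc_tf.items():
        p_doc = tf / doc_len
        p_part = partition_counts.get(w, 0) / part_len if part_len else 0.0
        p_corpus = corpus_counts.get(w, 0) / corpus_len
        if p_corpus == 0:
            continue  # word unseen in the reference corpus: no evidence
        smoothed = (1 - lam) * p_part + lam * p_corpus  # linear interpolation
        score += p_doc * math.log(smoothed / p_corpus)
    return score

def date_document(doc_words, partitions):
    """Return the partition whose language model best explains the document."""
    corpus = Counter()
    for counts in partitions.values():
        corpus.update(counts)
    return max(partitions, key=lambda p: nllr(doc_words, partitions[p], corpus))

partitions = {
    1999: Counter({"tsunami": 1, "japan": 1, "tidal": 1, "wave": 1}),
    2004: Counter({"tsunami": 1, "thailand": 1, "earthquake": 1}),
}
print(date_document(["tsunami", "thailand"], partitions))  # → 2004
```

On the slide's toy corpus this reproduces the example: the 2004 partition wins because both document words are likely under its (smoothed) model.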
Improving temporal LMs
• Enhancement techniques
1. Semantic-based data preprocessing
– Intuition: direct comparison between extracted words and corpus partitions has limited accuracy
– Approach: integrate semantic-based techniques into document preprocessing
2. Search statistics to enhance similarity scores
– Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
– Approach: linearly combine a GZ score with the normalized log-likelihood ratio
3. Temporal entropy as term weights
– Intuition: a term’s weight depends on how good the term is at separating time partitions (how discriminative it is)
– Approach: temporal entropy, based on the term selection presented by Lochbaum and Streeter
[Kanhabua et al., ECDL 2008]
Semantic-based preprocessing
Intuition: direct comparison between extracted words and corpus partitions has limited accuracy
Approach: integrate semantic-based techniques into document preprocessing
• Part-of-speech tagging: select only interesting classes of words, e.g., nouns, verbs, and adjectives
• Collocation extraction: co-occurrence of different words can alter the meaning, e.g., “United States”
• Word sense disambiguation: identify the correct sense of a word from context, e.g., “bank”
• Concept extraction: compare concepts instead of original words, e.g., “tsunami” and “tidal wave” share the concept “disaster”
• Word filtering: select the top-ranked words according to TF-IDF scores for comparison
[Kanhabua et al., ECDL 2008]
Leveraging search statistics
Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition
Approach: linearly combine a GZ score with the normalized log-likelihood ratio
• P(wi) is the probability that wi occurs: P(wi) = 1.0 for a gaining query, P(wi) = 0.5 for a declining query
• f(R) converts a rank into a weight; a higher-ranked query is more important
• ipf = log(N/n) is an inverse partition frequency
[Kanhabua et al., ECDL 2008]
Temporal entropy
Intuition: a term’s weight depends on how good the term is at separating time partitions (how discriminative it is)
Approach: temporal entropy, based on the term selection presented by Lochbaum and Streeter
• A measure of the temporal information a word conveys: it captures the importance of a term in the document collection, whereas TF-IDF weights a term in a particular document
• Tells how good a term is at separating one partition from the others: a term occurring in few partitions has higher temporal entropy than one appearing in many partitions
• The higher a term’s temporal entropy, the better it represents a partition
• In the formula, Np is the total number of partitions in the corpus and P(p|wi) is the probability that partition p contains term wi
[Kanhabua et al., ECDL 2008]
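A small sketch makes the weighting concrete. The normalisation below follows the usual Lochbaum-and-Streeter-style term entropy, TE(w) = 1 + (1/log Np) · Σp P(p|w) log P(p|w) with P(p|w) proportional to the term's frequency in partition p; treating it as the exact formula from the slide is an assumption, since the slide's equation image is not reproduced here.

```python
# Temporal entropy as a term weight: 1.0 for a term concentrated in a single
# partition (maximally discriminative), 0.0 for a term spread uniformly over
# all partitions (carries no temporal signal).
import math

def temporal_entropy(term, partitions):
    """partitions: dict mapping partition id -> {term: frequency}."""
    n_p = len(partitions)
    total = sum(counts.get(term, 0) for counts in partitions.values())
    if total == 0 or n_p < 2:
        return 0.0
    entropy = 0.0
    for counts in partitions.values():
        p = counts.get(term, 0) / total  # P(p | term)
        if p > 0:
            entropy += p * math.log(p)
    return 1.0 + entropy / math.log(n_p)

partitions = {
    1999: {"tsunami": 1, "japan": 1},
    2004: {"tsunami": 1, "thailand": 1},
}
# "thailand" occurs in a single partition, so it is maximally discriminative:
print(temporal_entropy("thailand", partitions))  # → 1.0
# "tsunami" is spread evenly over both partitions, so it carries no signal:
print(temporal_entropy("tsunami", partitions))   # → 0.0
```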
Link-based approach
• Dating a document using its neighbors:
1. Web pages linking to the document (incoming links)
2. Web pages pointed to by the document (outgoing links)
3. Media assets associated with the document (e.g., images)
• The timestamp is estimated by averaging the last-modified dates of the neighbors
[Nunes et al., WIDM 2007]
55
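The averaging step is simple enough to sketch directly. Equal weighting of the three neighbor classes is an assumption here; the cited work explores several combinations.

```python
# Neighbor-based dating: estimate a page's date as the average of the
# last-modified dates of its in-links, out-links and embedded media.
from datetime import date, timedelta

def estimate_date(neighbor_dates):
    """neighbor_dates: list of date objects gathered from the neighbors."""
    if not neighbor_dates:
        return None
    epoch = date(1970, 1, 1)
    # Average as day offsets from a fixed epoch, then convert back to a date.
    mean_days = sum((d - epoch).days for d in neighbor_dates) / len(neighbor_dates)
    return epoch + timedelta(days=round(mean_days))

neighbors = [date(2006, 1, 1), date(2006, 1, 11), date(2006, 1, 21)]
print(estimate_date(neighbors))  # → 2006-01-11
```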
Hybrid approach
• Inferring timestamps using machine learning
– Exploits links and the contents of a web page and its neighbors
– Features: linguistic, position, page format, and tags
[Chen et al., SIGIR 2010]
56
Temporal information extraction
• Three types of temporal expressions
1. Explicit: time mentions mapped directly to a time point or interval, e.g., “July 4, 2012”
2. Implicit: an imprecise time point or interval, e.g., “Independence Day 2012”
3. Relative: resolved to a time point or interval using other expressions or the publication date, e.g., “next month”
[Alonso et al., SIGIR Forum 2007]
57
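The three types differ mainly in what they need for resolution, which a toy resolver can illustrate. Real taggers such as HeidelTime or TARSQI use far richer rule sets; the patterns, the `HOLIDAYS` table and the ISO-only explicit format below are illustrative assumptions.

```python
# Toy resolution of the three temporal expression types:
#  - explicit: parses directly ("2012-07-04")
#  - implicit: needs a knowledge lookup ("Independence Day 2012")
#  - relative: needs an anchor such as the publication date ("next month")
from datetime import date
import re

HOLIDAYS = {"independence day": (7, 4)}  # implicit name -> (month, day)

def resolve(expression, pub_date):
    expr = expression.lower().strip()
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", expr)  # explicit, ISO form
    if m:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    m = re.fullmatch(r"([a-z ]+?) (\d{4})", expr)  # implicit: "<holiday> <year>"
    if m and m.group(1) in HOLIDAYS:
        month, day = HOLIDAYS[m.group(1)]
        return date(int(m.group(2)), month, day)
    if expr == "next month":  # relative: anchored on the publication date
        year, month = pub_date.year, pub_date.month + 1
        if month > 12:
            year, month = year + 1, 1
        return date(year, month, 1)
    return None

pub = date(2012, 6, 15)
print(resolve("2012-07-04", pub))             # → 2012-07-04
print(resolve("Independence Day 2012", pub))  # → 2012-07-04
print(resolve("next month", pub))             # → 2012-07-01
```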
Current approaches
• Extract temporal expressions from unstructured text
– Time and event recognition algorithms
• Harvest temporal knowledge from semi-structured content such as Wikipedia infoboxes
– Rule-based methods
[Verhagen et al., ACL 2005; Strötgen et al., SemEval 2010]
[Hoffart et al., AIJ 2012]
58
References
• [Alonso et al., SIGIR Forum 2007] Omar Alonso, Michael Gertz, Ricardo A. Baeza-Yates: On the
value of temporal information in information retrieval. SIGIR Forum 41(2): 35-41 (2007)
• [Chen et al., SIGIR 2010] Zhumin Chen, Jun Ma, Chaoran Cui, Hongxing Rui, Shaomang Huang: Web
page publication time detection and its application for page rank. SIGIR 2010: 859-860
• [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-
SDM 2012
• [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language
models for the disclosure of historical text. AHC 2005: 161-168
• [Kanhabua et al., ECDL 2008] Nattiya Kanhabua, Kjetil Nørvåg: Improving Temporal Language
Models for Determining Time of Non-timestamped Documents. ECDL 2008: 358-370
• [Kanhabua, SIGIR Forum 2012] Nattiya Kanhabua: Time-aware approaches to information retrieval.
SIGIR Forum 46(1): 85 (2012)
• [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their
ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447
(2006)
• [Kraaij, SIGIR Forum 2005] Wessel Kraaij: Variations on language modeling for information retrieval.
SIGIR Forum 39(1): 61 (2005)
• [Nunes et al., WIDM 2007] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Using neighbors to date
web documents. WIDM 2007: 129-136
• [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics. Computer Networks 39(3): 289-302 (2002)
References (cont’d)
• [Strötgen et al., SemEval 2010] Jannik Strötgen, Michael Gertz: Heideltime: High quality rule-based
extraction and normalization of temporal expressions. SemEval 2010: 321-324
• [WebDyn 2010] Web Dynamics course: http://www.mpi-
inf.mpg.de/departments/d5/teaching/ss10/dyn/, Max-Planck Institute for Informatics, Saarbrücken,
Germany, 2010
• [Verhagen et al., ACL 2005] Marc Verhagen, Inderjeet Mani, Roser Sauri, Jessica Littman, Robert
Knippen, Seok Bae Jang, Anna Rumshisky, John Phillips, James Pustejovsky: Automating Temporal
Annotation with TARSQI. ACL 2005
60
Questions?
61
Query Analysis
(1) Determining time of queries
(2) Named entity evolution
(3) Query performance prediction
Temporal queries
• Temporal information needs
– Searching temporal document collections, e.g., digital libraries, web/news archives
– Users: historians, librarians, journalists, or students
• Temporal queries exist in both standard collections and the Web
– Relevance is dependent on time
– Documents are about events at particular times
[Berberich et al., ECIR 2010]
63
Distribution over time of Qrels
(Figure: example temporal distributions of relevant documents for a recency query, a time-sensitive query, and a time-insensitive query)
[Li et al., CIKM 2003]
64
Types of temporal queries
• Two types of temporal queries
1. Explicit: time is provided, e.g., “Presidential election 2012”
2. Implicit: time is not provided, e.g., “Germany World Cup”
• The temporal intent can be implicitly inferred, i.e., the query refers to the World Cup event in 2006
• Studies of web search query logs show a significant fraction of temporal queries
– 1.5% of web queries are explicitly temporal and ~7% are implicitly temporal
– In another study, 13.8% of queries contain explicit time and 17.1% have implicitly provided temporal intent
[Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]
65
Challenges with temporal queries
• Semantic gaps: lacking knowledge about
1. the possibly relevant time of queries
2. terminology (named entity) changes over time
(Figure: for a query, the system suggests candidate times time1, time2, ..., timek)
Determining time of queries
• Problem statements
– Implicit temporal queries: users have no knowledge about the relevant time of a query
– Difficult to achieve high accuracy using only keywords
– Relevant results associated with a particular time are not given
• Research question
– How to determine the time of an implicit temporal query, and use the determined time to improve search results?
70
Current approaches
1. Query log analysis
2. Search result analysis
[Kanhabua et al., ECDL 2010]
71
Query log analysis
• Mining query logs
– Analyze query frequencies over time for identifying
the relevant time of queries
– Re-rank search results of implicit temporal queries
using the determined time
[Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010]
72
Search result analysis
• Using temporal language models
– Determine time of queries when no time is given explicitly
– Re-rank search results using the determined time
• Exploiting time from search snippets
– Extract temporal expressions (i.e., years) from the
contents of top-k retrieved web snippets for a given query
– Content-based language-independent approach
[Kanhabua et al., ECDL 2010; Campos et al., TempWeb 2012]
73
Determining time of queries
• Approach I: dating using keywords*
• Approach II: dating using top-k documents*
– Queries are short keywords
– Inspired by pseudo-relevance feedback
• Approach III: using the timestamps of top-k documents
– No temporal language models are used
*Using the temporal language models proposed by de Jong et al.
[Kanhabua et al., ECDL 2010]
74
I. Dating using keywords
(Figure: the query keywords are dated with temporal language models, producing the query’s temporal profile)
[Kanhabua et al., ECDL 2010]
II. Dating using top-k documents
(Figure: the top-k retrieved documents are dated with temporal language models, producing the query’s temporal profile)
[Kanhabua et al., ECDL 2010]
III. Using timestamps of documents
(Figure: the publication timestamps of the top-k retrieved documents form the query’s temporal profile)
[Kanhabua et al., ECDL 2010]
Re-ranking search results
• Intuition: documents published close to the time of the query are more relevant
– Assign document priors based on publication dates
(Figure: for a query over a news archive with determined time 2005, 2004, 2006, ..., an initially top-ranked document from 2009 is demoted in favor of a document from 2005)
[Kanhabua et al., ECDL 2010]
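The re-ranking idea can be sketched as a prior that decays with the distance between a document's publication year and the query's determined time. The exponential decay and the mixing weight `alpha` are illustrative assumptions, not the exact prior from the paper.

```python
# Time-based re-ranking: combine the text score with a publication-date prior
# that peaks when the document falls in one of the query's determined years.
import math

def time_prior(doc_year, query_years, decay=0.5):
    return max(math.exp(-decay * abs(doc_year - qy)) for qy in query_years)

def rerank(results, query_years, alpha=0.7):
    """results: list of (doc_id, text_score, publication_year) tuples."""
    scored = [
        (doc, alpha * score + (1 - alpha) * time_prior(year, query_years))
        for doc, score, year in results
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)

results = [("d2009", 0.80, 2009), ("d2005", 0.75, 2005)]
print(rerank(results, query_years=[2004, 2005, 2006]))
# d2005 overtakes d2009: its prior is 1.0 vs. exp(-1.5) ≈ 0.22 for d2009
```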
Challenges of temporal search
• Semantic gaps: lacking knowledge about
1. the possibly relevant time of queries
2. named entity changes over time
(Figure: for an entity query, the system suggests time-dependent synonyms synonym@2001, synonym@2002, ..., synonym@2011)
83
Named entity evolution
• Queries for named entities (people, companies, places)
– Highly dynamic in appearance, i.e., the relationships between terms change over time
– E.g., changes of roles, name alterations, or semantic shift
• Scenario 1
– Query: “Pope Benedict XVI”, written before 2005
– Documents about “Joseph Alois Ratzinger” are relevant
• Scenario 2
– Query: “Hillary R. Clinton”, written from 1997 to 2002
– Documents about “New York Senator” and “First Lady of the United States” are relevant
Named entity evolution
• Research question: how to detect named entity changes in web documents?
87
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
88
Current approaches
1. Temporal co-occurrence
2. Temporal association rule mining
3. Temporal knowledge extraction
– Ontology
– Wikipedia history
89
Temporal co-occurrence
• Measure the degree of relatedness of two entities at different times by comparing term contexts
• Requires a recurrent computation at query time, which reduces efficiency and scalability
[Berberich et al., WebDB 2009]
90
Association rule mining
• Temporal association rule mining
– Discover semantically identical concepts (or named entities) that are used at different times
– Two entities are semantically related if their associated events occur multiple times in a collection
– Events are represented as sentences containing a subject, a verb, objects, and nouns
[Kaluarachchi et al., CIKM 2010]
91
Temporal knowledge extraction
• YAGO ontology
– Extract named entities from the YAGO ontology
– Track named entity evolution using the New York
Times Annotated Corpus
• Wikipedia history
– Define a time-based synonym as a term semantically
related to a named entity at a particular time period
– Extract synonyms of named entities from anchor texts
in article links using the whole history of Wikipedia
[Mazeika et al., CIKM 2011; Kanhabua et al., JCDL 2010]
92
Searching with name changes
• Extract time-based synonyms from Wikipedia
– Synonyms are words with similar meanings; in this context, synonyms are name variants (name changes, titles, or roles) of a named entity
• E.g., “Cardinal Joseph Ratzinger” is a synonym of “Pope Benedict XVI” before 2005
• Two types of time-based synonyms
1. Time-independent
2. Time-dependent
[Kanhabua et al., JCDL 2010]
93
Recognize named entities
(Figure: named entities are recognized from the Wikipedia snapshot at time tk)
[Kanhabua et al., JCDL 2010]
Find synonyms
• Find the set of entity-synonym relationships at time tk
• For each ei ∈ Etk, extract anchor texts from article links, e.g.:
– Entity: President_of_the_United_States
– Synonym: George W. Bush
– Time: 11/2004
[Kanhabua et al., JCDL 2010]
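The anchor-text extraction step can be sketched as follows, in the spirit of the cited work: every anchor text pointing at an entity page in the snapshot at time tk is a candidate synonym at tk. The `[[Target|anchor]]` link format and the snapshot representation below are illustrative assumptions, not Wikipedia's real dump format.

```python
# Mine time-based synonyms from wiki-style anchor texts: an anchor that
# differs from the target page title is recorded as (synonym, time).
import re
from collections import defaultdict

LINK = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")  # [[Target|anchor text]]

def extract_synonyms(snapshots):
    """snapshots: list of (time, wikitext). Returns {entity: {(synonym, time)}}."""
    synonyms = defaultdict(set)
    for time, text in snapshots:
        for target, anchor in LINK.findall(text):
            # Skip anchors that merely repeat the page title.
            if anchor.strip().lower() != target.replace("_", " ").lower():
                synonyms[target].add((anchor.strip(), time))
    return synonyms

snapshots = [
    ("11/2004", "... [[President_of_the_United_States|George W. Bush]] ..."),
    ("03/2009", "... [[President_of_the_United_States|Barack Obama]] ..."),
]
result = extract_synonyms(snapshots)
print(sorted(result["President_of_the_United_States"]))
# → [('Barack Obama', '03/2009'), ('George W. Bush', '11/2004')]
```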
Initial results
• Time periods are not accurate
– Note: the times of the synonyms are the timestamps of the Wikipedia articles (an 8-year span)
[Kanhabua et al., JCDL 2010]
101
103
• Analyze NYT Corpus to discover accurate time
– 20-year time span (1987-2007)
• Use the burst detection algorithm
– Time periods of synonyms = burst intervals
Enhancement using NYT
Initial results
[Kanhabua et al., JCDL 2010]
104
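A simple thresholding sketch conveys the idea of "time periods of synonyms = burst intervals": flag years whose mention count exceeds a multiple of the overall mean and merge adjacent flagged years. This stands in for the burst detection algorithm used in the paper (an assumption); Kleinberg's state-machine model is a common, more principled alternative.

```python
# Burst detection over a synonym's yearly mention counts.

def burst_intervals(counts, factor=1.5):
    """counts: {year: mentions}. Returns a list of (start_year, end_year)."""
    mean = sum(counts.values()) / len(counts)
    bursty = sorted(y for y, c in counts.items() if c > factor * mean)
    intervals = []
    for year in bursty:
        if intervals and year == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], year)  # extend current interval
        else:
            intervals.append((year, year))
    return intervals

# Mentions of a synonym such as "Cardinal Joseph Ratzinger" (toy numbers):
counts = {2001: 2, 2002: 3, 2003: 2, 2004: 20, 2005: 35, 2006: 4}
print(burst_intervals(counts))  # → [(2004, 2005)]
```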
Query expansion
1. A user enters an entity as a query
2. The system retrieves synonyms wrt. the query
3. The user selects synonyms to expand the query
QUEST Demo: http://research.idi.ntnu.no/wislab/quest/
[Kanhabua et al., ECML PKDD 2010]
Query prediction problems
1. Performance prediction
– Predict the retrieval effectiveness (e.g., precision, recall, MAP) wrt. a ranking model
2. Retrieval model prediction
– Predict the retrieval model that is most suitable, i.e., the ranking that maximizes precision, recall, or MAP
Query performance prediction
• Problem statement: predict the effectiveness (e.g., MAP) that a query will achieve, in advance of or during retrieval
– High MAP → “good” query; low MAP → “poor” query
• Objective: apply query enhancement techniques to improve the overall performance
– Query suggestion is applied for “poor” queries
[Hauff et al., CIKM 2008; Hauff et al., ECIR 2010; Carmel et al., 2010]
115
• First study of performance prediction for temporal
queries
– Propose 10 time-based pre-retrieval predictors
• Both text and time are considered
• Experiment
– Collection: NYT Corpus and 40 temporal queries
• Results
– Time-based predictors outperform keyword-based predictors
– Combined predictors outperform single predictors in most
cases
• Open issue
– Consider time uncertainty
Temporal query performance prediction
[Kanhabua et al., SIGIR 2011]
116
• Problem statement
– Two time aspects: publication time and content time
• Content time = temporal expressions mentioned in documents
– Difference in effectiveness for temporal queries when
ranking using publication time or content time
Time-aware ranking prediction
[Kanhabua et al., SIGIR 2012]
117
• First study of the impact on retrieval
effectiveness of ranking models using two time
aspects
• Three features from analyzing top-k results
– Temporal KL-divergence [Diaz et al., SIGIR 2004]
– Content Clarity [Cronen-Townsend et al., SIGIR 2002]
– Divergence of retrieval scores [Peng et al., ECIR 2010]
Learning to select time-aware ranking
[Kanhabua et al., SIGIR 2012]
118
• Measure the difference between the distribution
over time of top-k retrieved documents of q and the
collection
– Consider both time dimensions
Temporal KL-divergence
[Diaz et al., SIGIR 2004]
119
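The temporal KL-divergence feature can be sketched over yearly histograms. The additive smoothing constant is an assumption to keep the divergence finite when a year is missing from one side.

```python
# Temporal KL-divergence: compare the distribution over time of a query's
# top-k results against the collection's overall distribution. A temporally
# focused query yields a high divergence; a temporally spread one does not.
import math

def temporal_kl(topk_years, collection_years, smoothing=1e-6):
    """Both arguments: {year: count}. Returns KL(top-k || collection)."""
    years = set(topk_years) | set(collection_years)
    q_total = sum(topk_years.values()) + smoothing * len(years)
    c_total = sum(collection_years.values()) + smoothing * len(years)
    kl = 0.0
    for y in years:
        p = (topk_years.get(y, 0) + smoothing) / q_total
        q = (collection_years.get(y, 0) + smoothing) / c_total
        kl += p * math.log(p / q)
    return kl

collection = {y: 100 for y in range(2000, 2008)}   # uniform over 8 years
focused = {2004: 8, 2005: 2}                       # top-k clusters in 2004-2005
spread = {y: 1 for y in range(2000, 2008)}         # top-k mirrors the collection
print(temporal_kl(focused, collection) > temporal_kl(spread, collection))  # → True
```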
• The content clarity is measured by the Kullback-
Leibler (KL) divergence between the distribution of
terms of retrieved documents and the background
collection
Content Clarity
[Cronen-Townsend et al., SIGIR 2002]
120
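The content-clarity feature follows the same KL-divergence pattern, but over term distributions rather than time. The simple maximum-likelihood estimates below are an assumption; the original clarity score weights documents by their query likelihood.

```python
# Content clarity: KL divergence between the term distribution of the
# retrieved documents and the background collection. A high score means the
# result set uses distinctive vocabulary, i.e., the query is unambiguous.
from collections import Counter
import math

def clarity(retrieved_docs, coll_docs):
    """Each argument: list of token lists. Higher score = clearer query."""
    ret = Counter(w for doc in retrieved_docs for w in doc)
    col = Counter(w for doc in coll_docs for w in doc)
    ret_total, col_total = sum(ret.values()), sum(col.values())
    score = 0.0
    for w, c in ret.items():
        p = c / ret_total
        q = col.get(w, 0) / col_total
        if q > 0:
            score += p * math.log2(p / q)
    return score

coll_docs = [["tsunami", "thailand"], ["football", "germany"], ["election", "usa"]]
focused_docs = [["tsunami", "thailand"], ["tsunami", "thailand"]]
mixed_docs = [["tsunami", "germany"], ["election", "thailand"]]
print(clarity(focused_docs, coll_docs) > clarity(mixed_docs, coll_docs))  # → True
```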
• Measure the divergence of scores from the base
ranking, e.g., a non time-aware ranking model
– To determine the extent that a ranking model alters the
scores of the initial ranking
• Features
1. averaged scores of the base ranking
2. averaged scores of PT-Rank
3. averaged scores of CT-Rank
4. divergence from the base ranking model
Divergence of ranking scores
[Peng et al., ECIR 2010]
121
Discussion
• Results
– A small number of top-k documents achieves
better performance
– The larger k is, the more irrelevant
documents are introduced into the analysis
• Open issue
– When compared with the optimal case, there
is still room for further improvement
[Kanhabua et al., SIGIR 2012]
122
References
• [Berberich et al., WebDB 2009] Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio, Gerhard
Weikum: Bridging the Terminology Gap in Web Archive Search. WebDB 2009
• [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard
Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
• [Campos et al., TempWeb 2012] Ricardo Campos, Gaël Dias, Alípio Jorge, Celia Nunes:
Enriching temporal query understanding through date identification: How to tag implicit temporal
queries? TWAW 2012: 41-48
• [Carmel et al., 2010] David Carmel, Elad Yom-Tov: Estimating the Query Difficulty for
Information Retrieval Morgan & Claypool Publishers 2010
• [Cronen-Townsend et al., SIGIR 2002] Stephen Cronen-Townsend, Yun Zhou, W. Bruce Croft:
Predicting query performance. SIGIR 2002: 299-306
• [Diaz et al., SIGIR 2004] Fernando Diaz, Rosie Jones: Using temporal profiles of queries for
precision prediction. SIGIR 2004: 18-24
• [Hauff et al., CIKM 2008] Claudia Hauff, Vanessa Murdock, Ricardo A. Baeza-Yates: Improved
query difficulty prediction for the web. CIKM 2008: 439-448
• [Hauff et al., ECIR 2010] Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, Franciska de Jong:
Query Performance Prediction: Evaluation Contrasted with Effectiveness. ECIR 2010: 204-216
• [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J.
Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for
query translation in text retrieval with association rules. CIKM 2010: 1789-1792
123
References (cont’)
• [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms
in searching document archives. JCDL 2010: 79-88
• [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for
Re-ranking Search Results. ECDL 2010: 261-272
• [Kanhabua et al., SIGIR 2011] Nattiya Kanhabua, Kjetil Nørvåg: Time-based query performance
predictors. SIGIR 2011: 1181-1182
• [Kanhabua et al., SIGIR 2012] Nattiya Kanhabua, Klaus Berberich, Kjetil Nørvåg: Learning to
select a time-aware retrieval model. (To appear) SIGIR 2012
• [Li et al., CIKM 2003] Xiaoyan Li, W. Bruce Croft: Time-based language models. CIKM 2003:
469-475
• [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang:
Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701
• [Mazeika et al., CIKM 2011] Arturas Mazeika, Tomasz Tylenda, Gerhard Weikum: Entity
timelines: visual analytics and named entity evolution. CIKM 2011: 2585-2588
• [Nunes et al., ECIR 2008] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Use of Temporal
Expressions in Web Search. ECIR 2008: 580-584
• [Peng et al., ECIR 2010] Jie Peng, Craig Macdonald, Iadh Ounis: Learning to Select a Ranking
Function. ECIR 2010: 114-126
• [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang,
Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139
124
Question?
125
Evolution of Web Search
Results
(1) Short-term impacts: Caching Results
(2) Longitudinal analysis of search results
RuSSIR 2012 126
Architecture of a Search Engine
• How do search results change?
1. Crawl the Web
2. Index the content
3. Go to line 1
Crawler
Document
Collection
Indexer
Indexes
Query
processor
AnswersWWW
127
Architecture of a Search Engine
• How do search results change?
1. Crawl the Web
2. Index the content
3. Go to line 1
Crawler
Document
Collection
Indexer
Indexes
Query
processor
AnswersWWW
on-line
128
Crawling the Dynamic Web
129
Indexing the Dynamic Web
• How to refresh the index?
– Batch update (re-build)
– Re-merge
– In-place
• Batch update
– Shadowing
– Simplest, the old index can keep serving at
high rates
[Lester et al., IPM 2006]
130
Indexing the Dynamic Web
• Re-merge
– A “buffer” B of new index entries
– Compute queries over index I and B and
merge results
– Merge B with I when size(B) > threshold
– Optimizations: logarithmic and geometric
merging
[Lester et al., IPM 2006]
131
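The re-merge strategy above can be sketched as follows; the dictionary-of-lists index, the posting-count threshold, and the class name are simplifying assumptions for illustration, not the implementation of [Lester et al., IPM 2006].

```python
class BufferedIndex:
    """Re-merge sketch: new postings go to an in-memory buffer B; queries run
    over both the main index I and B; when B grows past a threshold, it is
    merged into I. Both indexes are simple term -> sorted doc-id list maps."""

    def __init__(self, threshold):
        self.I, self.B, self.threshold = {}, {}, threshold

    def add(self, term, doc_id):
        self.B.setdefault(term, []).append(doc_id)
        if sum(len(v) for v in self.B.values()) > self.threshold:
            self._merge()

    def _merge(self):
        # Merge buffer B into the main index I, then clear B.
        for term, docs in self.B.items():
            self.I[term] = sorted(set(self.I.get(term, []) + docs))
        self.B = {}

    def query(self, term):
        # Results are the union of main-index and buffer postings.
        return sorted(set(self.I.get(term, []) + self.B.get(term, [])))

idx = BufferedIndex(threshold=2)
idx.add("web", 1)
idx.add("web", 3)               # buffer still within threshold
assert idx.query("web") == [1, 3]
idx.add("web", 2)               # exceeds threshold -> merge into I
assert idx.I.get("web") == [1, 2, 3]
```

The logarithmic and geometric optimizations mentioned on the slide generalize this single merge into a hierarchy of partial indexes that are merged less and less often.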
Indexing the Dynamic Web
• In-place
– Over-allocation: Leave free space at the end
of each list
– Add new entries to the free space; o.w.,
relocate to a new position on disk
[Lester et al., IPM 2006]
132
Query Processing
• Is updating the underlying index enough?
If a query is submitted for the first time, it is
processed over an up-to-date index
But... what about the cached results?
133
Why Result Caching?
• Around 50% of a query stream is
composed of repeating queries
[Fagni et al., TOIS 2006]
Query popularity: heavy tail distribution
134
Result Caches
• Result cache: top-k urls with snippets per query
• Results caching helps to reduce
– query response time
– traffic volume hitting the backend
“bin laden dead”
BACKEND
miss (compulsory)
hit
135
Problem: dynamicity!
• Key observations:
– Caches can be very large (thanks to cheap hardware)
– Web changes rapidly (thanks to users)
“bin laden dead”
BACKEND
Minutes, days, or
weeks, based on Q
136
A deeper look: Capacity vs. Hits
• Query log: Yahoo! Search
engine
– Few cache nodes during 9 days
• Cache capacity
– can be very large
– eviction policy is less of a
concern
• Hit rate: fraction of queries
that are hits
• Higher capacity → more hits!
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
137
Capacity vs. age
• Hit age: time
since the last
update
• Higher capacity → higher age!
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
138
Strategies for cache freshness
• Decoupled approaches: cache does not
know what changed
• Coupled approaches: cache uses clues
from the index to predict what changed
[Cambazoglu et al., WWW 2010]
139
Decoupled approaches: Flushing
• Solution: flush periodically
– Coincides with new index
– Bounds average age
– Impacts hit rate negatively
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
140
Decoupled approaches:TTL
• Time-to-live (TTL)
– Assign a fixed TTL value to each cached result
– Pros: Practical, almost no implementation cost
– Cons: Blind strategy, may be sub-optimal
…
qi
Result Cache
q1 R1 TTL(q1)
q2 R2 TTL(q2)
qk Rk TTL(qk)
141
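The fixed-TTL result cache sketched on the slide above can be written in a few lines; the class and parameter names are illustrative assumptions, and a production cache would add eviction and snippet storage.

```python
import time

class ResultCache:
    """Minimal result cache with a fixed TTL per entry (illustrative sketch)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # query -> (results, store_time)

    def get(self, query, now=None):
        now = time.time() if now is None else now
        hit = self.entries.get(query)
        if hit is None:
            return None                      # compulsory miss
        results, stored_at = hit
        if now - stored_at > self.ttl:
            return None                      # TTL expired: treat as miss, re-execute
        return results

    def put(self, query, results, now=None):
        now = time.time() if now is None else now
        self.entries[query] = (results, now)

cache = ResultCache(ttl_seconds=60)
cache.put("bin laden dead", ["url1", "url2"], now=0)
assert cache.get("bin laden dead", now=30) == ["url1", "url2"]  # fresh hit
assert cache.get("bin laden dead", now=100) is None             # TTL expired
```

An expired entry is simply treated as a miss, which is why a fixed TTL bounds the age of served results but hurts hit rate.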
Decoupled approaches:TTL
• TTL per query
– Stale results → acceptable: search engines
do not have a perfect view!
– Stable hit rate; average age is still bounded!
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
142
Intelligent Refreshing
• Enhancement to the TTL mechanism
• Updates entries
– Re-execute queries
– Uses idle cycles
• Ideally
– update: low activity
– use: high activity
• How to select queries for refreshing?
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
143
Intelligent refreshing
Temperature
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
• A new query goes
to young and cold
bucket!
• Fixed “query
interval” for shifting
age
• Lazy update of
temperature
• Refresh hot
and old first! 144
Refresh-rate adjustment
• Idle cycles?
• Critical mechanism
– Prevent overloads
• Feedback from the
query processors
– Latency to process
query
• Track recent latency
– Adjust rate accordingly
Query
Query Processors
Cache
Results,Latency
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
145
Design summary
• Large capacity: millions of entries
• TTL: bounds the age of entries
• Refreshes: updates cache entries
• Refresh rate adjustment: latency feedback
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
146
Cache evaluation
• Simulation
– Yahoo! query log
• Baseline policies
– No refresh
– Cyclic refresh
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
147
Cache evaluation
• Simulation
– Yahoo! query log
• Baseline policies
– No refresh
– Cyclic refresh
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
148
Performance in production
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
Refreshes on
Refreshes off
Refreshes on
Refreshes off
149
Degradation
• Under high load
– Processors degrade results
– Shorter life
• TTL mechanism
– Prioritize for refreshing
– Adjusts TTL when degraded
• Refreshes
– Replace with better results
[Cambazoglu et al., WWW 2010] (Slide provided by the authors)
150
Adaptive TTL
• Up to now we considered fixed TTL values
• Are all queries equal?
– “quantum physics”
– “barcelona FC”
– “ecir 2012”
• Another promising direction is assigning
adaptive TTL values
[Alici et al., ECIR 2012] (Slide provided by the authors)
151
Adaptive TTL
• Up to now we considered fixed TTL values
• Are all queries equal?
– “quantum physics”
– “barcelona FC”
– “ecir 2012”
• Another promising direction is assigning
adaptive TTL values
One size does not fit all!!!
[Alici et al., ECIR 2012] (Slide provided by the authors)
152
Adaptive I: Average TTL
• Observe the past update frequency of the
top-k results of a query and compute the
average
• Simple, but
– Needs history
– May not capture bursty update periods
PAST: update history (gaps between updates): 5 1 6 4
NOW: q submitted again!
TTL = (5 + 1 + 6 + 4) / 4 = 4
[Alici et al., ECIR 2012] (Slide provided by the authors)
153
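The averaging rule on this slide is a one-liner; the function name and the use of day-granularity gaps are illustrative assumptions.

```python
def average_ttl(update_gaps):
    # Average TTL from the past gaps between result updates,
    # e.g., [5, 1, 6, 4] days on the slide -> TTL of 4 days.
    return sum(update_gaps) / len(update_gaps)

assert average_ttl([5, 1, 6, 4]) == 4.0
```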
Adaptive II: Incremental TTL
• Adjust the new TTL value based on the
current TTL value
• Each time a cached result Rcached with an
expired TTLcurr is requested:
compute Rnew
compute Rnew
if Rnew ≠ Rcached // STALE
TTLnew  TTLmin // catch bursty updates!
else // FRESH
TTLnew  increase TTLcurr (bounded by a TTLmax)
154
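The incremental rule can be sketched as below. The slide only specifies the stale branch (reset to TTLmin); the growth rule for the fresh branch here, doubling capped at a maximum, is an assumption chosen for illustration, not necessarily the rule used in [Alici et al., ECIR 2012].

```python
def incremental_ttl(ttl_curr, result_changed, ttl_min=1, ttl_max=32, factor=2):
    # On expiry: shrink TTL to ttl_min if the result changed (stale),
    # otherwise grow it. The growth rule (doubling capped at ttl_max)
    # is an assumption; the slide leaves the fresh branch unspecified.
    if result_changed:
        return ttl_min
    return min(ttl_curr * factor, ttl_max)

assert incremental_ttl(8, result_changed=True) == 1    # stale: catch bursty updates
assert incremental_ttl(8, result_changed=False) == 16  # fresh: back off
assert incremental_ttl(20, result_changed=False) == 32 # capped at ttl_max
```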
Adaptive III: Machine-learned
• Build an ML model
– Query and result-specific feature set F
– Assume a query result changes at time point ti
• Create a training instance with features Fi
• Set target feature (TTLnew) as the time period from ti
to the time point tj where query result changes
again: tj-ti
5 1 6 4
<F1, 5> <F2, 1> <Fn, ?>
t1
t2
[Alici et al., ECIR 2012] (Slide provided by the authors)
155
Features for ML
[Alici et al., ECIR 2012] (Slide provided by the authors)
156
Analysis of Web Search Results
• Setup:
– 4,500 queries sampled from AOL log
– Submitted to Yahoo! API daily from Nov 2010
to April 2011 (120 days)
– For days di and di+1
If Ri ≠ Ri+1 (consider top-10 urls and their order), we
consider the result as updated!
[Alici et al., ECIR 2012] (Slide provided by the authors)
157
Distribution of result updates
• On average, query results are updated
every 2 days!
[Alici et al., ECIR 2012] (Slide provided by the authors)
158
Effect of query frequency
• More frequent → results updated more frequently
• Less frequent → scattered
[Alici et al., ECIR 2012] (Slide provided by the authors)
159
Effect of query length
• No correlation between query length and
update frequency
[Alici et al., ECIR 2012] (Slide provided by the authors)
160
Simulation
• Assume all 4,500 queries are submitted
daily for 120 days
– On day-0, all results are cached
– On the following days, whenever the TTL
expires, the new result is computed and
replaces old ones (i.e., result for that day from
Yahoo! API)
[Alici et al., ECIR 2012] (Slide provided by the authors)
161
Simulation setup
• Evaluation metrics [Blanco 2010]
– At the end of each day, we compute:
False Positive Ratio =
Redundant query executions
Number of unique queries
Stale Traffic Ratio =
Stale results returned
Number of query occurrences
[Alici et al., ECIR 2012] (Slide provided by the authors)
162
Simulation setup
• While computing ST ratio for a given query
at day i:
Strict policy:
– if Rcached is not strictly equal to Rday-i
staleCount++
Relaxed policy:
– if Rcached is not strictly equal to Rday-i
staleness += 1 – JaccardSim(Rcached, Rday-i)
[Alici et al., ECIR 2012] (Slide provided by the authors)
163
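The strict and relaxed staleness counters above can be sketched as follows; the function names are illustrative assumptions, and Jaccard similarity is computed over the sets of top-10 URLs as on the slide.

```python
def jaccard(a, b):
    # Jaccard similarity of two result lists, treated as URL sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def staleness_increment(cached, fresh, relaxed):
    # Strict policy: any difference counts as fully stale (+1).
    # Relaxed policy: count only the dissimilar fraction (1 - Jaccard).
    if cached == fresh:
        return 0.0
    return (1.0 - jaccard(cached, fresh)) if relaxed else 1.0

cached = ["u1", "u2", "u3", "u4"]
fresh  = ["u1", "u2", "u3", "u5"]
assert staleness_increment(cached, fresh, relaxed=False) == 1.0
assert abs(staleness_increment(cached, fresh, relaxed=True) - 0.4) < 1e-9
```

The relaxed policy credits partially overlapping results, so a single swapped URL in the top 10 contributes much less stale traffic than a completely changed result page.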
Performance: Incremental TTL
Strict policy Relaxed policy
[Alici et al., ECIR 2012] (Slide provided by the authors)
164
Less-often updated queries More-often updated queries
Performance: Incremental TTL
[Alici et al., ECIR 2012] (Slide provided by the authors)
165
Performance: Average and Machine
Learned TTL
Strict policy Relaxed policy
[Alici et al., ECIR 2012] (Slide provided by the authors)
166
Coupled approaches: CIP
• Cache invalidation policy (CIP)
• Key idea: Compare all
added/updated/deleted documents to
cached queries
– Incremental index update → updates reflected
on-line!
[Blanco et al., SIGIR 2010]
all queries in cache(s) all changes in the backend index
CIP module
167
Coupled approaches: CIP
Incremental index update: updates reflected on-line!
Content changes sent to CIP
module to invalidate queries (offline)
[Blanco et al., SIGIR 2010]
168
Synopses
• Synopsis: a vector of a document's top-
scoring TF-IDF terms
• η: length of synopsis
• δ: modification threshold → consider a document d
as updated only if diff(d, dold) > δ
[Blanco et al., SIGIR 2010]
169
Invalidation Approaches
• Invalidation logic (for added docs)
[Blanco et al., SIGIR 2010]
Basic: If the synopsis matches the query
(i.e., it contains all query terms)
Scored: Compare Sim(synopsis, query)
to the score of the k-th result
Example
170
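The Basic rule above can be sketched as a subset test; the function name and example terms are illustrative assumptions, and the Scored variant would additionally compare a similarity score against the k-th result's score.

```python
def basic_invalidate(query_terms, synopsis_terms):
    # Basic CIP rule (sketch): invalidate a cached query if the new/updated
    # document's synopsis contains every query term.
    return set(query_terms) <= set(synopsis_terms)

synopsis = ["euro", "2012", "football", "final", "spain"]
assert basic_invalidate(["euro", "final"], synopsis)       # all terms present
assert not basic_invalidate(["euro", "ticket"], synopsis)  # "ticket" missing
```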
Invalidation Approaches
• Invalidation logic (for deleted docs):
– Invalidate all query results including a deleted
document
• Update is a deletion followed by an
addition
• Further apply TTL to guarantee an age-
bound
• Scored + TTL
[Blanco et al., SIGIR 2010]
171
Evaluation
• Really hard to evaluate if you are not
sitting at the production department in a
real search engine company!
172
Simulation Setup
• Data: English wikipedia dump
– snapshot at Jan 1, 2006 ≈ 1 million pages
– All add/deletes/updates for following 30 days
• Queries: 10,000 from AOL log
[Blanco et al., SIGIR 2010; Alici et al. SIGIR 2011]
173
Simulation setup
• Evaluation metric
– The query result is updated if two top-10 lists
are not exactly the same
False Positive Ratio =
Redundant query executions
Number of unique queries
Stale Traffic Ratio =
Stale results returned
Number of query occurrences
[Blanco et al., SIGIR 2010]
174
Performance
• Setup
– Score + TTL
– Full synopsis
• Smaller δ or TTL:
– Lower ST, higher FP
[Blanco et al., SIGIR 2010]
(Plot legend: TTL with 1≤t≤5, curves for t=1..5;
synopsis settings: δ=0 with 2≤t≤10; δ=0.005 with 3≤t≤10; δ=0.01 with 5≤t≤20)
175
TIF: Timestamp-based Invalidation
Framework
• Devise a new invalidation mechanism
– better than TTL and close to CIP in detecting
stale results
– better than CIP and close to TTL in efficiency
and practicality
[Alici et al., SIGIR 2011] (Slide provided by the authors)
176
Timestamp-based Invalidation
• The value of the TS on an item shows the
last time the item was updated
• TIF has two components:
– Offline (indexing time) : Decide on term and
document timestamps
– Online (query time): Decide on the staleness of the
query result
[Alici et al., SIGIR 2011] (Slide provided by the authors)
177
TIF Architecture
…
Doc.
parser
documents assigned
to the node
SEARCH
NODE
index
updates
…
t1
t2
tT
[Alici et al., SIGIR 2011] (Slide provided by the authors)
178
TIF Architecture
…
Doc.
parser
documents assigned
to the node
SEARCH
NODE
index
updates
…
t1
t2
tT
…
Document timestamps
document TS
updates
TS(d1) TS(d2) TS(dD)
Document timestamps
[Alici et al., SIGIR 2011] (Slide provided by the authors)
179
TIF Architecture
…
Doc.
parser
documents assigned
to the node
SEARCH
NODE
index
updates
…
Document timestamps
document TS
updates
TS(d1) TS(d2) TS(dD)
Document timestamps
[Alici et al., SIGIR 2011] (Slide provided by the authors)
180
TIF Architecture
…
Doc.
parser
documents assigned
to the node
SEARCH
NODE
index
updates
term TS updates
…
TS(t1)
TS(t2)
TS(tT)
…
Document timestamps
document TS
updates
TS(d1) TS(d2) TS(dD)
Document timestamps
[Alici et al., SIGIR 2011] (Slide provided by the authors)
181
TIF Architecture
…
Doc.
parser
documents assigned
to the node
SEARCH
NODE
index
updates
qi
results
Result cache
0/1
Invalidation
logic
miss/stale
qi, Ri, TS(qi)
…
q1 R1 TS(q1)
q2 R2 TS(q2)
qk Rk TS(qk)
term TS updates
…
TS(t1)
TS(t2)
TS(tT)
…
Document timestamps
document TS
updates
TS(d1) TS(d2) TS(dD)
Document timestamps
[Alici et al., SIGIR 2011] (Slide provided by the authors)
182
TS Update Policies: Documents
• For a newly added document d
– TS(d) = now()
• For a deleted document d
– TS(d) = infinite
• For an updated document d
– if diff(dnew, dold) > L
TS(d) = now()
– diff(di, dj): |length(di) – length(dj)|
[Alici et al., SIGIR 2011] (Slide provided by the authors)
183
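The document timestamp policy above can be sketched as follows; the function signature and threshold name L follow the slide, while the event strings are illustrative assumptions.

```python
INF = float("inf")

def update_doc_ts(event, now, old_ts, len_new=None, len_old=None, L=50):
    # Document TS policy (sketch): new docs and deletions always get a fresh
    # timestamp; updates only when the length difference exceeds threshold L,
    # i.e., diff(d_new, d_old) = |length(d_new) - length(d_old)| > L.
    if event == "add":
        return now
    if event == "delete":
        return INF
    if event == "update" and abs(len_new - len_old) > L:
        return now
    return old_ts  # minor edit: keep the old timestamp

assert update_doc_ts("add", now=10, old_ts=None) == 10
assert update_doc_ts("delete", now=10, old_ts=3) == INF
assert update_doc_ts("update", now=10, old_ts=3, len_new=200, len_old=100) == 10
assert update_doc_ts("update", now=10, old_ts=3, len_new=120, len_old=100) == 3
```

Setting a deleted document's timestamp to infinity guarantees that condition C1 on the invalidation slide fires for every cached result containing it.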
TS Update Policies: Terms
• Frequency-based update
– Initially: TS(t) = T0, PLL_TS = 5 (posting-list length when TS was last set)
– When the number of added postings > F × PLL_TS:
TS(t) = now(), PLL_TS = 6
[Alici et al., SIGIR 2011] (Slide provided by the authors)
184
TS Update Policies: Terms
• Score-based update
– Posting list of t: p1 p2 p3 p4 p5, sorted w.r.t. the scoring function
– Initially: TS(t) = T0, S_TS = Score(p3)
– When the score of an added posting (e.g., p6) > S_TS:
TS(t) = now(); re-sort and recompute S_TS
[Alici et al., SIGIR 2011] (Slide provided by the authors)
185
Result Invalidation Policy
• A search node decides a result stale if:
– C1: ∃d ∈ R, s.t. TS(d) > TS(q)
(d is deleted or revised after the generation of query
result)
or,
– C2: ∀t ∈ q: TS(t) > TS(q)
(all query terms appeared in new documents
after the generation of query result)
• Also apply TTL to avoid stale accumulation
[Alici et al., SIGIR 2011] (Slide provided by the authors)
186
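Conditions C1 and C2 translate directly into code; the timestamp maps and default value 0 for never-updated terms are illustrative assumptions.

```python
def is_stale(query_terms, result_docs, ts_query, ts_doc, ts_term):
    # TIF staleness test (sketch).
    # C1: some result document was deleted or revised after the cached
    #     result was generated.
    c1 = any(ts_doc.get(d, 0) > ts_query for d in result_docs)
    # C2: every query term's posting list gained new material after that point.
    c2 = all(ts_term.get(t, 0) > ts_query for t in query_terms)
    return c1 or c2

ts_doc = {"d1": 10, "d2": 3}
ts_term = {"euro": 2, "2012": 12}
assert is_stale(["euro", "2012"], ["d1", "d2"], 5, ts_doc, ts_term)   # C1: d1 revised
assert not is_stale(["euro", "2012"], ["d2"], 5, ts_doc, ts_term)     # neither holds
```

In the full system a TTL is still applied on top of this test to avoid accumulating stale entries that neither condition catches.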
Simulation Setup
• Data: English wikipedia dump
– snapshot at Jan 1, 2006 ≈ 1 million pages
– All add/deletes/updates for following 30 days
• Queries: 10,000 from AOL log
[Alici et al., SIGIR 2011]
187
Simulation setup
• Evaluation metrics [Blanco 2010]
– The query result is updated if two top-10 lists
are not exactly the same
False Positive Ratio =
Redundant query executions
Number of unique queries
Stale Traffic Ratio =
Stale results returned
Number of query occurrences
[Alici et al., SIGIR 2011] (Slide provided by the authors)
188
Performance: all queries
Frequency-based term TS update Score-based term TS update
[Alici et al., SIGIR 2011] (Slide provided by the authors)
189
Performance: single-term queries
Frequency-based term TS update Score-based term TS update
[Alici et al., SIGIR 2011] (Slide provided by the authors)
190
Invalidation Cost
Send <q, R, TS(q)> to
the search nodes
Send all <q, R> to CIP
Send all docs to CIP
TIF CIP
Data
transfer
Invalidation
operations Compare TS values
Traverse the query index
for every document
[Alici et al., SIGIR 2011] (Slide provided by the authors)
191
Performance: TIF
• A simple yet effective invalidation approach
• Predicting stale queries
– Better than TTL, close to CIP
• Efficiency and practicality
– Straightforward in a distributed system
192
Grand Summary
Strategy | Type | Metrics | Data
• Fixed TTL (with refreshing) | Decoupled | Hit rate, hit age | Yahoo! Web sample
• Adaptive TTL | Decoupled | ST ratio, FP ratio | AOL
• TIF | Coupled | ST ratio, FP ratio | Wikipedia + AOL
• CIP | Coupled | ST ratio, FP ratio | Wikipedia + AOL
Moving from fixed TTL toward CIP, ST and FP ratios decrease;
moving in the opposite direction, invalidation cost decreases.
193
Open Research Directions
Investigate:
• User satisfaction vs. freshness
• Complex ranking functions
• Alternative index update strategies
• Combinations
• Adaptive TTL + TIF or CIP
• Adaptive TTL + refresh strategy
194
Evolution of Web Search Results
• Earlier works consider the changes in the
content of the
–Web
–queries
• How do real life search engines react to this
dynamicity?
• Compare results from Yahoo! API
–for 630K real life queries (from AOL log)
–obtained in 2007 and 2010
[Altingovde et al., SIGIR 2011] (Slide provided by the authors)
195
What is Novel?
• Queries are real, not synthetic
• Query set is large
• Results from a search engine at two very
distant points in time
• Focus on the properties of results, but not
the underlying content
[Altingovde et al., SIGIR 2011] (Slide provided by the authors)
196
We Seek to Answer:
• How is the growth of the Web reflected in top-
ranked query results?
• Do the query results totally change within
time?
• Are results located deeper in sites?
• Is there any change in result title and
snippet properties?
[Altingovde et al., SIGIR 2011] (Slide provided by the authors)
197
Avg no. of results
• Almost tripled (from 16M to 52M), but not
uniformly
[Altingovde et al., SIGIR 2011] (Slide provided by the authors)
198
No. of unique URLs
• 20% of the URLs returned at the highest rank in
2010 were at the same position in 2007!
[Altingovde et al., SIGIR 2011] (Slide provided by the authors)
199
No. of unique domains
• The increase in unique domain names in 2010 is
more pronounced than the increase in the
number of unique URLs (diversity? coverage?)
• Even higher overlap for top-1 domains
[Altingovde et al., SIGIR 2011] (Slide provided by the authors)
200
Research Directions
• How are the results diversified at
these two different time points?
–Can we deduce these from snippets?
• How does the level of bias change in
query results?
[Altingovde et al., SIGIR 2011] (Slide provided by the authors)
201
References
• [Alici et al., SIGIR 2011] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan,
Berkant Barla Cambazoglu, Özgür Ulusoy: Timestamp-based result cache
invalidation for web search engines. SIGIR 2011: 973-982
• [Alici et al., ECIR 2012] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant
Barla Cambazoglu, Özgür Ulusoy: Adaptive Time-to-Live Strategies for Query Result
Caching in Web Search Engines. ECIR 2012: 401-412
• [Altingovde et al., SIGIR 2011] Ismail Sengör Altingövde, Rifat Ozcan, Özgür
Ulusoy: Evolution of web search results within years. SIGIR 2011: 1237-1238
• [Blanco et al., SIGIR 2010] Roi Blanco, Edward Bortnikov, Flavio Junqueira, Ronny
Lempel, Luca Telloli, Hugo Zaragoza: Caching search engine results over
incremental indices. SIGIR 2010: 82-89
• [Cambazoglu et al., WWW 2010] Berkant Barla Cambazoglu, Flavio Paiva
Junqueira, Vassilis Plachouras, Scott A. Banachowski, Baoqiu Cui, Swee Lim, Bill
Bridge: A refreshing perspective of search engine caching. WWW 2010: 181-190
• [Fagni et al., TOIS 2006] Tiziano Fagni, Raffaele Perego, Fabrizio Silvestri,
Salvatore Orlando: Boosting the performance of Web search engines: Caching and
prefetching query results by exploiting historical usage data. ACM Trans. Inf. Syst.
24(1): 51-78 (2006)
• [Lester et al., IPM 2006] Nicholas Lester, Justin Zobel, Hugh E. Williams: Efficient
online index maintenance for contiguous inverted lists. Inf. Process. Manage. 42(4):
916-933 (2006)
202
Thank you!
Questions???
203
Indexing the Past
(1) Indexing and searching methods for
versioned document collections
204
Time Travel?
• Time travel is easy (in IR)! You only need:
– The data
– The mechanisms to search
• indexing + query processing
– (And fasten your seat belts!)
205
Why Search the Past?
• Historical information needs (an analyst
working on the social reactions on the net
after 9/11)
• To find relevant resources that do not exist
anymore
• To discover trends, opinions, etc. for a
certain time period (what did people think
about UFOs at the beginning of the
millennium?)
206
Data: Preserving the Web
• Non-profit organizations:
– Internet Archive, European Archive, ...
• Several EU projects
– Planets, PROTAGE, PrestoPRIME, Scape,
ENSURE, BlogForever, LiWA, LAWA,
ARCOMEM...
http://archive.org/
207
Data
• There are also other “versioned” data
collections
– Wikis (Wikipedia)
– Repositories (code, document, organization
intranets)
– RSS feeds, news, blogs etc. with continuous
updates
– Personal desktops
Collections including multiple versions of
each document with a different timestamp
208
Indexing
• Various earlier approaches:
– [Anick et al., SIGIR 1992 ], [Herscovici et al, ECIR 2007]
• Focus on recent work from two different
perspectives
– Indexing schemes that are more concentrated
on the partitioning-related issues
• [Berberich et al., SIGIR 2007; Anand et al. CIKM 2010; Anand et al.
SIGIR 2011]
– Indexing schemes that are more concentrated
on the index size
• [He et al., CIKM 2009; He et al., CIKM 2010; He et al., SIGIR 2011]
209
Time-travel Queries
• “Queries that combine the content and temporal
predicates” [Berberich et al., SIGIR 2007]
• Interesting query types
– Point-in-time
• “euro 2012 articles” @ 1/July/2012
– Interval
• “euro 2012 articles” between 01.06-01.07 2012
– Durability in a time interval [Hou U et al., SIGMOD 2010]:
• “search engine research papers” that are in top-10 results for
75% of the time between years 2000 and 2012
210
Time-point Queries: Indexing
• Formal model:
– For a document d, each version has
begin/end timestamps (validity interval):
• For the current version, de = ∞
• For d, all validity times are disjoint
t0
t1 t2
t3 ∞
d0
: [t0, t1) d3
: [t3, ∞)
[Berberich et al., SIGIR 2007]
211
Indexing
• Key idea
– Keep “validity intervals” within posting lists
v: <d, tf> → <d, tf, db, de>
t0
t1 t2
t3 ∞
tf(v)=1 tf(v)=0 tf(v)=2 tf(v)=2
v → <d, 1, t0, t1> <d, 2, t2, t3> <d, 2, t3, ∞>
• Problem: Index size explodes!
– Even for Wikipedia, 9×10⁹ entries
[Berberich et al., SIGIR 2007]
212
Remedy: Coalescing
• Combine adjacent postings with similar
pay-loads
v <d, 1, t0, t1> <d, 2, t2, t3> <d, 2, t3, ∞>
v <d, 1, t0, t1> <d, 2, t2, ∞>
[Berberich et al., SIGIR 2007]
213
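The coalescing step on this slide can be sketched as a single pass over a sorted posting list; merging only exact tf matches over touching intervals is a simplification, since ATC also merges postings with merely similar payloads.

```python
def coalesce(postings):
    # Merge temporally adjacent postings of one document that carry the same
    # payload. Here: exact tf match and touching validity intervals; the
    # approximate variant (ATC) would also merge similar tf values.
    # Postings: (doc, tf, begin, end), sorted by begin time.
    merged = []
    for doc, tf, b, e in postings:
        if merged and merged[-1][0] == doc and merged[-1][1] == tf and merged[-1][3] == b:
            d, t, b0, _ = merged[-1]
            merged[-1] = (d, t, b0, e)   # extend the previous posting's validity
        else:
            merged.append((doc, tf, b, e))
    return merged

INF = float("inf")
# The slide's example: the two tf=2 postings merge into one.
plist = [("d", 1, 0, 1), ("d", 2, 2, 3), ("d", 2, 3, INF)]
assert coalesce(plist) == [("d", 1, 0, 1), ("d", 2, 2, INF)]
```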
Optimization Problem
[Berberich et al., SIGIR 2007]
214
ATC
• Linear-time approximate algorithm
[Berberich et al., SIGIR 2007]
215
It works!
[Berberich et al., SIGIR 2007]
216
Temporal Partitioning (Slicing)
• Even after coalescing, wasted I/O for
redundant postings that do not overlap
with the query's time point
v, [t0, t1) <d, 1, t0, t1>
<d, 2, t2, ∞>
Remark: Validity intervals still reside in the postings!
v, [t2, ∞)
[Berberich et al., SIGIR 2007]
217
Trade-off
• Optimal approaches
– Space-optimal
– Performance-optimal
• Trading off space and performance:
– Performance-Guarantee approach
– Space-Bound approach
[Berberich et al., SIGIR 2007]
218
Computational Complexity
• Performance-Guarantee Approach
– DP: time and space complexity O(n²)
• Space-Bound Approach
– DP solution is prohibitively expensive
– Approximate solution using simulated
annealing
• Time O(n²), space O(n)
• Note: n is the number of all unique time-interval
boundaries for a given term's posting list
[Berberich et al., SIGIR 2007]
219
Time-point Queries: Result
• Given a space-bound of 3 (i.e., 3x of the
optimal space), close-to-optimal
performance is achievable!
– (Reminder: Partitioning is not very cheap!)
[Berberich et al., SIGIR 2007]
220
Time-point Queries: Result
• Time-point queries can be handled like
this:
• Example
221
Time-interval Queries?
• When more than one partition must be
accessed, wasted I/O due to repeated
postings! (e.g., 3x more postings in SB!)
• Example
[Anand et al., CIKM 2010]
222
Solution Approaches
Partition selection [Anand et al., CIKM 2010]
Can we avoid accessing all partitions related to a given query?
Document partitioning (sharding) [Anand et al., SIGIR 2011]
Can we partition postings in a list in a different way
i.e., other than using time information?
223
Partition Selection
• Problem: Optimize result quality by
accessing a subset of "affected partitions"
without exceeding a given I/O upper bound
Optimization criterion: maximize the fraction of
original query results (relative recall).
Two types of constraints:
a) Size-based: allowed to read up to a fixed
number of postings (focus on sequential access)
b) Equi-cost: allowed to read up to a fixed
number of partitions (focus on random access)
[Anand et al., CIKM 2010]
224
Assumption
• Assume an oracle exists “providing us the
cardinalities of the individual partitions as
well as the result of their intersection/union”
• Later, KMV synopses will be employed as
an approximation
[Anand et al., CIKM 2010]
225
Single-term queries
• Size-based partition selection
• Equi-cost partition selection
• DP solutions exist, but expensive!
• Approximation
– Reduce the problem to budgeted max
coverage
[Anand et al., CIKM 2010]
226
GreedySelect Algorithm
• Cost(P):
– Size-based: no. of postings in the partition
– Equi-cost: 1
• Benefit(P):
– number of unread postings in P
• At each step:
– Select the partition with highest B(P) / C(P)
– Update benefits of the unselected partitions
Recall: we assume an oracle
providing these numbers!
[Anand et al., CIKM 2010]
227
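The GreedySelect loop above can be sketched as a budgeted greedy covering; representing partitions as doc-id sets stands in for the oracle, and stopping at the first partition that no longer fits is a simplification.

```python
def greedy_select(partitions, budget, equi_cost=False):
    # GreedySelect sketch: partitions is {name: set_of_doc_ids}; repeatedly
    # pick the partition with the best benefit/cost ratio until the budget
    # is exhausted. Benefit = number of yet-uncovered postings; cost =
    # partition size (size-based) or 1 (equi-cost). Assumes non-empty
    # partitions; stops at the first pick that would exceed the budget.
    covered, chosen, spent = set(), [], 0
    remaining = dict(partitions)
    while remaining:
        def ratio(name):
            p = remaining[name]
            return len(p - covered) / (1 if equi_cost else len(p))
        best = max(remaining, key=ratio)
        cost = 1 if equi_cost else len(remaining[best])
        if spent + cost > budget or ratio(best) == 0:
            break
        covered |= remaining.pop(best)   # update benefits of the rest implicitly
        chosen.append(best)
        spent += cost
    return chosen, covered

parts = {"P1": {1, 2, 3}, "P2": {3, 4}, "P3": {5}}
chosen, covered = greedy_select(parts, budget=4)
assert covered == {1, 2, 3, 5}   # P2 no longer fits after P1 and P3
```

The set sizes and intersections that this sketch reads directly would, in the real system, come from the oracle or its KMV approximation.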
Example
228
Multi-term Queries
• Conjunctive semantics intersect
partitions of query terms
• Optimization objective: increase the
coverage of postings that are in this
intersection space of partitions
Size-based
Equi-cost
[Anand et al., CIKM 2010]
229
Simple Math does the job!
• Compute “union of intersections”
• x: a tuple with one partition from each term,
to be intersected
• x is a tuple from the Cartesian product of the
partitions of each query term
– X = P_term1 × P_term2 × ... × P_termk, and then
– x = (P_term1,i, P_term2,j, ..., P_termk,l) ∈ X
(choose one partition from each term's partitions)
[Anand et al., CIKM 2010]
230
GreedySelect, again!
• Apply GreedySelect over X to pick x ϵ X
• C(x)
– Sum of the sizes of unselected partitions in x,
or, number of unselected partitions in x
• B(x)
– No of new documents that appear in the
intersection of the partitions in x
• Modify the algorithm to update benefits and
costs after picking each tuple x.
[Anand et al., CIKM 2010]
231
Cartesian Product or Join?
• Problem: Set X might be very large!
• Remedy: Observe that partitions of
different terms that have no temporal
overlap cannot have any intersecting
docs!
Apply the algorithm
over the t-join set!
[Anand et al., CIKM 2010]
232
Performance of Partition Selection
• Relative recall: 50% of affected partitions
might be enough!
– 3 datasets, Wikipedia seems harder!
Wikipedia UKGOV NY Times
[Anand et al., CIKM 2010]
233
Performance of Partition Selection
• Compare query run times for:
– Unpartitioned index,
– No partition selection,
– Partition selection with I/O bound (all index
files are compressed!)
[Anand et al., CIKM 2010]
234
Real Life: KMV Synopses
• Instead of assuming an “oracle” for
cardinality estimation, use KMV synopses
• A KMV synopsis for a multiset S:
– Fix a hash function h
– Apply h to each distinct value in S
– The k smallest hashed values form the
KMV synopsis
Effective sketches for sets supporting arbitrary
multiset operations: ∪, ∩, \
[Anand et al., CIKM 2010]
235
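A KMV synopsis and its basic cardinality estimator can be sketched as below; using SHA-1 as the hash function and the specific k are illustrative assumptions, and the estimator (k-1)/U(k) follows [Beyer et al., SIGMOD 2007].

```python
import hashlib

def kmv_synopsis(values, k):
    # KMV synopsis: hash each distinct value to [0, 1) and keep the
    # k smallest hashed values. SHA-1 stands in for the fixed hash h.
    def h(v):
        return int(hashlib.sha1(str(v).encode()).hexdigest(), 16) / 16**40
    return sorted({h(v) for v in set(values)})[:k]

def estimate_distinct(synopsis, k):
    # Basic estimator: (k - 1) / U(k), where U(k) is the k-th smallest
    # hashed value; intuition: k uniform points span about k/D of [0, 1).
    return (k - 1) / synopsis[k - 1]

syn = kmv_synopsis(range(10000), k=256)
est = estimate_distinct(syn, 256)   # close to 10000 with high probability
```

Because the synopses are just small sorted sets of hash values, the union and intersection cardinalities needed by partition selection can be estimated by combining synopses instead of reading the partitions themselves.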
Partition Selection with KMV
• For each partition of each term:
– create a flat file storing KMV synopses (5%
and 10% samples)
[Anand et al., CIKM 2010]
236
Partition Selection with KMV
• KMV is promising
– 5% is enough!
Wikipedia UKGOV NY Times
[Anand et al., CIKM 2010]
237
Document partitioning (Sharding)
• Another solution to avoid reading and
processing repeated postings with cross-
cutting validity intervals in temporal
partitioning
• Example
[Anand et al., SIGIR 2011]
238
Sharding
• Key idea:
– Instead of partitioning temporally, partition
postings based on doc-ids
– No repetition as in slicing
– Example
[Anand et al., SIGIR 2011]
239
Sharding
• Entries in a shard are ordered according to
their begin time
• For each shard, an auxiliary data structure:
– List of pairs: (query begin times, offset in shard)
– Maintain for practical granularities of begin
times (e.g., days) → can fit in memory!
– For a query with a begin time
• Seek to the offset position in the shard and then read
sequentially
[Anand et al., SIGIR 2011]
240
Why several shards?
• There can still be wasted disk I/O while reading a
shard:
• The problem is postings with long validity intervals
(i.e., subsuming a lot of other postings)
Query begin time
Earliest posting that includes
the query begin time!
This causes reading 5 useless
postings!
[Anand et al., SIGIR 2011]
241
Idealized Sharding
• For a given posting list L, find a minimal set of
shards that satisfy staircase property (SP)
if begin(p) ≤ begin(q) → end(p) ≤ end(q)
• Greedy solution:
– For each posting p in L (in begin-time order)
• Add p to an existing shard s if SP is satisfied
• Otherwise, start a new shard with p
[Anand et al., SIGIR 2011]
242
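The greedy step above, in miniature (a sketch over in-memory tuples; the paper operates on compressed on-disk lists):

```python
def idealized_shards(postings):
    """Greedily split a posting list (sorted by begin time) into shards that
    each satisfy the staircase property: begins and ends both non-decreasing.
    Each posting goes to the first shard whose last end time permits it."""
    shards = []
    for begin, end, doc in sorted(postings):
        for shard in shards:
            if shard[-1][1] <= end:          # staircase preserved
                shard.append((begin, end, doc))
                break
        else:                                # no shard fits: open a new one
            shards.append([(begin, end, doc)])
    return shards
```

A long posting such as (0, 10) forces its own shard, while short, non-overlapping postings pack together.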
Merging Shards
• Idealized sharding can yield several
shards
– random access per shard is expensive!
• A greedy merging algorithm
– Merge two shards if Penalty ≤ CostRan / CostSeq
– Penalty (pairwise): wasted sequential reads
for merging two shards
– Merge in ascending choice order and then
smallest size order
More on this later!
[Anand et al., SIGIR 2011]
243
Performance
• Sharding improves query processing times
w.r.t. temporal partitioning or no partitioning at all
• Merging shards results in shorter QP times
• Time measurements are taken on warm caches!
[Anand et al., SIGIR 2011]
244
A Grand Summary
• A full posting list with validity intervals
– High sequential access + CPU cost
– 1 random access per query term
• Temporal partitioning (slicing)
– Reduce seq access (but, repetitions)
– Reduce CPU cost
– ≥1 random accesses per query term (with partition
selection)
• Document partitioning (sharding)
– Reduce seq access (no repetitions + time map)
– Reduce CPU cost (staircase property)
– ≥1 random accesses per query term (read all
shards?!)
TRADE-OFF: May increase random accesses
for reducing sequential accesses and CPU cost
245
An Alternative Perspective
• Work from the Polytechnic Institute of NYU
• Focus on the index size
Approaches up to now consider each version
of a document separately: no special attention
to the overlap between versions
246
Index Compression
• Key ideas
– Small integers can be represented with
smaller codes
– Doc ids are not so small: instead, compress
the gaps between the ids
– Term frequencies are already small
– Example
247
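The key ideas above, made concrete: gap the sorted doc ids, then give each gap only as many bytes as it needs. Variable-byte coding is used here as a simple stand-in for the codecs the deck actually evaluates (IPC, PForDelta):

```python
def vbyte_encode(n):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last."""
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out[-1] |= 0x80
    return bytes(out)

def vbyte_decode(buf):
    """Inverse of vbyte_encode over a concatenation of coded integers."""
    nums, n, shift = [], 0, 0
    for b in buf:
        n |= (b & 0x7F) << shift
        if b & 0x80:                    # last byte of this integer
            nums.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return nums

def encode_doc_ids(doc_ids):
    """Gap-encode a sorted doc-id list: code the differences, not the ids."""
    prev, buf = 0, b""
    for d in doc_ids:
        buf += vbyte_encode(d - prev)
        prev = d
    return buf
```

The benefit of consecutive ids (slide 250) falls out directly: the random-id list <d1000, d3000, d6000> needs 6 bytes of gaps, the consecutive list <d1, d2, d3> only 3.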
Indexing Versions
• Assume validity-intervals are stored separately
 Time-constraints @ post-processing
• Versions of di are represented as di,j
Timeline of term v in d1: tf(v)=1 on [t0, t1), tf(v)=0 on [t1, t2), tf(v)=2 on [t2, t3), tf(v)=2 on [t3, ∞)
Interval postings: v → <d1, 1, t0, t1> <d1, 2, t2, t3> <d1, 2, t3, ∞>
Per-version postings: v → <d1,1, 1> <d1,2, 2> <d1,3, 2>
Question: How can we reduce the storage space for such a representation?
[He et al., CIKM 2009]
248
Versions Are Similar
• There is a high overlap between versions!
Snapshot@
May 15
Snapshot@
June 15
Snapshot@
July 15
249
How to Exploit?
• Simplest idea:
– Assign consecutive doc IDs to consecutive
versions of the same document
– Allows small d-gaps for overlapping terms
among the versions
d1: http://romip.ru/russir2012/, [May 15, June 15]
d2: http://romip.ru/russir2012/, [June 15, July 15]
d3: http://romip.ru/russir2012/, [July 15, Aug 15]
Random ids:      russir → <d1000, 1> <d3000, 2> <d6000, 2>
Consecutive ids: russir → <d1, 1> <d2, 2> <d3, 2>
[He et al., CIKM 2009]
250
MSA Approach
• Multiple Segment Alignment [Herscovici et al., ECIR’07]
– Given d with some versions, a virtual
document Vi,j is all terms occurring in (only)
versions i through j
– Reduces the number of postings but
increases the document space! (theoretically,
up to N² virtual documents!)
[He et al., CIKM 2009]
251
MSA: Example
[Herscovici et al., ECIR 2007]
252
Indexing the Differences
• DIFF approach [Anick et al., SIGIR’92]
– For every pair of consecutive versions di and
di+1; create a virtual document that is the
symmetric difference between these versions.
[He et al., CIKM 2009]
253
DIFF: Example
• Example:
• QP
254
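A sketch of DIFF over plain term sets (term frequencies and positions are ignored here): index the first version in full, then only the symmetric difference of each consecutive pair. Because symmetric difference is its own inverse, any version can be recovered for query processing by folding the prefix:

```python
def diff_virtual_docs(versions):
    """versions: list of term sets, one per version, in time order.
    Returns the first version plus one symmetric-difference doc per pair."""
    virtuals = [set(versions[0])]
    for prev, cur in zip(versions, versions[1:]):
        virtuals.append(set(prev) ^ set(cur))   # terms added or removed
    return virtuals

def reconstruct(virtuals, i):
    """Recover version i by XOR-folding the virtual documents up to i."""
    terms = set()
    for v in virtuals[:i + 1]:
        terms ^= v
    return terms
```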
Performance
• Two index compression schemes
– Interpolative coding (IPC)
– PForDelta with optimizations (PFD)
• Two datasets: Wiki, Ireland
Index sizes for doc ids:
             Wikipedia   Ireland
Random-IPC   4957        2908
Random-PFD   5499        3289
Query processing times (in-memory): Sorted is the best!
[He et al., CIKM 2009]
255
Two-level Indexing: Roots
• Assume we have clusters of documents:
– C1 = {d1}, C2 = {d2, d3}, C3 = {d4}
• Posting lists:
t1 → <d1, 1> <d2, 2> <d3, 6> <d4, 1>
t2 → <d2, 2> <d4, 2>
• We have queries restricted to clusters:
– {t1, t2} in C3  intersect lists  {d2, d4}  restrict to C3  d4
256
Cluster-skipping index
• Skip redundant postings during QP
– Reduce decompression cost [Altingovde et al., TOIS 2008]
• Skipping is widely used for QP in practice
t1 → C1: <d1, 1> | C2: <d2, 2> <d3, 6> | C3: <d4, 1>
t2 → C2: <d2, 2> | C3: <d4, 2>
[Altingovde et al., TOIS 2008]
257
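The grouping and skipping above can be sketched as follows (an in-memory sketch; the real index stores compressed lists with skip pointers):

```python
def cluster_skip_lists(postings, cluster_of):
    """Group one term's posting list by cluster, preserving doc order."""
    grouped = {}
    for doc, tf in postings:
        grouped.setdefault(cluster_of[doc], []).append((doc, tf))
    return grouped

def restricted(grouped, clusters):
    """Read only the requested clusters' groups, skipping the rest."""
    return [p for c in clusters for p in grouped.get(c, [])]
```

For the query {t1, t2} in C3, only the C3 groups of both lists are read, yielding d4.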
Two-level indexing
• Apply the same idea for versions:
– First level index: document ids
– Second level index: a bitvector for versions
No need for indexing separate ids for each
version  implicit from the bitvector.
[He et al., CIKM 2009]
258
Two-level indexing
• In actual implementation, group blocks of
doc ids and bitvectors of versions
• Bitvectors best compressed by hierarchical
Huffman coding!
russir → <d1, 1> <d2, 2> <d3, 2>
russir → <d> 1 1 1 0   (0/1: first three versions include “russir”)
russir → <d> 1 2 2 0   (TFs: actual term frequencies)
[He et al., CIKM 2009]
259
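The two-level layout in code form (a sketch; the real index compresses the doc ids with IPC and the bitvectors with hierarchical Huffman coding):

```python
def two_level_entry(doc_id, version_tfs):
    """First level: one entry per document. Second level: a bitvector saying
    which versions contain the term, plus the nonzero term frequencies.
    Version ids are implicit in the bit positions."""
    bits = [1 if tf > 0 else 0 for tf in version_tfs]
    return doc_id, bits, [tf for tf in version_tfs if tf > 0]

def matching_versions(bits_a, bits_b):
    """AND two terms' bitvectors for a doc in both lists (conjunctive query)."""
    return [i for i, (a, b) in enumerate(zip(bits_a, bits_b)) if a and b]
```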
Two-level indexing: HUFF
• Query Processing: Decompress bitvectors
of documents that are in the intersection
of query terms
russir <d1> 1 2 2 0 <d2> 1 1 2 0 <d4> 1 0 1 0
schedule <d3> 1 2 2 1 <d4> 1 2 2 0 <d5> 1 3 2 1
Final result: d4,1 and d4,3
[He et al., CIKM 2009]
260
Two level indexing
• Same idea can be applied to MSA and
DIFF
– For virtual documents, create bitvectors as
before
– Compress the first level with IPC, the second level with PFD
• Even better compression performance
[He et al., CIKM 2010]
Index sizes for doc ids:
       Wikipedia   Ireland
HUFF   213         577
261
Two level indexing: QP
• Queries without temporal constraints
[He et al., SIGIR 2011]
262
Two level indexing: QP
• Queries with temporal constraints
– Post-processing
[He et al., SIGIR 2011]
263
Two level indexing: QP
• Queries with temporal constraints
– Partitioning, again!
[He et al., SIGIR 2011]
264
QP Performance
[He et al., SIGIR 2011]
265
Grand Summary
Partition-focused
– Time information kept
in postings
– Redundancy solved
by partitioning
• Process only relevant
partitions
– Lossy compression of
versions (coalescing)
Size-focused
– Time information kept
separately
– Redundancy solved by
2-level compression
• Process compact blocks
• Hierarchical partitions
– Lossless compression
of versions
266
References
• [Altingovde et al., TOIS 2008] Ismail Sengör Altingövde, Engin Demir, Fazli Can, Özgür Ulusoy:
Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Trans.
Inf. Syst. 26(3): (2008)
• [Anand et al., CIKM 2010] Avishek Anand, Srikanta J. Bedathur, Klaus Berberich, Ralf
Schenkel: Efficient temporal keyword search over versioned text. CIKM 2010: 699-708
• [Anand et al., SIGIR 2011] Avishek Anand, Srikanta J. Bedathur, Klaus Berberich, Ralf Schenkel:
Temporal index sharding for space-time efficiency in archive search. SIGIR 2011: 545-554
• [Anick et al., SIGIR 1992] Peter G. Anick, Rex A. Flynn: Versioning a Full-text Information
Retrieval System. SIGIR 1992: 98-111
• [Berberich et al., SIGIR 2007] Klaus Berberich, Srikanta J. Bedathur, Thomas Neumann,
Gerhard Weikum: A time machine for text search. SIGIR 2007: 519-526
• [Beyer et al., SIGMOD 2007] Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis
Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multiset operations.
SIGMOD Conference 2007: 199-210
• [He et al., CIKM 2009] Jinru He, Hao Yan, Torsten Suel: Compact full-text indexing of versioned
document collections. CIKM 2009: 415-424
• [He et al., CIKM 2010] Jinru He, Junyuan Zeng, Torsten Suel: Improved index compression
techniques for versioned document collections. CIKM 2010: 1239-1248
• [He et al., SIGIR 2011] Jinru He, Torsten Suel: Faster temporal range queries over versioned
text. SIGIR 2011: 565-574
• [Herscovici et al, ECIR 2007] Michael Herscovici, Ronny Lempel, Sivan Yogev: Efficient
Indexing of Versioned Document Sequences. ECIR 2007: 76-87
• [Hou U et al., SIGMOD 2010] Leong Hou U, Nikos Mamoulis, Klaus Berberich, Srikanta J.
Bedathur: Durable top-k search in document archives. SIGMOD Conference 2010: 555-566
267
Thank you!
Questions???
268
Retrieval and Ranking Models
(1)Searching the past
(2)Searching the future
RECAP
Two time dimensions
1. Publication or modified time
2. Content or event time
270
RECAP
Two time dimensions
1. Publication or modified time
2. Content or event time
content time
publication time
271
Searching the past
• Historical or temporal information needs
– A journalist working on the historical story of a particular
news article
– A Wikipedia contributor finding relevant information
that has not been written about yet
272
Searching the past
• Historical or temporal information needs
– A journalist working on the historical story of a particular
news article
– A Wikipedia contributor finding relevant information
that has not been written about yet
273
• Time must be explicitly modeled in order to
increase the effectiveness of ranking
– To order search results so that the most relevant
ones come first
• Time uncertainty should be taken into account
– Two temporal expressions can refer to the same time
period even though they are not equally written
– E.g. the query “Independence Day 2011”
• A retrieval model relying on term-matching only will fail to
retrieve documents mentioning “July 4, 2011”
Challenge
274
Query- and document model
• A temporal query consists of:
– Query keywords
– Temporal expressions
• A document consists of:
– Terms, i.e., bag-of-words
– Publication time and temporal expressions
[Berberich et al., ECIR 2010]
275
Query- and document model
• A temporal query consists of:
– Query keywords
– Temporal expressions
• A document consists of:
– Terms, i.e., bag-of-words
– Publication time and temporal expressions
[Berberich et al., ECIR 2010]
276
Time-aware ranking models
• Follow two main approaches
1. Mixture model [Kanhabua et al., ECDL 2010]
• Linearly combining textual and temporal similarity
2. Probabilistic model [Berberich et al., ECIR 2010]
• Generating a query from the textual part and temporal part
of a document independently
277
Mixture model
• Linearly combine textual- and temporal similarity
– α indicates the importance of similarity scores
• Both scores are normalized before combining
– Textual similarity can be determined using any term-
based retrieval model
• E.g., tf.idf or a unigram language model
278
Mixture model
• Linearly combine textual- and temporal similarity
– α indicates the importance of similarity scores
• Both scores are normalized before combining
– Textual similarity can be determined using any term-
based retrieval model
• E.g., tf.idf or a unigram language model
How to determine temporal similarity?
279
Temporal similarity
• Assume that temporal expressions in the query are
generated independently from a two-step
generative model:
– P(tq|td) can be estimated based on publication time
using an exponential decay function [Kanhabua et al.,
ECDL 2010]
– Linear interpolation smoothing is applied to eliminate
zero probabilities
• I.e., an unseen temporal expression tq in d
280
Temporal similarity
• Assume that temporal expressions in the query are
generated independently from a two-step
generative model:
– P(tq|td) can be estimated based on publication time
using an exponential decay function [Kanhabua et al.,
ECDL 2010]
– Linear interpolation smoothing is applied to eliminate
zero probabilities
• I.e., an unseen temporal expression tq in d
(Figure: similarity score decays with time distance; d1, being closer to the query time, scores higher than d2)
281
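A toy rendering of the mixture score with an exponentially decaying temporal similarity. The decay rate and α = 0.5 are illustrative assumptions; in the papers both component scores are normalized before combining:

```python
import math

def temporal_similarity(t_query, t_doc, decay=0.1):
    """Exponential decay in |t_query - t_doc| (decay rate is illustrative)."""
    return math.exp(-decay * abs(t_query - t_doc))

def mixture_score(text_score, t_query, t_doc, alpha=0.5, decay=0.1):
    """alpha weights the (already normalized) textual similarity against the
    temporal similarity; any term-based model can supply text_score."""
    return alpha * text_score + (1 - alpha) * temporal_similarity(t_query, t_doc, decay)
```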
• Five time-aware ranking models
– LMT [Berberich et al., ECIR 2010]
– LMTU [Berberich et al., ECIR 2010]
– TS [Kanhabua et al., ECDL 2010]
– TSU [Kanhabua et al., ECDL 2010]
– FuzzySet [Kalczynski et al., Inf. Process. 2005]
Comparison of time-aware ranking
[Kanhabua et al., SIGIR 2011]
282
• Experiment
– New York Times Annotated Corpus
– 40 temporal queries [Berberich et al., ECIR 2010]
• Result
– TSU outperforms other methods significantly for most
metrics
• Conclusions
– Although TSU achieves the best performance, it only
applies to collections with time metadata
– LMT and LMTU can be applied to any collection without time
metadata, but time extraction is needed
Discussion
[Kanhabua et al., SIGIR 2011]
283
Searching the future
• People are naturally curious about the future
– What will happen to EU economies in next 5 years?
– What will be potential effects of climate changes?
284
Previous work
• Searching the future
– Extract temporal expressions from news articles
– Retrieve future information using a probabilistic
model, i.e., multiplying textual similarity and a time
confidence
• Supporting analysis of future-related
information in news and Web
– Extract future mentions from news snippets obtained
from search engines
– Summarize and aggregate results using clustering
methods, but no ranking
[Baeza-Yates SIGIR Forum 2005; Jatowt et al., JCDL 2009]
285
Recorded Future
[http://www.recordedfuture.com/]
286
Yahoo! Time Explorer
[Matthews et al., HCIR 2010]
287
Ranking news predictions
• Over 32% of 2.5M documents from Yahoo! News
(July’09 – July’10) contain at least one prediction
• Retrieve predictions related to a news story in news
archives and rank by relevance
[Kanhabua et al., SIGIR 2011a]
288
Related news predictions
[Kanhabua et al., SIGIR 2011a]
289
Related news predictions
[Kanhabua et al., SIGIR 2011a]
290
Related news predictions
[Kanhabua et al., SIGIR 2011a]
291
• Four classes of features
– Term similarity, entity-based similarity, topic similarity
and temporal similarity
• Rank results using a learning-to-rank technique
Approach
[Kanhabua et al., SIGIR 2011a]
292
• Step 1: Document annotation.
– Extract temporal expressions using
time and event recognition
– Normalize them to dates so they can
be anchored on a timeline
– Output: sentences annotated with
named entities and dates, i.e.,
predictions
• Step 2: Retrieving predictions.
– Automatically generate a query from
a news article being read
– Retrieve predictions that match the
query
– Rank predictions by relevance (i.e.,
a prediction is “relevant” if it is about
the topics of the article)
System architecture
[Kanhabua et al., SIGIR 2011a]
293
• Capture the term similarity between q and p
1. TF-IDF scoring function
• Problem: keyword matching over short texts
• Predictions may not match the query terms
2. Field-aware ranking function, e.g., BM25F
• Search the context of a prediction, i.e., surrounding
sentences
Term similarity
[Kanhabua et al., SIGIR 2011a]
294
• Measure the similarity between q and p
using annotated entities in dp, p, and q
– Features commonly employed
in entity ranking
Entity-based similarity
[Kanhabua et al., SIGIR 2011a]
295
• Compute the similarity between q and p on topic
– Latent Dirichlet allocation [Blei et al., J. Mach. Learn.
2003] for modeling topics
1. Train a topic model
2. Infer topics
3. Compute topic similarity
Topic similarity
[Kanhabua et al., SIGIR 2011a]
296
• Compute the similarity between q and p on topic
– Latent Dirichlet allocation [Blei et al., J. Mach. Learn.
2003] for modeling topics
1. Train a topic model
2. Infer topics
3. Compute topic similarity
Topic similarity
[Kanhabua et al., SIGIR 2011a]
297
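One plausible way to realize step 3 is a cosine between the two inferred topic-proportion vectors (a sketch under that assumption; the paper's exact similarity function may differ):

```python
def topic_cosine(theta_q, theta_p):
    """Cosine similarity between two topic-proportion vectors of equal length,
    e.g. as inferred by a trained LDA model for the query and the prediction."""
    num = sum(a * b for a, b in zip(theta_q, theta_p))
    den = (sum(a * a for a in theta_q) ** 0.5) * (sum(b * b for b in theta_p) ** 0.5)
    return num / den if den else 0.0
```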
• Hypothesis I. Predictions that are more recent to the
query are more relevant
Temporal similarity
[Kanhabua et al., SIGIR 2011a]
298
• Hypothesis I. Predictions that are more recent to the
query are more relevant
Temporal similarity
• Hypothesis II. Predictions extracted from more recent
documents are more relevant
[Kanhabua et al., SIGIR 2011a]
299
• Learning-to-rank: Given an unseen (q, p), p is
ranked using a model trained over a set of labeled
query/prediction
– SVM-MAP [Yue et al., SIGIR 2007]
– RankSVM [Joachims, KDD 2002]
– SGD-SVM [Zhang, ICML 2004]
– PegasosSVM [Shalev-Shwartz et al., ICML 2007]
– PA-Perceptron [Crammer et al., J. Mach. Learn. 2006]
Ranking method
[Kanhabua et al., SIGIR 2011a]
300
• 42 future-related topics
Relevance judgments
[Kanhabua et al., SIGIR 2011a]
301
• New York Times Annotated Corpus
– 1.8 million articles, over 20 years
– More than 25% contain at least one prediction
• Annotation process uses several language processing
tools
– OpenNLP for tokenizing, sentence splitting, part-of-
speech tagging, shallow parsing
– SuperSense tagger for named entity recognition
– TARSQI for extracting temporal expressions
• Apache Lucene for indexing and retrieving.
– 44,335,519 sentences and 548,491 predictions
– 939,455 future dates (avg. future date/prediction is 1.7)
Experiments
[Kanhabua et al., SIGIR 2011a]
302
• Results
– Topic features play an important role in ranking
– The five features with the lowest weights are all
entity-based features
• Open issues
– Extract predictions from other sources, e.g.,
Wikipedia, blogs, comments, etc.
– Sentiment analysis for future-related information
Discussion
[Kanhabua et al., SIGIR 2011a]
303
References
• [Baeza-Yates SIGIR Forum 2005] Ricardo A. Baeza-Yates: Searching the future. SIGIR workshop MF/IR 2005
• [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A
Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
• [Blei et al., J. Mach. Learn. 2003] David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation.
Journal of Machine Learning Research 3: 993-1022 (2003)
• [Crammer et al., J. Mach. Learn. 2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz,
Yoram Singer: Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 7: 551-585 (2006)
• [Jatowt et al., JCDL 2009] Adam Jatowt, Kensuke Kanazawa, Satoshi Oyama, Katsumi Tanaka: Supporting
analysis of future-related information in news archives and the web. JCDL 2009: 115-124
• [Joachims, KDD 2002] Thorsten Joachims: Optimizing search engines using clickthrough data. KDD 2002: 133-
142
• [Kalczynski et al., Inf. Process. 2005] Pawel Jan Kalczynski, Amy Chou: Temporal Document Retrieval Model
for business news archives. Inf. Process. Manage. 41(3): 635-650 (2005)
• [Kanhabua et al., SIGIR 2011] Nattiya Kanhabua, Kjetil Nørvåg: A comparison of time-aware ranking methods.
SIGIR 2011: 1257-1258
• [Kanhabua et al., SIGIR 2011a] Nattiya Kanhabua, Roi Blanco, Michael Matthews: Ranking related news
predictions. SIGIR 2011: 755-764
• [Matthews et al., HCIR 2010] Michael Matthews, Pancho Tolchinsky, Roi Blanco, Jordi Atserias, Peter Mika,
Hugo Zaragoza: Searching through time in the new york times. HCIR workshop 2010
• [Shalev-Shwartz et al., ICML 2007] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter:
Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127(1): 3-30 (2011)
• [Yue et al., SIGIR 2007] Yisong Yue, Thomas Finley, Filip Radlinski, Thorsten Joachims: A support vector
method for optimizing average precision. SIGIR 2007: 271-278
• [Zhang, ICML 2004] Tong Zhang: Solving large scale linear prediction problems using stochastic gradient
descent algorithms. ICML 2004
304
Question?
305
2013 RBMS Premodern manuscript application profile presentation
 
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
 
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
 

Dernier

Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...Sebastiano Panichella
 
Application of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxApplication of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxRoquia Salam
 
Quality by design.. ppt for RA (1ST SEM
Quality by design.. ppt for  RA (1ST SEMQuality by design.. ppt for  RA (1ST SEM
Quality by design.. ppt for RA (1ST SEMCharmi13
 
GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRRsarwankumar4524
 
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptxerickamwana1
 
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxEngaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxAsifArshad8
 
A Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air CoolerA Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air Coolerenquirieskenstar
 
Chizaram's Women Tech Makers Deck. .pptx
Chizaram's Women Tech Makers Deck.  .pptxChizaram's Women Tech Makers Deck.  .pptx
Chizaram's Women Tech Makers Deck. .pptxogubuikealex
 
Internship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SEInternship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SESaleh Ibne Omar
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Sebastiano Panichella
 
cse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitycse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitysandeepnani2260
 
General Elections Final Press Noteas per M
General Elections Final Press Noteas per MGeneral Elections Final Press Noteas per M
General Elections Final Press Noteas per MVidyaAdsule1
 
proposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeegerproposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeegerkumenegertelayegrama
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...漢銘 謝
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRachelAnnTenibroAmaz
 
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityDon't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityApp Ethena
 

Dernier (17)

Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
 
Application of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxApplication of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptx
 
Quality by design.. ppt for RA (1ST SEM
Quality by design.. ppt for  RA (1ST SEMQuality by design.. ppt for  RA (1ST SEM
Quality by design.. ppt for RA (1ST SEM
 
GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
 
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
 
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxEngaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
 
A Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air CoolerA Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air Cooler
 
Chizaram's Women Tech Makers Deck. .pptx
Chizaram's Women Tech Makers Deck.  .pptxChizaram's Women Tech Makers Deck.  .pptx
Chizaram's Women Tech Makers Deck. .pptx
 
Internship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SEInternship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SE
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
 
cse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitycse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber security
 
General Elections Final Press Noteas per M
General Elections Final Press Noteas per MGeneral Elections Final Press Noteas per M
General Elections Final Press Noteas per M
 
proposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeegerproposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeeger
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
 
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityDon't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
 

Web Dynamics Insights

  • 1. Web Dynamics Analysis and Implications from Search Perspective 1
  • 2. Lecturers Dr. Ismail Sengör Altingövde – Senior researcher – L3S Research Center – Hannover, Germany Dr. Nattiya Kanhabua – Postdoctoral researcher – L3S Research Center – Hannover, Germany 2
  • 3. L3S Research Center  Computer science and interdisciplinary research on all aspects related to the Web 3
  • 4. People • 80 researchers: 53% post-docs & PhD students, 38% scientific research assistants & int’l interns • 15 countries across the US, Latin America, Asia, Europe and Africa • Equal opportunity: 27% female 4
  • 6. L3S: Some Projects Cubrik: Human-enhanced time-aware multimedia search ARCOMEM: From collect-all archives to community memories Terence: Adaptive learning for poor comprehenders M-Eco: The medical ecosystem detecting disease outbreaks 6
  • 8. Outline Day 1: Introduction • Evolution of the Web • Overview of research topics • Content and query analysis Day 3: Indexing the Past • Indexing and searching versioned documents Day 2: Evolution of Web Search Results • Short-term impacts • Longitudinal analysis Day 4: Retrieval and Ranking • Searching the past • Searching the future 8
  • 9. Evolution of the Web • The Web is changing over time in many aspects: – Size: web pages are added/deleted all the time – Content: web pages are edited/modified – Query: users’ information needs change, entity relationships change over time – Usage: users’ behaviors change over time [Ke et al., CN 2006; Risvik et al., CN 2002] [Dumais, SIAM-SDM 2012; WebDyn 2010] 9
  • 10. Size dynamics • Challenge – Crawling, indexing, and caching 10
  • 11. 2000 First billion-URL index The world’s largest! ≈5000 PCs in clusters! 1995 2012 Web and index sizes 11
  • 12. 2000 First billion-URL index The world’s largest! ≈5000 PCs in clusters!2004 Index grows to 4.2 billion pages 1995 2012 Web and index sizes 12
  • 13. 2000 First billion-URL index The world’s largest! ≈5000 PCs in clusters!2004 Index grows to 4.2 billion pages 1995 2012 2008 Google counts 1 trillion unique URLs Web and index sizes 13
  • 14. Web and index sizes 14
  • 15. Web and index sizes http://www.worldwidewebsize.com/ 15
  • 16. Content dynamics • Challenge – Document representation and retrieval Wayback Machine: a web archive search tool by the Internet Archive 16
  • 17. Content dynamics • Challenge – Document representation and retrieval 17
  • 18. Query dynamics • Challenge – Time-sensitive queries – Query understanding and processing Google Insights for Search: http://www.google.com/insights/search/ Query: Halloween 18
  • 19. Usage dynamics • Challenge – Browsing and search behavior – User preference (e.g., “likes”) 19
  • 20. • Searching documents created/edited over time – E.g., web archives, news archives, blogs, or emails Example application 20
  • 21. • Searching documents created/edited over time – E.g., web archives, news archives, blogs, or emails Example application Web archives news archives blogs emails “temporal document collections” 21
  • 22. • Searching documents created/edited over time – E.g., web archives, news archives, blogs, or emails Example application Web archives news archives blogs emails “temporal document collections” Retrieve documents about Pope Benedict XVI written before 2005 Term-based IR approaches may give unsatisfactory results 22
  • 23. • A web archive search tool by the Internet Archive – Query by a URL, e.g., http://www.ntnu.no Wayback Machine1 1 Retrieved on 15 January 2011 23
  • 24. • A web archive search tool by the Internet Archive – Query by a URL, e.g., http://www.ntnu.no Wayback Machine1 No keyword query 1 Retrieved on 15 January 2011 24
  • 25. • A web archive search tool by the Internet Archive – Query by a URL, e.g., http://www.ntnu.no Wayback Machine1 No keyword query No relevance ranking 1 Retrieved on 15 January 2011 25
  • 26. • A news archive search tool by Google – Query by keywords – Rank results by relevance or date Google News Archive Search 26
  • 27. • A news archive search tool by Google – Query by keywords – Rank results by relevance or date Google News Archive Search Does not consider terminology changes over time 27
  • 28. Research topics • Content Analysis – Determining timestamps of documents – Temporal information extraction • Query Analysis – Determining time of queries – Named entity evolution – Query performance prediction • Evolution of Search Results – Short-term impacts on result caches – Longitudinal analysis of search results 28
  • 29. Research topics (cont.) • Indexing – Indexing and query processing techniques for versioned document collections • Retrieval and Ranking – Searching the past – Searching the future 29
  • 30. Content Analysis (1) Determining timestamps of documents (2) Temporal information extraction
  • 31. Motivation • Incorporating the time dimension into search can increase retrieval effectiveness – Only if temporal information is available • Research question – How to determine the temporal information of documents? [Kanhabua, SIGIR Forum 2012] 31
  • 32. Two time aspects 1. Publication or modification time • Problem: determining timestamps of documents • Method: rule-based techniques, or temporal language models 2. Content or event time • Problem: temporal information extraction • Method: natural language processing, or time and event recognition algorithms 32
  • 33. Two time aspects 1. Publication or modification time • Problem: determining timestamps of documents • Method: rule-based techniques, or temporal language models 2. Content or event time • Problem: temporal information extraction • Method: natural language processing, or time and event recognition algorithms content time publication time 33
  • 34. Problem Statements • Difficult to find a trustworthy time for web documents – Time gap between crawling and indexing – Decentralization and relocation of web documents – No standard metadata for time/date Determining time of documents 34
  • 35. Problem Statements • Difficult to find a trustworthy time for web documents – Time gap between crawling and indexing – Decentralization and relocation of web documents – No standard metadata for time/date Determining time of documents I found a bible-like document. But I have no idea when it was created. “For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?” 35
  • 36. Problem Statements • Difficult to find a trustworthy time for web documents – Time gap between crawling and indexing – Decentralization and relocation of web documents – No standard metadata for time/date Determining time of documents Let me see… This document was probably written in 850 A.D. with 95% confidence. I found a bible-like document. But I have no idea when it was created. “For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?” 36
  • 37. Current approaches 1. Content-based 2. Link-based 3. Hybrid 37
  • 38. Content-based approach Temporal Language Models • Based on the statistical usage of words over time • Compare each word of a non-timestamped document with a reference corpus • Tentative timestamp: the time partition that most overlaps in word usage Partition / Word / Freq: 1999 tsunami 1; 1999 Japan 1; 1999 tidal wave 1; 2004 tsunami 1; 2004 Thailand 1; 2004 earthquake 1 [de Jong et al., AHC 2005] 38
  • 42. Content-based approach (cont.): a non-timestamped document containing “tsunami” and “Thailand” is compared against the partitions above. Similarity scores: Score(1999) = 1, Score(2004) = 1 + 1 = 2, so the most likely timestamp is 2004. [de Jong et al., AHC 2005] 42
  • 43. Normalized log-likelihood ratio • Variant of the Kullback-Leibler divergence • Measures the similarity of a document and each time partition • C is the background model estimated on the corpus • Linear interpolation smoothing avoids zero probabilities for unseen words [Kraaij, SIGIR Forum 2005] 43
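The temporal language model scoring on slides 38–43 can be sketched in a few lines of Python. This is a minimal unigram illustration with linear interpolation smoothing against the background model C; the function name and the smoothing weight `lam = 0.1` are illustrative choices, not the authors' implementation:

```python
import math
from collections import Counter

def nllr(doc_words, partition_words, corpus_words, lam=0.1):
    """Normalized log-likelihood ratio of a document against one time
    partition (after de Jong et al. and Kraaij). Linear interpolation
    with the background corpus model C smooths unseen words."""
    d, p, c = Counter(doc_words), Counter(partition_words), Counter(corpus_words)
    n_d, n_p, n_c = sum(d.values()), sum(p.values()), sum(c.values())
    score = 0.0
    for w, f in d.items():
        p_w_c = c[w] / n_c
        if p_w_c == 0:
            continue  # word unseen even in the reference corpus
        smoothed = (1 - lam) * (p[w] / n_p) + lam * p_w_c
        score += (f / n_d) * math.log(smoothed / p_w_c)
    return score
```

On the toy corpus of slide 38, a document containing "tsunami" and "Thailand" scores higher against the 2004 partition than against 1999, matching the example.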
  • 44. Improving temporal LMs • Enhancement techniques 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy Approach: Integrate semantic-based techniques into document preprocessing [Kanhabua et al., ECDL 2008] 44
  • 45. Improving temporal LMs • Enhancement techniques 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy Approach: Integrate semantic-based techniques into document preprocessing Intuition: Search statistics Google Zeitgeist (GZ) can increase the probability of a tentative time partition Approach: Linearly combine a GZ score with the normalized log-likelihood ratio [Kanhabua et al., ECDL 2008] 45
  • 46. Improving temporal LMs • Enhancement techniques 1. Semantic-based data preprocessing 2. Search statistics to enhance similarity scores 3. Temporal entropy as term weights Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy Approach: Integrate semantic-based techniques into document preprocessing Intuition: Search statistics Google Zeitgeist (GZ) can increase the probability of a tentative time partition Approach: Linearly combine a GZ score with the normalized log-likelihood ratio Intuition: A term weight depends on how good the term is for separating time partitions (discriminative) Approach: Propose temporal entropy, based on a term selection presented in Lochbaum and Streeter [Kanhabua et al., ECDL 2008] 46
  • 47. Semantic-based preprocessing Intuition: Direct comparison between extracted words and corpus partitions has limited accuracy Approach: Integrate semantic-based techniques into document preprocessing Semantic-based Preprocessing Description Part-of-speech tagging Select only interesting classes of words, e.g. nouns, verbs, and adjectives Collocation extraction Co-occurrence of different words can alter the meaning, e.g. “United States” Word sense disambiguation Identify the correct sense of a word from context, e.g. “bank” Concept extraction Compare concepts instead of original words, e.g. “tsunami” and “tidal wave” have the common concept of “disaster” Word filtering Select the top-ranked words according to TF-IDF scores for a comparison [Kanhabua et al., ECDL 2008] 47
  • 48. Leveraging search statistics Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition Approach: linearly combine a GZ score with the normalized log-likelihood ratio [Kanhabua et al., ECDL 2008] 48
  • 50. Leveraging search statistics Intuition: search statistics from Google Zeitgeist (GZ) can increase the probability of a tentative time partition Approach: linearly combine a GZ score with the normalized log-likelihood ratio P(wi) is the probability that wi occurs: P(wi) = 1.0 for a gaining query, P(wi) = 0.5 for a declining query; f(R) converts a rank into a weight (a higher-ranked query is more important); the inverse partition frequency is ipf = log N/n [Kanhabua et al., ECDL 2008] 50
  • 51. Temporal entropy Intuition: a term weight depends on how good the term is at separating time partitions (discriminativeness) Approach: propose temporal entropy, based on the term selection presented by Lochbaum and Streeter. Temporal entropy measures the temporal information a word conveys: it captures the importance of a term in a document collection (whereas TF-IDF weights a term in a particular document) and tells how good a term is at separating one partition from the others. A term occurring in few partitions has higher temporal entropy than one appearing in many partitions; the higher a term’s temporal entropy, the better it represents a partition. [Kanhabua et al., ECDL 2008] 51
  • 54. Temporal entropy (cont.): Np is the total number of partitions in the corpus; P(p) is the probability that a partition p contains the term wi. [Kanhabua et al., ECDL 2008] 54
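The temporal entropy described above can be sketched as follows. The formulation assumed here, TE(w) = 1 + (1/log Np) Σp P(p|w) log P(p|w) with P(p|w) estimated from the term's frequency per partition, is one reading consistent with the slide's description (maximal for a term confined to one partition, minimal for a term spread uniformly); treat it as an illustration rather than the exact published formula:

```python
import math

def temporal_entropy(tf_by_partition, n_partitions):
    """Temporal entropy of a term. tf_by_partition maps a partition id
    to the term's frequency in that partition. A term concentrated in
    few partitions scores close to 1; a term spread uniformly over all
    n_partitions scores close to 0."""
    total = sum(tf_by_partition.values())
    te = 1.0
    for tf in tf_by_partition.values():
        p = tf / total  # P(p | w)
        te += (p * math.log(p)) / math.log(n_partitions)
    return te
```

A term appearing in a single partition gets the maximal weight 1.0; a term spread evenly across all partitions gets weight 0, so it contributes nothing to dating a document.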
  • 55. Link-based approach • Dating a document using its neighbors 1. Web pages linking to the document • Incoming links 2. Web pages pointed to by the document • Outgoing links 3. Media assets associated with the document • E.g., images • Averaging the last-modified dates of its neighbors as timestamps [Nunes et al., WIDM 2007] 55
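The neighbor-averaging idea of the link-based approach reduces to a few lines. This is a minimal sketch of the averaging step only, not the full method of Nunes et al.; averaging over ordinal days is an illustrative implementation choice:

```python
from datetime import date

def estimate_timestamp(neighbor_dates):
    """Estimate a page's timestamp as the average of the last-modified
    dates of its neighborhood (in-links, out-links, embedded media),
    in the spirit of the link-based approach of Nunes et al."""
    if not neighbor_dates:
        return None  # no dated neighbors: nothing to infer from
    mean_ordinal = sum(d.toordinal() for d in neighbor_dates) / len(neighbor_dates)
    return date.fromordinal(round(mean_ordinal))
```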
  • 56. Hybrid approach • Inferring timestamps using machine learning – Exploit links, contents of a web pages and its neighbors – Features: linguistic, position, page formats, and tags [Chen et al., SIGIR 2010] 56
  • 57. Temporal information extraction • Three types of temporal expressions 1. Explicit: time mentions being mapped directly to a time point or interval, e.g., “July 4, 2012” 2. Implicit: imprecise time point or interval, e.g., “Independence Day 2012” 3. Relative: resolved to a time point or interval using other types or the publication date, e.g., “next month” [Alonso et al., SIGIR Forum 2007] 57
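Of the three expression types, only explicit ones can be caught with simple patterns; the toy regex below handles dates like "July 4, 2012" and bare years. Implicit ("Independence Day 2012") and relative ("next month") expressions require the rule-based NLP systems cited on the next slide (e.g., TARSQI, HeidelTime); the pattern here is a deliberately narrow illustration:

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches explicit expressions only: "Month D, YYYY" or a bare year.
EXPLICIT_DATE = re.compile(
    rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b|\b(?:19|20)\d{{2}}\b")

def explicit_expressions(text):
    """Return the explicit temporal expressions found in text."""
    return EXPLICIT_DATE.findall(text)
```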
  • 58. Current approaches • Extract temporal expressions from unstructured text – Time and event recognition algorithms • Harvest temporal knowledge from semi-structured contents like Wikipedia infoboxes – Rule-based methods [Verhagen et al., ACL 2005; Strötgen et al., SemEval 2010] [Hoffart et al., AIJ 2012] 58
  • 59. References • [Alonso et al., SIGIR Forum 2007] Omar Alonso, Michael Gertz, Ricardo A. Baeza-Yates: On the value of temporal information in information retrieval. SIGIR Forum 41(2): 35-41 (2007) • [Chen et al., SIGIR 2010] Zhumin Chen, Jun Ma, Chaoran Cui, Hongxing Rui, Shaomang Huang: Web page publication time detection and its application for page rank. SIGIR 2010: 859-860 • [Dumais, SIAM-SDM 2012] Susan T. Dumais: Temporal Dynamics and Information Retrieval. SIAM-SDM 2012 • [de Jong et al., AHC 2005] Franciska de Jong, Henning Rode, Djoerd Hiemstra: Temporal language models for the disclosure of historical text. AHC 2005: 161-168 • [Kanhabua et al., ECDL 2008] Nattiya Kanhabua, Kjetil Nørvåg: Improving Temporal Language Models for Determining Time of Non-timestamped Documents. ECDL 2008: 358-370 • [Kanhabua, SIGIR Forum 2012] Nattiya Kanhabua: Time-aware approaches to information retrieval. SIGIR Forum 46(1): 85 (2012) • [Ke et al., CN 2006] Yiping Ke, Lin Deng, Wilfred Ng, Dik Lun Lee: Web dynamics and their ramifications for the development of Web search engines. Computer Networks 50(10): 1430-1447 (2006) • [Kraaij, SIGIR Forum 2005] Wessel Kraaij: Variations on language modeling for information retrieval. SIGIR Forum 39(1): 61 (2005) • [Nunes et al., WIDM 2007] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Using neighbors to date web documents. WIDM 2007: 129-136 • [Risvik et al., CN 2002] Knut Magne Risvik, Rolf Michelsen: Search engines and Web dynamics. Computer Networks 39(3): 289-302 (2002) 59
  • 60. References (cont.) • [Strötgen et al., SemEval 2010] Jannik Strötgen, Michael Gertz: HeidelTime: High quality rule-based extraction and normalization of temporal expressions. SemEval 2010: 321-324 • [WebDyn 2010] Web Dynamics course: http://www.mpi-inf.mpg.de/departments/d5/teaching/ss10/dyn/, Max-Planck Institute for Informatics, Saarbrücken, Germany, 2010 • [Verhagen et al., ACL 2005] Marc Verhagen, Inderjeet Mani, Roser Sauri, Jessica Littman, Robert Knippen, Seok Bae Jang, Anna Rumshisky, John Phillips, James Pustejovsky: Automating Temporal Annotation with TARSQI. ACL 2005 60
  • 62. Query Analysis (1) Determining time of queries (2) Named entity evolution (3) Query performance prediction
  • 63. Temporal queries • Temporal information needs – Searching temporal document collections • E.g., digital libraries, web/news archives – Historians, librarians, journalists or students • Temporal queries exist in both standard collections and the Web – Relevancy is dependent on time – Documents are about events at particular time [Berberich et al., ECIR 2010] 63
  • 64. Distribution over time of Qrel Recency query Time-sensitive query Time-insensitive query [Li et al., CIKM 2003] 64
  • 65. Types of temporal queries • Two types of temporal queries 1. Explicit: time is provided, e.g., “Presidential election 2012” 2. Implicit: time is not provided, e.g., “Germany World Cup” • Temporal intent can be implicitly inferred • I.e., referring to the World Cup event in 2006 • Studies of web search query logs show a significant fraction of temporal queries – 1.5% of web queries are explicit – ~7% of web queries are implicit – 13.8% of queries contain explicit time and 17.1% of queries have temporal intent implicitly provided [Nunes et al., ECIR 2008; Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010] 65
  • 66. • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. terminology changes over time Challenges with temporal queries 66
  • 67. Challenges with temporal queries • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. terminology changes over time query time1 time2 … timek suggest 67
  • 69. Challenges with temporal queries • Semantic gaps: lacking knowledge about 1. possibly relevant time of queries 2. named entity changes over time 69
  • 70. Determining time of queries • Problem statements – Implicit temporal queries: users have no knowledge about the relevant time of a query – Difficult to achieve high accuracy using only keywords – Relevant results associated with a particular time are not given • Research question – How to determine the time of an implicit temporal query and use the determined time for improving search results? 70
  • 71. Current approaches 1. Query log analysis 2. Search result analysis [Kanhabua et al., ECDL 2010] 71
  • 72. Query log analysis • Mining query logs – Analyze query frequencies over time for identifying the relevant time of queries – Re-rank search results of implicit temporal queries using the determined time [Metzler et al., SIGIR 2009; Zhang et al., EMNLP 2010] 72
  • 73. Search result analysis • Using temporal language models – Determine time of queries when no time is given explicitly – Re-rank search results using the determined time • Exploiting time from search snippets – Extract temporal expressions (i.e., years) from the contents of top-k retrieved web snippets for a given query – Content-based, language-independent approach [Kanhabua et al., ECDL 2010; Campos et al., TempWeb 2012] 73
  • 74. Approach I. Dating using keywords* Approach II. Dating using top-k documents* – Queries are short keywords – Inspired by pseudo-relevance feedback Approach III. Using timestamp of top-k documents – No temporal language models are used *Using Temporal Language Models proposed by de Jong et al. Determining time of queries [Kanhabua et al., ECDL 2010] 74
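The dating approaches above build on the temporal language models of de Jong et al.: partition the collection by time, build a unigram language model per partition, and assign a query (or document) to the best-scoring partition. A minimal sketch, assuming Dirichlet smoothing and the `mu` parameter as our own illustrative choices:

```python
import math
from collections import Counter

def date_query(query_terms, partitions, mu=100.0):
    """Score query terms against each time partition's unigram language
    model (Dirichlet-smoothed against the whole-collection background)
    and return the best-matching time label.
    `partitions` maps a time label to a Counter of term frequencies.
    Illustrative sketch of temporal language models; smoothing choice
    and parameter names are assumptions, not the authors' exact setup."""
    # Background model over the whole collection
    background = Counter()
    for counts in partitions.values():
        background.update(counts)
    bg_total = sum(background.values())

    best_time, best_score = None, float("-inf")
    for time_label, counts in partitions.items():
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            p_bg = background[term] / bg_total if bg_total else 0.0
            # Dirichlet smoothing: (tf + mu * P(term|collection)) / (|partition| + mu)
            p = (counts.get(term, 0) + mu * p_bg) / (total + mu)
            score += math.log(p) if p > 0 else math.log(1e-12)
        if score > best_score:
            best_time, best_score = time_label, score
    return best_time
```

Approach I applies this directly to the query keywords; Approach II instead dates the concatenated top-k retrieved documents, in the spirit of pseudo-relevance feedback.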
  • 75. I. Dating using keywords [Kanhabua et al., ECDL 2010] 75
  • 76. I. Dating using keywords Query’s temporal profiles [Kanhabua et al., ECDL 2010] 76
  • 77. II. Dating using top-k documents [Kanhabua et al., ECDL 2010] 77
  • 78. II. Dating using top-k documents Query’s temporal profiles [Kanhabua et al., ECDL 2010] 78
  • 79. III. Using timestamp of documents [Kanhabua et al., ECDL 2010] 79
  • 80. III. Using timestamp of documents Query's temporal profiles [Kanhabua et al., ECDL 2010] 80
  • 81. • Intuition: documents published closely to the time of queries are more relevant – Assign document priors based on publication dates Re-ranking search results query News archive Determine time 2005, 2004, 2006, ... D2009 Initial retrieved results [Kanhabua et al., ECDL 2010] 81
  • 82. • Intuition: documents published closely to the time of queries are more relevant – Assign document priors based on publication dates Re-ranking search results query News archive Determine time 2005, 2004, 2006, ... D2009 Initial retrieved results D2005 Re-ranked results [Kanhabua et al., ECDL 2010] 82
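The re-ranking intuition above can be sketched as a time prior multiplied into the retrieval score: documents published close to the determined query time get a higher prior. The exponential decay and the `decay` parameter are illustrative assumptions, not the paper's exact prior:

```python
import math

def rerank_by_time(results, query_time, decay=0.5):
    """Re-rank retrieved documents by a prior that decays with the
    distance between the document's publication year and the determined
    query time. `results` is a list of (doc_id, score, pub_year) tuples.
    Sketch of the intuition in Kanhabua et al., ECDL 2010; the
    exponential form of the prior is an assumption."""
    def combined(item):
        doc_id, score, pub_year = item
        prior = math.exp(-decay * abs(pub_year - query_time))
        return score * prior
    return sorted(results, key=combined, reverse=True)
```

With a determined time of 2005, a slightly lower-scored document from 2005 outranks a higher-scored one from 2009, matching the D2009 vs. D2005 example on the slide.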
  • 83. Challenges of temporal search • Semantic gaps: lacking knowledge about 1. Possibly relevant time of queries 2. Named entity changes over time query synonym@2001 synonym@2002 … synonym@2011 suggest 83
  • 84. Problem Statements • Queries of named entities (people, company, place) – Highly dynamic in appearance, i.e., relationships between terms change over time – E.g., changes of roles, name alterations, or semantic shift Named entity evolution 84
  • 85. Problem Statements • Queries of named entities (people, company, place) – Highly dynamic in appearance, i.e., relationships between terms change over time – E.g., changes of roles, name alterations, or semantic shift Named entity evolution Scenario 1 Query: "Pope Benedict XVI" and written before 2005 Documents about "Joseph Alois Ratzinger" are relevant 85
  • 86. Problem Statements • Queries of named entities (people, company, place) – Highly dynamic in appearance, i.e., relationships between terms change over time – E.g., changes of roles, name alterations, or semantic shift Named entity evolution Scenario 1 Query: "Pope Benedict XVI" and written before 2005 Documents about "Joseph Alois Ratzinger" are relevant Scenario 2 Query: "Hillary R. Clinton" and written from 1997 to 2002 Documents about "New York Senator" and "First Lady of the United States" are relevant 86
  • 87. Research question • How to detect named entity changes in web documents? Named entity evolution 87
  • 89. Current approaches 1. Temporal co-occurrence 2. Temporal association rule mining 3. Temporal knowledge extraction – Ontology – Wikipedia history 89
  • 90. Temporal co-occurrence • Temporal co-occurrence – Measure the degree of relatedness of two entities at different times by comparing term contexts – Requires a recurrent computation at query time, which reduces efficiency and scalability [Berberich et al., WebDB 2009] 90
  • 91. Association rule mining • Temporal association rule mining – Discover semantically identical concepts (or named entities) that are used at different times – Two entities are semantically related if their associated events occur multiple times in a collection – Events are represented as sentences containing a subject, a verb, objects, and nouns [Kaluarachchi et al., CIKM 2010] 91
  • 92. Temporal knowledge extraction • YAGO ontology – Extract named entities from the YAGO ontology – Track named entity evolution using the New York Times Annotated Corpus • Wikipedia history – Define a time-based synonym as a term semantically related to a named entity at a particular time period – Extract synonyms of named entities from anchor texts in article links using the whole history of Wikipedia [Mazeika et al., CIKM 2011; Kanhabua et al., JCDL 2010] 92
  • 93. Searching with name changes • Extract time-based synonyms from Wikipedia – Synonyms are words with similar meanings – In this context, synonyms refer to name variants (name changes, titles, or roles) of a named entity • E.g., "Cardinal Joseph Ratzinger" is a synonym of "Pope Benedict XVI" before 2005 • Two types of time-based synonyms 1. Time-independent 2. Time-dependent [Kanhabua et al., JCDL 2010] 93
  • 94. Recognize named entities [Kanhabua et al., JCDL 2010] 94
  • 97. Find synonyms • Find a set of entity-synonym relationships at time tk [Kanhabua et al., JCDL 2010] 97
  • 98. Find synonyms • Find a set of entity-synonym relationships at time tk • For each ei ϵ Etk , extract anchor texts from article links: – Entity: President_of_the_United_States – Synonym: George W. Bush – Time: 11/2004 [Kanhabua et al., JCDL 2010] 98
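The anchor-text extraction above can be sketched as a pass over wiki markup: the link target names the entity, the anchor text is a candidate synonym, and the revision timestamp gives the time. A minimal regex-based version (real Wikipedia dumps need fuller markup handling; the function and its interface are illustrative assumptions):

```python
import re

def extract_synonyms(wikitext, timestamp):
    """Extract (entity, synonym, time) triples from piped article links
    of the form [[Target|anchor text]] in one Wikipedia revision.
    Illustrative sketch of the anchor-text mining in Kanhabua et al.,
    JCDL 2010; it ignores templates, unpiped links, and redirects."""
    triples = []
    for target, anchor in re.findall(r"\[\[([^\]|]+)\|([^\]]+)\]\]", wikitext):
        # Wikipedia article titles use underscores instead of spaces
        triples.append((target.replace(" ", "_"), anchor.strip(), timestamp))
    return triples
```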
  • 101. Initial results • Time periods are not accurate Note: the time of synonyms are timestamps of Wikipedia articles (8 years) [Kanhabua et al., JCDL 2010] 101
  • 102. • Analyze NYT Corpus to discover accurate time – 20-year time span (1987-2007) • Use the burst detection algorithm – Time periods of synonyms = burst intervals Enhancement using NYT [Kanhabua et al., JCDL 2010] 102
  • 104. • Analyze NYT Corpus to discover accurate time – 20-year time span (1987-2007) • Use the burst detection algorithm – Time periods of synonyms = burst intervals Enhancement using NYT Initial results [Kanhabua et al., JCDL 2010] 104
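The burst-detection step above sharpens the coarse Wikipedia timestamps into tighter time periods. The actual work uses a burst detection algorithm over the NYT mention counts; as a simplified stand-in (a threshold heuristic, our own assumption, rather than a Kleinberg-style state machine), an interval is a burst when yearly mentions exceed a multiple of the mean:

```python
def burst_intervals(counts, factor=2.0):
    """Find intervals where a synonym's yearly mention count exceeds
    `factor` times its mean over the whole span. `counts` maps
    year -> frequency; returns a list of (start, end) year pairs.
    Simplified illustrative stand-in for proper burst detection."""
    years = sorted(counts)
    mean = sum(counts.values()) / len(counts)
    intervals, start, prev = [], None, None
    for y in years:
        if counts[y] > factor * mean:
            if start is None:
                start = y          # burst opens
        elif start is not None:
            intervals.append((start, prev))  # burst closed at previous year
            start = None
        prev = y
    if start is not None:
        intervals.append((start, years[-1]))  # burst runs to end of span
    return intervals
```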
  • 105. Query expansion 1. A user enters an entity as a query QUEST Demo: http://research.idi.ntnu.no/wislab/quest/ [Kanhabua et al., ECML PKDD 2010] 105
  • 107. Query expansion 1. A user enters an entity as a query 2. The system retrieves synonyms wrt. the query QUEST Demo: http://research.idi.ntnu.no/wislab/quest/ [Kanhabua et al., ECML PKDD 2010] 107
  • 109. Query expansion 1. A user enters an entity as a query 2. The system retrieves synonyms wrt. the query 3. The user selects synonyms to expand the query QUEST Demo: http://research.idi.ntnu.no/wislab/quest/ [Kanhabua et al., ECML PKDD 2010] 109
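The three-step QUEST workflow above ends with building an expanded query from the entity and the user-selected synonyms. A minimal sketch (the OR-of-phrases syntax is an illustrative assumption; real engines differ):

```python
def expand_query(entity, selected_synonyms):
    """Build an expanded query from an entity and the time-based
    synonyms the user selected, quoting each as a phrase and joining
    with OR. Illustrative sketch of the QUEST expansion step."""
    terms = [f'"{entity}"'] + [f'"{s}"' for s in selected_synonyms]
    return " OR ".join(terms)
```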
  • 111. 1. Performance prediction – Predict the retrieval effectiveness wrt. a ranking model Query prediction problems 111
  • 112. 1. Performance prediction – Predict the retrieval effectiveness wrt. a ranking model Query prediction problems query precision = ? recall = ? MAP = ? predict 112
  • 113. 1. Performance prediction – Predict the retrieval effectiveness wrt. a ranking model 2. Retrieval model prediction – Predict the retrieval model that is most suitable Query prediction problems 113
  • 114. 1. Performance prediction – Predict the retrieval effectiveness wrt. a ranking model 2. Retrieval model prediction – Predict the retrieval model that is most suitable Query prediction problems query ranking = ? predict max(precision) max(recall) max(MAP) 114
  • 115. Problem Statement • Predict the effectiveness (e.g., MAP) that a query will achieve in advance of, or during, retrieval – high MAP → "good" – low MAP → "poor" Objective • Apply query enhancement techniques to improve the overall performance – Query suggestion is applied for "poor" queries Query performance prediction [Hauff et al., CIKM 2008; Hauff et al., ECIR 2010; Carmel et al., 2010] 115
  • 116. • First study of performance prediction for temporal queries – Propose 10 time-based pre-retrieval predictors • Both text and time are considered • Experiment – Collection: NYT Corpus and 40 temporal queries • Results – Time-based predictors outperform keyword-based predictors – Combined predictors outperform single predictors in most cases • Open issue – Consider time uncertainty Temporal query performance prediction [Kanhabua et al., SIGIR 2011] 116
  • 117. • Problem statement – Two time aspects: publication time and content time • Content time = temporal expressions mentioned in documents – Difference in effectiveness for temporal queries when ranking using publication time or content time Time-aware ranking prediction [Kanhabua et al., SIGIR 2012] 117
  • 118. • First study of the impact on retrieval effectiveness of ranking models using two time aspects • Three features from analyzing top-k results – Temporal KL-divergence [Diaz et al., SIGIR 2004] – Content Clarity [Cronen-Townsend et al., SIGIR 2002] – Divergence of retrieval scores [Peng et al., ECIR 2010] Learning to select time-aware ranking [Kanhabua et al., SIGIR 2012] 118
  • 119. • Measure the difference between the distribution over time of top-k retrieved documents of q and the collection – Consider both time dimensions Temporal KL-divergence [Diaz et al., SIGIR 2004] 119
  • 120. • The content clarity is measured by the Kullback- Leibler (KL) divergence between the distribution of terms of retrieved documents and the background collection Content Clarity [Cronen-Townsend et al., SIGIR 2002] 120
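Both features above reduce to a KL divergence between two discrete distributions: over time partitions for temporal KL-divergence, over terms for content clarity. A minimal sketch (the `eps` floor for zero probabilities is our own illustrative choice):

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete distributions given as dicts
    mapping event -> probability. Used both for temporal KL-divergence
    (distributions over time) and content clarity (distributions over
    terms vs. the background collection). `eps` smooths zeros in q."""
    score = 0.0
    for event, p_e in p.items():
        if p_e > 0:
            score += p_e * math.log(p_e / max(q.get(event, 0.0), eps))
    return score
```

A peaked top-k distribution diverging strongly from the background indicates a temporally (or topically) focused result set; KL is zero only when the two distributions coincide.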
  • 121. • Measure the divergence of scores from the base ranking, e.g., a non time-aware ranking model – To determine the extent that a ranking model alters the scores of the initial ranking • Features 1. averaged scores of the base ranking 2. averaged scores of PT-Rank 3. averaged scores of CT-Rank 4. divergence from the base ranking model Divergence of ranking scores [Peng et al., ECIR 2010] 121
  • 122. Discussion • Results – A small number of top-k documents achieves better performance – The larger k is, the more irrelevant documents are introduced into the analysis • Open issue – When compared with the optimal case, there is still room for further improvement [Kanhabua et al., SIGIR 2012] 122
  • 123. References • [Berberich et al., WebDB 2009] Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio, Gerhard Weikum: Bridging the Terminology Gap in Web Archive Search. WebDB 2009 • [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25 • [Campos et al., TempWeb 2012] Ricardo Campos, Gaël Dias, Alípio Jorge, Celia Nunes: Enriching temporal query understanding through date identification: How to tag implicit temporal queries? TWAW 2012: 41-48 • [Carmel et al., 2010] David Carmel, Elad Yom-Tov: Estimating the Query Difficulty for Information Retrieval Morgan & Claypool Publishers 2010 • [Cronen-Townsend et al., SIGIR 2002] Stephen Cronen-Townsend, Yun Zhou, W. Bruce Croft: Predicting query performance. SIGIR 2002: 299-306 • [Diaz et al., SIGIR 2004] Fernando Diaz, Rosie Jones: Using temporal profiles of queries for precision prediction. SIGIR 2004: 18-24 • [Hauff et al., CIKM 2008] Claudia Hauff, Vanessa Murdock, Ricardo A. Baeza-Yates: Improved query difficulty prediction for the web. CIKM 2008: 439-448 • [Hauff et al., ECIR 2010] Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, Franciska de Jong: Query Performance Prediction: Evaluation Contrasted with Effectiveness. ECIR 2010: 204-216 • [Kaluarachchi et al., CIKM 2010] Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J. Bedathur, Gerhard Weikum, Jing Peng, Anna Feldman: Incorporating terminology evolution for query translation in text retrieval with association rules. CIKM 2010: 1789-1792 123
  • 124. References (cont.) • [Kanhabua et al., JCDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Exploiting time-based synonyms in searching document archives. JCDL 2010: 79-88 • [Kanhabua et al., ECDL 2010] Nattiya Kanhabua, Kjetil Nørvåg: Determining Time of Queries for Re-ranking Search Results. ECDL 2010: 261-272 • [Kanhabua et al., SIGIR 2011] Nattiya Kanhabua, Kjetil Nørvåg: Time-based query performance predictors. SIGIR 2011: 1181-1182 • [Kanhabua et al., SIGIR 2012] Nattiya Kanhabua, Klaus Berberich, Kjetil Nørvåg: Learning to select a time-aware retrieval model. (To appear) SIGIR 2012 • [Li et al., CIKM 2003] Xiaoyan Li, W. Bruce Croft: Time-based language models. CIKM 2003: 469-475 • [Metzler et al., SIGIR 2009] Donald Metzler, Rosie Jones, Fuchun Peng, Ruiqiang Zhang: Improving search relevance for implicitly temporal queries. SIGIR 2009: 700-701 • [Mazeika et al., CIKM 2011] Arturas Mazeika, Tomasz Tylenda, Gerhard Weikum: Entity timelines: visual analytics and named entity evolution. CIKM 2011: 2585-2588 • [Nunes et al., ECIR 2008] Sérgio Nunes, Cristina Ribeiro, Gabriel David: Use of Temporal Expressions in Web Search. ECIR 2008: 580-584 • [Peng et al., ECIR 2010] Jie Peng, Craig Macdonald, Iadh Ounis: Learning to Select a Ranking Function. ECIR 2010: 114-126 • [Zhang et al., EMNLP 2010] Ruiqiang Zhang, Yuki Konda, Anlei Dong, Pranam Kolari, Yi Chang, Zhaohui Zheng: Learning Recurrent Event Queries for Web Search. EMNLP 2010: 1129-1139 124
  • 126. Evolution of Web Search Results (1) Short-term impacts: Caching Results (2) Longitudinal analysis of search results RuSSIR 2012 126
  • 127. Architecture of a Search Engine • How do search results change? 1. Crawl the Web 2. Index the content 3. Go to line 1 Crawler Document Collection Indexer Indexes Query processor Answers WWW 127
  • 128. Architecture of a Search Engine • How do search results change? 1. Crawl the Web 2. Index the content 3. Go to line 1 Crawler Document Collection Indexer Indexes Query processor Answers WWW on-line 128
  • 130. Indexing the Dynamic Web • How to refresh the index? – Batch update (re-build) – Re-merge – In-place • Batch update – Shadowing – Simplest, the old index can keep serving at high rates [Lester et al., IPM 2006] 130
  • 131. Indexing the Dynamic Web • Re-merge – A “buffer” B of new index entries – Compute queries over index I and B and merge results – Merge B with I when size(B) > threshold – Optimizations: logarithmic and geometric merging [Lester et al., IPM 2006] 131
  • 132. Indexing the Dynamic Web • In-place – Over-allocation: Leave free space at the end of each list – Add new entries to the free space; o.w., relocate to a new position on disk [Lester et al., IPM 2006] 132
  • 133. Query Processing • Is updating the underlying index enough? If a query is submitted for the first time, it is processed over an up-to-date index. But... what about the cached results? 133
  • 134. Why Result Caching? • Around 50% of a query stream is composed of repeating queries [Fagni et al., TOIS 2006] Query popularity: heavy tail distribution 134
  • 135. Result Caches • Result cache: top-k urls with snippets per query • Results caching helps to reduce – query response time – traffic volume hitting the backend “bin laden dead” BACKEND miss (compulsory) hit 135
  • 136. Problem: dynamicity! • Key observations: – Caches can be very large (thanks to cheap hardware) – The Web changes rapidly (thanks to users) "bin laden dead" BACKEND Minutes, days, or weeks, based on Q 136
  • 137. A deeper look: Capacity vs. Hits • Query log: Yahoo! Search engine – Few cache nodes during 9 days • Cache capacity – can be very large – eviction policy is less of a concern • Hit rate: fraction of queries that are hits • Higher capacity → more hits! [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 137
  • 138. Capacity vs. age • Hit age: time since the last update • Higher capacity → higher age! [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 138
  • 139. Strategies for cache freshness • Decoupled approaches: cache does not know what changed • Coupled approaches: cache uses clues from the index to predict what changed [Cambazoglu et al., WWW 2010] 139
  • 140. Decoupled approaches: Flushing • Solution: flush periodically – Coincides with new index – Bounds average age – Impacts hit rate negatively [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 140
  • 141. Decoupled approaches: TTL • Time-to-live (TTL) – Assign a fixed TTL value to each cached result – Pros: Practical, almost no implementation cost – Cons: Blind strategy, may be sub-optimal … qi Result Cache q1 R1 TTL(q1) q2 R2 TTL(q2) qk Rk TTL(qk) 141
  • 142. Decoupled approaches: TTL • TTL per query – Stale results → Acceptable: search engines do not have a perfect view! – Stable hit rate; average age is still bounded! [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 142
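The fixed-TTL cache described above can be sketched in a few lines: an entry answers hits until its time-to-live expires, after which the backend recomputes the result. Class and parameter names are our own; eviction is omitted:

```python
import time

class TTLResultCache:
    """Minimal fixed-TTL result cache, an illustrative sketch of the
    decoupled TTL strategy: the cache never learns what changed in the
    index, it simply expires entries after a fixed interval."""
    def __init__(self, ttl_seconds, backend, clock=time.time):
        self.ttl = ttl_seconds
        self.backend = backend        # callable: query -> results
        self.clock = clock            # injectable clock, handy for testing
        self.entries = {}             # query -> (results, stored_at)

    def get(self, query):
        now = self.clock()
        if query in self.entries:
            results, stored_at = self.entries[query]
            if now - stored_at <= self.ttl:
                return results        # fresh hit
        results = self.backend(query) # miss or expired: recompute
        self.entries[query] = (results, now)
        return results
```

Injecting the clock makes the expiry behaviour easy to simulate without waiting out real TTLs.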
  • 143. Intelligent Refreshing • Enhancement to the TTL mechanism • Updates entries – Re-execute queries – Uses idle cycles • Ideally – update: low activity – use: high activity • How to select queries for refreshing? [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 143
  • 144. Intelligent refreshing Temperature [Cambazoglu et al., WWW 2010] (Slide provided by the authors) • A new query goes to young and cold bucket! • Fixed “query interval” for shifting age • Lazy update of temperature • Refresh hot and old first! 144
  • 145. Refresh-rate adjustment • Idle cycles? • Critical mechanism – Prevent overloads • Feedback from the query processors – Latency to process query • Track recent latency – Adjust rate accordingly Query Query Processors Cache Results,Latency [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 145
  • 146. Design summary • Large capacity: millions of entries • TTL: bounds the age of entries • Refreshes: updates cache entries • Refresh rate adjustment: latency feedback [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 146
  • 147. Cache evaluation • Simulation – Yahoo! query log • Baseline policies – No refresh – Cyclic refresh [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 147
  • 149. Performance in production [Cambazoglu et al., WWW 2010] (Slide provided by the authors) (figure: alternating periods with refreshes on and refreshes off) 149
  • 150. Degradation • Under high load – Processors degrade results – Shorter life • TTL mechanism – Prioritize for refreshing – Adjusts TTL when degraded • Refreshes – Replace with better results [Cambazoglu et al., WWW 2010] (Slide provided by the authors) 150
  • 151. Adaptive TTL • Up to now we considered fixed TTL values • Are all queries equal? – “quantum physics” – “barcelona FC” – “ecir 2012” • Another promising direction is assigning adaptive TTL values [Alici et al., ECIR 2012] (Slide provided by the authors) 151
  • 152. Adaptive TTL • Up to now we considered fixed TTL values • Are all queries equal? – “quantum physics” – “barcelona FC” – “ecir 2012” • Another promising direction is assigning adaptive TTL values One size does not fit all!!! [Alici et al., ECIR 2012] (Slide provided by the authors) 152
  • 153. Adaptive I: Average TTL • Observe the past update frequency for the top-k results of a query and compute the average • Simple, but – Needs history – May not capture bursty update periods 5 1 6 4 NOW: q submitted again! TTL = (5+1+6+4)/4 = 4 PAST: Update history [Alici et al., ECIR 2012] (Slide provided by the authors) 153
  • 154. Adaptive II: Incremental TTL • Adjust the new TTL value based on the current TTL value • Each time a cached result Rcached with an expired TTLcurr is requested: compute Rnew; if Rnew ≠ Rcached (stale), set TTLnew = TTLmin to catch bursty updates; else increase TTLnew above TTLcurr (bounded by a maximum) [Alici et al., ECIR 2012] (Slide provided by the authors) 154
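The incremental rule above can be sketched as: reset to the minimum TTL on a stale result, grow multiplicatively otherwise. The doubling factor and the cap are illustrative assumptions about the growth branch, not the paper's exact formula:

```python
def incremental_ttl(result_changed, ttl_curr, ttl_min=1, ttl_max=32, factor=2):
    """Adjust a query's TTL from its current value: reset to the minimum
    when the recomputed result differs from the cached one (to catch
    bursty update periods), otherwise grow it multiplicatively up to a
    cap. Sketch of the incremental strategy; factor/cap are assumptions."""
    if result_changed:
        return ttl_min
    return min(ttl_curr * factor, ttl_max)
```

Queries with stable results thus quickly earn long TTLs, while a single observed change snaps the TTL back to aggressive revalidation.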
  • 155. Adaptive III: Machine-learned • Build an ML model – Query and result-specific feature set F – Assume a query result changes at time point ti • Create a training instance with features Fi • Set target feature (TTLnew) as the time period from ti to the time point tj where query result changes again: tj-ti 5 1 6 4 <F1, 5> <F2, 1> <Fn, ?> t1 t2 [Alici et al., ECIR 2012] (Slide provided by the authors) 155
  • 156. Features for ML [Alici et al., ECIR 2012] (Slide provided by the authors) 156
  • 157. Analysis of Web Search Results • Setup: – 4,500 queries sampled from the AOL log – Submitted to the Yahoo! API daily from Nov 2010 to April 2011 (120 days) – For days di and di+1: if Ri ≠ Ri+1 (considering the top-10 urls and their order), we consider the result as updated! [Alici et al., ECIR 2012] (Slide provided by the authors) 157
  • 158. Distribution of result updates • On average, query results are updated in every 2 days! [Alici et al., ECIR 2012] (Slide provided by the authors) 158
  • 159. Effect of query frequency • More frequent → results updated more frequently • Less frequent → scattered [Alici et al., ECIR 2012] (Slide provided by the authors) 159
  • 160. Effect of query length • No correlation between query length and update frequency [Alici et al., ECIR 2012] (Slide provided by the authors) 160
  • 161. Simulation • Assume all 4,500 queries are submitted daily for 120 days – On day-0, all results are cached – On the following days, whenever the TTL expires, the new result is computed and replaces old ones (i.e., result for that day from Yahoo! API) [Alici et al., ECIR 2012] (Slide provided by the authors) 161
  • 162. Simulation setup • Evaluation metrics [Blanco 2010] – At the end of each day, we compute: False Positive Ratio = (redundant query executions) / (number of unique queries) Stale Traffic Ratio = (stale results returned) / (number of query occurrences) [Alici et al., ECIR 2012] (Slide provided by the authors) 162
  • 163. Simulation setup • While computing ST ratio for a given query at day i: Strict policy: – if Rcached is not strictly equal to Rday-i staleCount++ Relaxed policy: – if Rcached is not strictly equal to Rday-i staleness += 1 – JaccardSim(Rcached, Rday-i) [Alici et al., ECIR 2012] (Slide provided by the authors) 163
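The strict and relaxed policies above differ only in how much a partially changed result contributes to stale traffic. A minimal sketch (function name and interface are our own):

```python
def staleness(cached, fresh, policy="strict"):
    """Per-query staleness contribution for one day, as in the simulation
    above: under the strict policy, 1 whenever the cached top-10 differs
    at all (order included); under the relaxed policy, 1 minus the
    Jaccard similarity of the two URL sets. Illustrative sketch."""
    if cached == fresh:
        return 0.0
    if policy == "strict":
        return 1.0
    inter = len(set(cached) & set(fresh))
    union = len(set(cached) | set(fresh))
    return 1.0 - (inter / union if union else 1.0)
```

Under the relaxed policy a result that kept 9 of its 10 URLs contributes far less staleness than a completely rewritten one.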
  • 164. Performance: Incremental TTL Strict policy Relaxed policy [Alici et al., ECIR 2012] (Slide provided by the authors) 164
  • 165. Less-often updated queries More-often updated queries Performance: Incremental TTL [Alici et al., ECIR 2012] (Slide provided by the authors) 165
  • 166. Performance: Average and Machine Learned TTL Strict policy Relaxed policy [Alici et al., ECIR 2012] (Slide provided by the authors) 166
  • 167. Coupled approaches: CIP • Cache invalidation policy (CIP) • Key idea: Compare all added/updated/deleted documents to cached queries – Incremental index update → updates reflected on-line! [Blanco et al., SIGIR 2010] all queries in cache(s) all changes in the backend index CIP module 167
  • 168. Coupled approaches: CIP Incremental index update: updates reflected on-line! Content changes sent to CIP module to invalidate queries (offline) [Blanco et al., SIGIR 2010] 168
  • 169. Synopses • Synopsis: A vector of a document's top-scoring TF-IDF terms • η: length of synopsis • δ: modification threshold → consider a document d as updated only if diff(d, dold) > δ [Blanco et al., SIGIR 2010] 169
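Building a synopsis as described above amounts to ranking a document's terms by TF-IDF and keeping the top η. A minimal sketch, with the IDF variant and the interface as our own assumptions:

```python
import math
from collections import Counter

def build_synopsis(doc_terms, doc_freq, n_docs, eta=5):
    """Build a document synopsis: the eta top-scoring TF-IDF terms,
    later matched against cached queries to detect invalidations.
    `doc_freq` maps term -> number of documents containing it.
    Illustrative sketch; the exact weighting in Blanco et al. may differ."""
    tf = Counter(doc_terms)
    def tfidf(term):
        # Smoothed IDF so terms appearing in (almost) every document score ~0
        return tf[term] * math.log(n_docs / (1 + doc_freq.get(term, 0)))
    return sorted(tf, key=tfidf, reverse=True)[:eta]
```

Frequent but uninformative terms ("the") drop out, so the synopsis stays a compact signature of what the document is actually about.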
  • 170. Invalidation Approaches • Invalidation logic (for added docs) [Blanco et al., SIGIR 2010] Basic: If the synopsis matches the query (i.e., involves all query terms) Scored: Compare Sim(synopsis, query) to the score of the k-th result Example 170
  • 171. Invalidation Approaches • Invalidation logic (for deleted docs): – Invalidate all query results including a deleted document • Update is a deletion followed by an addition • Further apply TTL to guarantee an age-bound • Scored + TTL [Blanco et al., SIGIR 2010] 171
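The Basic policy for added documents reduces to a set-containment check: a cached query is invalidated when the new document's synopsis covers every query term. A minimal sketch (names are our own):

```python
def invalidates(synopsis, cached_queries):
    """Basic invalidation logic for an added document: a cached query is
    marked stale when the document's synopsis contains all of the query's
    terms. Returns the set of queries to invalidate. Illustrative sketch
    of the Basic policy; the Scored variant compares similarity scores
    instead."""
    terms = set(synopsis)
    return {q for q in cached_queries if set(q.split()) <= terms}
```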
  • 172. Evaluation • Really hard to evaluate if you are not sitting at the production department in a real search engine company! 172
  • 173. Simulation Setup • Data: English wikipedia dump – snapshot at Jan 1, 2006 ≈ 1 million pages – All add/deletes/updates for following 30 days • Queries: 10,000 from AOL log [Blanco et al., SIGIR 2010; Alici et al. SIGIR 2011] 173
  • 174. Simulation setup • Evaluation metric – The query result is updated if the two top-10 lists are not exactly the same False Positive Ratio = (redundant query executions) / (number of unique queries) Stale Traffic Ratio = (stale results returned) / (number of query occurrences) [Blanco et al., SIGIR 2010] 174
  • 175. Performance • Setup – Score + TTL – Full synopsis • Smaller δ or TTL: – Lower ST, higher FP [Blanco et al., SIGIR 2010] TTL: 1≤t≤5 t=1 t=2 t=3 t=4 t=5 δ=0, 2≤t≤10 δ=0.005, 3≤t≤10 δ=0.01, 5≤t≤20 175
  • 176. TIF: Timestamp-based Invalidation Framework • Devise a new invalidation mechanism – better than TTL and close to CIP in detecting stale results – better than CIP and close to TTL in efficiency and practicality [Alici et al., SIGIR 2011] (Slide provided by the authors) 176
  • 177. Timestamp-based Invalidation • The value of the TS on an item shows the last time the item was updated • TIF has two components: – Offline (indexing time) : Decide on term and document timestamps – Online (query time): Decide on the staleness of the query result [Alici et al., SIGIR 2011] (Slide provided by the authors) 177
  • 178. TIF Architecture … Doc. parser documents assigned to the node SEARCH NODE index updates … t1 t2 tT [Alici et al., SIGIR 2011] (Slide provided by the authors) 178
  • 179. TIF Architecture … Doc. parser documents assigned to the node SEARCH NODE index updates … t1 t2 tT … Document timestamps document TS updates TS(d1) TS(d2) TS(dD) Document timestamps [Alici et al., SIGIR 2011] (Slide provided by the authors) 179
  • 180. TIF Architecture … Doc. parser documents assigned to the node SEARCH NODE index updates … Document timestamps document TS updates TS(d1) TS(d2) TS(dD) Document timestamps [Alici et al., SIGIR 2011] (Slide provided by the authors) 180
  • 181. TIF Architecture … Doc. parser documents assigned to the node SEARCH NODE index updates term TS updates … TS(t1) TS(t2) TS(tT) … Document timestamps document TS updates TS(d1) TS(d2) TS(dD) Document timestamps [Alici et al., SIGIR 2011] (Slide provided by the authors) 181
  • 182. TIF Architecture … Doc. parser documents assigned to the node SEARCH NODE index updates qi results Result cache 0/1 Invalidation logic miss/stale qi, Ri, TS(qi) … q1 R1 TS(q1) q2 R2 TS(q2) qk Rk TS(qk) qi results Result cache 0/1 Invalidation logic miss/stale qi, Ri, TS(qi) … q1 R1 TS(q1) q2 R2 TS(q2) qk Rk TS(qk) term TS updates … TS(t1) TS(t2) TS(tT) … Document timestamps document TS updates TS(d1) TS(d2) TS(dD) Document timestamps [Alici et al., SIGIR 2011] (Slide provided by the authors) 182
  • 183. TS Update Policies: Documents • For a newly added document d – TS(d) = now() • For a deleted document d – TS(d) = infinite • For an updated document d – if diff(dnew, dold) > L TS(d) = now() – diff(di, dj): |length(di) – length(dj)| [Alici et al., SIGIR 2011] (Slide provided by the authors) 183
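The document-timestamp policy above can be sketched directly: additions stamp `now()`, deletions stamp infinity (so they always invalidate), and updates refresh the timestamp only when the length difference exceeds the threshold L. Function name and interface are our own:

```python
def update_doc_ts(ts_doc, doc_id, event, now, new_len=0, old_len=0, L=50):
    """Apply the TIF document-timestamp update policy for one index
    event. `ts_doc` maps doc_id -> timestamp; `L` is the length-diff
    threshold for updates. Illustrative sketch of the policy in
    Alici et al., SIGIR 2011."""
    if event == "add":
        ts_doc[doc_id] = now
    elif event == "delete":
        ts_doc[doc_id] = float("inf")   # always newer than any result
    elif event == "update" and abs(new_len - old_len) > L:
        ts_doc[doc_id] = now            # minor edits keep the old stamp
    return ts_doc
```

Using infinity for deletions means condition C1 below fires for every cached result still listing the deleted document, regardless of when that result was generated.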
  • 184. TS Update Policies: Terms • Frequency-based update – Initially TS(t) = T0 with posting list length PLLTS – When the number of added postings > F × PLLTS, set TS(t) = now() and record the new PLLTS (figure: posting list growing from 5 to 6 entries) [Alici et al., SIGIR 2011] (Slide provided by the authors) 184
  • 185. TS Update Policies: Terms • Score-based update – Sort the posting list w.r.t. the scoring function; initially TS(t) = T0 with threshold STS (Score(p3) in the figure) – When the score of an added posting > STS, set TS(t) = now(), re-sort, and recompute STS [Alici et al., SIGIR 2011] (Slide provided by the authors) 185
  • 186. Result Invalidation Policy • A search node decides a result is stale if: – C1: ∃d ∈ R s.t. TS(d) > TS(q) (d is deleted or revised after the generation of the query result) or, – C2: ∀t ∈ q, TS(t) > TS(q) (all query terms appeared in new documents after the generation of the query result) • Also apply TTL to avoid stale accumulation [Alici et al., SIGIR 2011] (Slide provided by the authors) 186
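Conditions C1 and C2 above translate to two timestamp comparisons per cached result. A minimal sketch (function name and the dict-based interface are our own assumptions):

```python
def is_stale(query_terms, result_docs, ts_query, ts_doc, ts_term):
    """TIF staleness decision for a cached query result generated at
    time ts_query: stale if any returned document was deleted or revised
    afterwards (C1), or if every query term's timestamp is newer than
    the result (C2). `ts_doc`/`ts_term` map ids to timestamps; missing
    entries default to 0 (never updated). Illustrative sketch."""
    c1 = any(ts_doc.get(d, 0) > ts_query for d in result_docs)
    c2 = all(ts_term.get(t, 0) > ts_query for t in query_terms)
    return c1 or c2
```

Both checks are constant-time lookups per document or term, which is the efficiency advantage over CIP's per-document traversal of the query index.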
  • 187. Simulation Setup • Data: English wikipedia dump – snapshot at Jan 1, 2006 ≈ 1 million pages – All add/deletes/updates for following 30 days • Queries: 10,000 from AOL log [Alici et al., SIGIR 2011] 187
  • 188. Simulation setup • Evaluation metrics [Blanco 2010] – The query result is updated if the two top-10 lists are not exactly the same False Positive Ratio = (redundant query executions) / (number of unique queries) Stale Traffic Ratio = (stale results returned) / (number of query occurrences) [Alici et al., SIGIR 2011] (Slide provided by the authors) 188
  • 189. Performance: all queries Frequency-based term TS update Score-based term TS update [Alici et al., SIGIR 2011] (Slide provided by the authors) 189
  • 190. Performance: single-term queries Frequency-based term TS update Score-based term TS update [Alici et al., SIGIR 2011] (Slide provided by the authors) 190
  • 191. Invalidation Cost • Data transfer – TIF: send <q, R, TS(q)> to the search nodes – CIP: send all <q, R> and all documents to the CIP module • Invalidation operations – TIF: compare TS values – CIP: traverse the query index for every document [Alici et al., SIGIR 2011] (Slide provided by the authors) 191
  • 192. Performance: TIF • A simple yet effective invalidation approach • Predicting stale queries – Better than TTL, close to CIP • Efficiency and practicality – Straightforward in a distributed system 192
  • 193. Grand Summary • Decoupled: Fixed TTL with refreshing (hit rate and hit age on a Yahoo! Web sample) and Adaptive TTL (ST and FP ratios on AOL) • Coupled: TIF and CIP (ST and FP ratios on Wikipedia + AOL) • Moving from fixed TTL toward CIP, ST and FP ratios decrease while invalidation cost increases 193
  • 194. Open Research Directions Investigate: • User satisfaction vs. freshness • Complex ranking functions • Alternative index update strategies • Combinations • Adaptive TTL + TIF or CIP • Adaptive TTL + refresh strategy 194
  • 195. Evolution of Web Search Results • Earlier works consider the changes in the content of the – Web – queries • How do real-life search engines react to this dynamicity? • Compare results from the Yahoo! API – for 630K real-life queries (from the AOL log) – obtained in 2007 and 2010 [Altingovde et al., SIGIR 2011] (Slide provided by the authors) 195
  • 196. What is Novel? • Queries are real, not synthetic • Query set is large • Results from a search engine at two very distant points in time • Focus on the properties of results, but not the underlying content [Altingovde et al., SIGIR 2011] (Slide provided by the authors) 196
  • 197. We Seek to Answer: • How is the growth of the Web reflected in top-ranked query results? • Do the query results totally change over time? • Are results located deeper in sites? • Is there any change in result title and snippet properties? [Altingovde et al., SIGIR 2011] (Slide provided by the authors) 197
  • 198. Avg no. of results • Almost tripled (from 16M to 52M), but not uniformly [Altingovde et al., SIGIR 2011] (Slide provided by the authors) 198
  • 199. No. of unique URLs • 20% of the URLs returned at the highest rank in 2010 were at the same position in 2007! [Altingovde et al., SIGIR 2011] (Slide provided by the authors) 199
  • 200. No. of unique domains • The increase in unique domain names in 2010 is more pronounced than the increase in the number of unique URLs (diversity? coverage?) • Even higher overlap for top-1 domains [Altingovde et al., SIGIR 2011] (Slide provided by the authors) 200
  • 201. Research Directions • How are the results diversified at these two different time points? –Can we deduce this from snippets? • How does the level of bias in query results change? [Altingovde et al., SIGIR 2011] (Slide provided by the authors) 201
  • 202. References • [Alici et al., SIGIR 2011] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Timestamp-based result cache invalidation for web search engines. SIGIR 2011: 973-982 • [Alici et al., ECIR 2012] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Adaptive Time-to-Live Strategies for Query Result Caching in Web Search Engines. ECIR 2012: 401-412 • [Altingovde et al., SIGIR 2011] Ismail Sengör Altingövde, Rifat Ozcan, Özgür Ulusoy: Evolution of web search results within years. SIGIR 2011: 1237-1238 • [Blanco et al., SIGIR 2010] Roi Blanco, Edward Bortnikov, Flavio Junqueira, Ronny Lempel, Luca Telloli, Hugo Zaragoza: Caching search engine results over incremental indices. SIGIR 2010: 82-89 • [Cambazoglu et al., WWW 2010] Berkant Barla Cambazoglu, Flavio Paiva Junqueira, Vassilis Plachouras, Scott A. Banachowski, Baoqiu Cui, Swee Lim, Bill Bridge: A refreshing perspective of search engine caching. WWW 2010: 181-190 • [Fagni et al., TOIS 2006] Tiziano Fagni, Raffaele Perego, Fabrizio Silvestri, Salvatore Orlando: Boosting the performance of Web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Trans. Inf. Syst. 24(1): 51-78 (2006) • [Lester et al., IPM 2006] Nicholas Lester, Justin Zobel, Hugh E. Williams: Efficient online index maintenance for contiguous inverted lists. Inf. Process. Manage. 42(4): 916-933 (2006) 202
  • 204. Indexing the Past (1) Indexing and searching methods for versioned document collections RuSSIR 2012 204
  • 205. Time Travel? • Time travel is easy (in IR)! You only need: – The data – The mechanisms to search • indexing + query processing – (And fasten your seat belts!) 205
  • 206. Why Search the Past? • Historical information needs (an analyst working on the social reactions on the net after 9/11) • To find relevant resources that do not exist anymore • To discover trends, opinions, etc. for a certain time period (what did people think about UFOs at the beginning of the millennium?) 206
  • 207. Data: Preserving the Web • Non-profit organizations: – Internet Archive, European Archive, ... • Several EU projects – Planets, PROTAGE, PrestoPRIME, Scape, ENSURE, BlogForever, LiWA, LAWA, ARCOMEM... http://archive.org/ 207
  • 208. Data • There are also other “versioned” data collections – Wikis (Wikipedia) – Repositories (code, document, organization intranets) – RSS feeds, news, blogs etc. with continuous updates – Personal desktops Collections including multiple versions of each document with a different timestamp 208
  • 209. Indexing • Various earlier approaches: – [Anick et al., SIGIR 1992], [Herscovici et al., ECIR 2007] • Focus on recent work from two different perspectives – Indexing schemes concentrated on partitioning-related issues • [Berberich et al., SIGIR 2007; Anand et al., CIKM 2010; Anand et al., SIGIR 2011] – Indexing schemes concentrated on the index size • [He et al., CIKM 2009; He et al., CIKM 2010; He et al., SIGIR 2011] 209
  • 210. Time-travel Queries • "Queries that combine content and temporal predicates" [Berberich et al., SIGIR 2007] • Interesting query types – Point-in-time • "euro 2012 articles" @ 1/July/2012 – Interval • "euro 2012 articles" between 01.06.2012 and 01.07.2012 – Durability in a time interval [Hou U et al., SIGMOD 2010]: • "search engine research papers" that are in the top-10 results for 75% of the time between 2000 and 2012 210
  • 211. Time-point Queries: Indexing • Formal model: – For a document d, each version has begin/end timestamps (validity interval): • For the current version, de = ∞ • For d, all validity times are disjoint t0 t1 t2 t3 ∞ d0 : [t0, t1) d3 : [t3, ∞) [Berberich et al., SIGIR 2007] 211
  • 212. Indexing • Key idea – Keep "validity intervals" within posting lists: v: <d, tf> becomes <d, tf, db, de> Example: tf(v)=1 on [t0, t1), tf(v)=0 on [t1, t2), tf(v)=2 on [t2, t3), tf(v)=2 on [t3, ∞) → v: <d, 1, t0, t1> <d, 2, t2, t3> <d, 2, t3, ∞> • Problem: Index size explodes! – Even for Wikipedia, 9×10⁹ entries [Berberich et al., SIGIR 2007] 212
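With validity intervals stored in the postings, answering a point-in-time query reduces to an interval filter. A minimal sketch (not the authors' implementation), with `math.inf` marking the still-current version:

```python
import math

def point_in_time(postings, t):
    """Keep postings whose validity interval [tb, te) contains the query
    time t. Each posting is (doc, tf, tb, te); te = math.inf marks the
    current version."""
    return [(d, tf) for d, tf, tb, te in postings if tb <= t < te]

# Versions of d: tf=1 on [0, 1), a gap where the term vanished, then
# tf=2 from time 2 onward.
postings = [("d", 1, 0, 1), ("d", 2, 2, 3), ("d", 2, 3, math.inf)]
print(point_in_time(postings, 0))   # [('d', 1)]
print(point_in_time(postings, 1))   # [] -- no valid version at t=1
print(point_in_time(postings, 5))   # [('d', 2)]
```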
  • 213. Remedy: Coalescing • Combine adjacent postings with similar payloads v <d, 1, t0, t1> <d, 2, t2, t3> <d, 2, t3, ∞> v <d, 1, t0, t1> <d, 2, t2, ∞> [Berberich et al., SIGIR 2007] 213
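The coalescing step above can be sketched in a few lines; `eps` is a hypothetical payload tolerance (eps = 0 reproduces the exact merge on the slide, eps > 0 gives the lossy approximate variant):

```python
import math

def coalesce(postings, eps=0.0):
    """Coalesce temporally adjacent postings of one (term, doc) pair
    whose payloads (tf) differ by at most eps. Each posting is
    (doc, tf, t_begin, t_end); math.inf marks the current version.
    Keeping the earlier tf makes the eps > 0 variant lossy."""
    out = []
    for doc, tf, tb, te in sorted(postings, key=lambda p: p[2]):
        if out:
            pdoc, ptf, ptb, pte = out[-1]
            if pdoc == doc and pte == tb and abs(ptf - tf) <= eps:
                out[-1] = (doc, ptf, ptb, te)  # extend validity interval
                continue
        out.append((doc, tf, tb, te))
    return out

postings = [("d", 1, 0, 1), ("d", 2, 2, 3), ("d", 2, 3, math.inf)]
print(coalesce(postings))   # [('d', 1, 0, 1), ('d', 2, 2, inf)]
```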
  • 214. Optimization Problem [Berberich et al., SIGIR 2007] 214
  • 215. ATC • Linear-time approximate algorithm [Berberich et al., SIGIR 2007] 215
  • 216. It works! [Berberich et al., SIGIR 2007] 216
  • 217. Temporal Partitioning (Slicing) • Even after coalescing, wasted I/O for redundant postings that do not overlap with the query's time point v, [t0, t1) <d, 1, t0, t1> <d, 2, t2, ∞> Remark: Validity intervals still reside in the postings! v, [t2, ∞) [Berberich et al., SIGIR 2007] 217
  • 218. Trade-off • Optimal approaches – Space-optimal – Performance-optimal • Trading off space and performance: – Performance-Guarantee approach – Space-Bound approach [Berberich et al., SIGIR 2007] 218
  • 219. Computational Complexity • Performance-Guarantee Approach – DP: time and space complexity O(n²) • Space-Bound Approach – DP solution is prohibitively expensive – Approximate solution using simulated annealing • Time O(n²), Space O(n) • Notice n is the number of all unique time-interval boundaries for a given term's posting list [Berberich et al., SIGIR 2007] 219
  • 220. Time-point Queries: Result • Given a space-bound of 3 (i.e., 3x of the optimal space), close-to-optimal performance is achievable! – (Reminder: Partitioning is not very cheap!) [Berberich et al., SIGIR 2007] 220
  • 221. Time-point Queries: Result • Time-point queries can be handled like this: • Example 221
  • 222. Time-interval Queries? • When more than one partition must be accessed, wasted I/O due to repeated postings! (e.g., 3x more postings in SB!) • Example [Anand et al., CIKM 2010] 222
  • 223. Solution Approaches Partition selection [Anand et al., CIKM 2010] Can we avoid accessing all partitions related to a given query? Document partitioning (sharding) [Anand et al., SIGIR 2011] Can we partition postings in a list in a different way i.e., other than using time information? 223
  • 224. Partition Selection • Problem: Optimize result quality by accessing a subset of "affected partitions" without exceeding a given I/O upper bound Optimization criterion: maximize the fraction of original query results (relative recall). Two types of constraints: a) Size-based: allowed to read up to a fixed number of postings (focus on sequential access) b) Equi-cost: allowed to read up to a fixed number of partitions (focus on random access) [Anand et al., CIKM 2010] 224
  • 225. Assumption • Assume an oracle exists “providing us the cardinalities of the individual partitions as well as the result of their intersection/union” • Later, KMV synopses will be employed as an approximation [Anand et al., CIKM 2010] 225
  • 226. Single-term queries • Size-based partition selection • Equi-cost partition selection • DP solutions exist, but expensive! • Approximation – Reduce the problem to budgeted max coverage [Anand et al., CIKM 2010] 226
  • 227. GreedySelect Algorithm • Cost(P): – Size-based: no. of postings in the partition – Equi-cost: 1 • Benefit(P): – no. of unread postings in P • At each step: – Select the partition with the highest B(P) / C(P) – Update benefits of the unselected partitions Recall: we assume an oracle providing these numbers! [Anand et al., CIKM 2010] 227
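The greedy loop above can be sketched as follows. A simplification under stated assumptions: the partition contents stand in for the oracle's cardinality estimates (the paper later approximates them with KMV synopses), and the loop simply stops at the first best-ratio partition that no longer fits the budget:

```python
def greedy_select(partitions, budget, equi_cost=False):
    """Greedy partition selection sketch: repeatedly pick the partition
    with the best benefit/cost ratio until the budget is exhausted.
    `partitions` maps a partition id to its set of postings; `budget`
    counts postings (size-based) or partitions (equi-cost)."""
    selected, covered = [], set()
    remaining = dict(partitions)
    while remaining:
        def ratio(pid):
            benefit = len(remaining[pid] - covered)   # unread postings
            cost = 1 if equi_cost else len(remaining[pid])
            return benefit / cost
        best = max(remaining, key=ratio)
        cost = 1 if equi_cost else len(remaining[best])
        if cost > budget or ratio(best) == 0:
            break
        budget -= cost
        covered |= remaining.pop(best)
        selected.append(best)
    return selected, covered

parts = {"P1": {1, 2, 3}, "P2": {3, 4}, "P3": {5}}
print(greedy_select(parts, budget=4))   # (['P1', 'P3'], {1, 2, 3, 5})
```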
  • 229. Multi-term Queries • Conjunctive semantics: intersect partitions of the query terms • Optimization objective: increase the coverage of postings that are in this intersection space of partitions (size-based and equi-cost variants, as before) [Anand et al., CIKM 2010] 229
  • 230. Simple Math does the job! • Compute "union of intersections" • x: a tuple holding one partition per query term • x is a tuple from the Cartesian product of the partitions of each query term – X = P_term1 × P_term2 × ... × P_termk, so – x = (P_term1,i, P_term2,j, ..., P_termk,l) (each component chosen from that term's partitions) [Anand et al., CIKM 2010] 230
  • 231. GreedySelect, again! • Apply GreedySelect over X to pick x ∈ X • C(x) – Sum of the sizes of the unselected partitions in x, or the number of unselected partitions in x • B(x) – No. of new documents that appear in the intersection of the partitions in x • Modify the algorithm to update benefits and costs after picking each tuple x. [Anand et al., CIKM 2010] 231
  • 232. Cartesian Product or Join? • Problem: Set X might be very large! • Remedy: Observe that partitions of different terms that have no temporal overlap cannot have any intersecting docs! Apply the algorithm over the t-join set! [Anand et al., CIKM 2010] 232
  • 233. Performance of Partition Selection • Relative recall: 50% of affected partitions might be enough! – 3 datasets; Wikipedia seems harder! Wikipedia UKGOV NY Times [Anand et al., CIKM 2010] 233
  • 234. Performance of Partition Selection • Compare query run times for: – Unpartitioned index, – No partition selection, – Partition selection with I/O bound (all index files are compressed!) [Anand et al., CIKM 2010] 234
  • 235. Real Life: KMV Synopses • Instead of assuming an "oracle" for cardinality estimation, use KMV synopses • A KMV synopsis for a multiset S: – Fix a hash function h – Apply h to each distinct value in S – The k smallest of the hashed values form the KMV [Beyer et al., SIGMOD 2007] Effective sketches for sets supporting arbitrary multiset operations: ∪, ∩, ∖ [Anand et al., CIKM 2010] 235
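A sketch of a KMV synopsis with the basic distinct-value estimator (k − 1)/U(k), where U(k) is the k-th smallest hash. MD5 stands in for the paper's hash family (an assumption; any well-mixing hash works), and the union of two synopses is simply "merge and keep the k smallest":

```python
import hashlib

def h(v):
    # Hash a value into [0, 1).
    digest = hashlib.md5(str(v).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv(values, k):
    # The k smallest hashes of the distinct values form the synopsis.
    return sorted({h(v) for v in values})[:k]

def estimate_distinct(synopsis, k):
    if len(synopsis) < k:        # fewer than k distinct values seen
        return len(synopsis)
    return (k - 1) / synopsis[k - 1]

def kmv_union(s1, s2, k):
    # Synopsis of the union of the underlying sets.
    return sorted(set(s1) | set(s2))[:k]

syn = kmv(range(10000), k=256)
print(round(estimate_distinct(syn, 256)))   # roughly 10000
```

Intersection cardinalities are estimated analogously (via the union synopsis and the Jaccard fraction of shared hashes in it); that part is omitted here.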
  • 236. Partition Selection with KMV • For each partition of each term: – create a flat file storing KMV synopses (5% and 10% samples) [Anand et al., CIKM 2010] 236
  • 237. Partition Selection with KMV • KMV is promising – 5% is enough! Wikipedia UKGOV NY Times [Anand et al., CIKM 2010] 237
  • 238. Document partitioning (Sharding) • Another solution to avoid reading and processing repeated postings with cross-cutting validity intervals in temporal partitioning • Example [Anand et al., SIGIR 2011] 238
  • 239. Sharding • Key idea: – Instead of partitioning temporally, partition postings based on doc-ids – No repetition as in slicing – Example [Anand et al., SIGIR 2011] 239
  • 240. Sharding • Entries in a shard are ordered according to their begin time • For each shard, an auxiliary data structure: – List of pairs: (query begin time, offset in shard) – Maintained at practical granularities of begin times (e.g., days) → can fit in memory! – For a query with a begin time • Seek to the offset position in the shard and then read sequentially [Anand et al., SIGIR 2011] 240
  • 241. Why several shards? • There can still be wasted disk I/O while reading a shard: • The problem is postings with long validity intervals (i.e., subsuming a lot of other postings) Query begin time Earliest posting that includes the query begin time! This causes reading 5 useless postings! [Anand et al., SIGIR 2011] 241
  • 242. Idealized Sharding • For a given posting list L, find a minimal set of shards that satisfy the staircase property (SP): begin(p) ≤ begin(q) → end(p) ≤ end(q) • Greedy solution: – For each posting p in L (in begin-time order) • Add p to an existing shard s if SP is satisfied • Otherwise, start a new shard with p [Anand et al., SIGIR 2011] 242
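The greedy pass can be sketched like this. One simplification to flag: postings go into the *first* shard whose staircase property survives (first-fit), which may not always yield the minimal number of shards the paper aims for:

```python
def shard(postings):
    """Greedy sharding sketch: scan postings in begin-time order and
    place each into the first existing shard where ends stay
    non-decreasing (the staircase property); otherwise open a new
    shard. Postings are (doc, t_begin, t_end)."""
    shards = []
    for p in sorted(postings, key=lambda x: (x[1], x[2])):
        for s in shards:
            if s[-1][2] <= p[2]:   # SP holds: ends remain sorted
                s.append(p)
                break
        else:
            shards.append([p])
    return shards

postings = [("a", 0, 10), ("b", 1, 3), ("c", 2, 5), ("d", 4, 12)]
print(shard(postings))
# [[('a', 0, 10), ('d', 4, 12)], [('b', 1, 3), ('c', 2, 5)]]
```

The long-lived postings ("a", "d") land in one shard, the short-lived ones in another, so a query's begin-time offset never forces reading subsumed postings.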
  • 243. Merging Shards • Idealized sharding can yield several shards – a random access per shard is expensive! • A greedy merging algorithm – Penalty ≤ Cost_random / Cost_sequential – Penalty (pairwise): wasted sequential reads for merging two shards – Merge in ascending choice order and then smallest-size order More on this later! [Anand et al., SIGIR 2011] 243
  • 244. Performance • Sharding improves query processing times w.r.t. temporal partitioning or no partitioning • Merging shards results in shorter QP times • Time measurements are taken on warm caches! [Anand et al., SIGIR 2011] 244
  • 245. A Grand Summary • A full posting list with validity intervals – High sequential access + CPU cost - 1 random access per query term • Temporal partitioning (slicing) – Reduce seq access (but, repetitions) – Reduce CPU cost – ≥1 random accesses per query term (with partition selection) • Document partitioning (sharding) – Reduce seq access (no repetitions + time map) – Reduce CPU cost (staircase property) – ≥1 random accesses per query term (read all shards?!) TRADE-OFF: May increase random accesses to reduce sequential accesses and CPU cost 245
  • 246. An Alternative Perspective • Work from the Polytechnic Institute of NYU • Focus on the index size Approaches up to now consider each version of a document separately: no special attention to the overlap between versions 246
  • 247. Index Compression • Key ideas – Small integers can be represented with smaller codes – Doc ids are not so small: instead, compress the gaps between the ids – Term frequencies are already small – Example 247
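The gap (delta) idea above can be sketched as a reversible transform; the small gaps then feed a variable-length or block codec such as PForDelta (the codec itself is omitted here):

```python
def to_gaps(doc_ids):
    """Delta-encode a sorted posting list: store each doc id as the gap
    to its predecessor, so clustered ids become small integers that
    compress well."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Invert the transform by a running sum."""
    ids, cur = [], 0
    for g in gaps:
        cur += g
        ids.append(cur)
    return ids

print(to_gaps([1000, 1001, 1002, 5000]))   # [1000, 1, 1, 3998]
```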
  • 248. Indexing Versions • Assume validity intervals are stored separately → time constraints handled at post-processing • Versions of di are represented as di,j t0 t1 t2 t3 ∞ tf(v)=1 tf(v)=0 tf(v)=2 tf(v)=2 v <d1, 1, t0, t1> <d1, 2, t2, t3> <d1, 2, t3, ∞> v <d1,1, 1> <d1,2, 2> <d1,3, 2> Question: How can we reduce the storage space for such a representation? [He et al., CIKM 2009] 248
  • 249. Versions Are Similar • There is a high overlap between versions! Snapshot@ May 15 Snapshot@ June 15 Snapshot@ July 15 249
  • 250. How to Exploit? • Simplest idea: – Assign consecutive doc IDs to consecutive versions of the same document – Allows small d-gaps for overlapping terms among the versions <d1000, 1> <d3000, 2> <d6000, 2> russir <d1, 1> <d2, 2> <d3, 2> d1: http://romip.ru/russir2012/, [May 15, June 15] d2: http://romip.ru/russir2012/, [June 15, July 15] d3: http://romip.ru/russir2012/, [July 15, Aug 15] russir Random ids Consecutive ids [He et al., CIKM 2009] 250
  • 251. MSA Approach • Multiple Segment Alignment [Herscovici et al., ECIR 2007] – Given d with some versions, a virtual document Vi,j is all terms occurring in (only) versions i through j – Reduces the number of postings but increases the document space! (theoretically, up to N² virtual documents!) [He et al., CIKM 2009] 251
  • 252. MSA: Example [Herscovici et al., ECIR 2007] 252
  • 253. Indexing the Differences • DIFF approach [Anick et al., SIGIR’92] – For every pair of consecutive versions di and di+1; create a virtual document that is the symmetric difference between these versions. [He et al., CIKM 2009] 253
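A simplified sketch of the DIFF idea: index the first version in full and, for each later version, a virtual document holding the symmetric difference of term sets. The real scheme needs extra bookkeeping to tell added from removed terms when reconstructing versions; that is omitted here:

```python
def diff_documents(versions):
    """Given a document's versions as term sets, return the virtual
    documents to index: the full first version, then for each pair of
    consecutive versions their symmetric difference (terms that
    appeared or disappeared)."""
    virtual = [set(versions[0])]
    for prev, cur in zip(versions, versions[1:]):
        virtual.append(set(prev) ^ set(cur))
    return virtual

v1 = {"web", "search", "index"}
v2 = {"web", "search", "cache"}
print(diff_documents([v1, v2]))
# [{'web', 'search', 'index'}, {'index', 'cache'}]
```

Since consecutive versions overlap heavily, the symmetric differences are tiny, so far fewer postings are created than when indexing every version in full.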
  • 255. Performance • Two index compression schemes – Interpolative coding (IPC) – PForDelta with optimizations (PFD) • Two datasets: Wiki, Ireland • Index sizes for doc ids (Wikipedia / Ireland): – Random-IPC: 4957 / 2908 – Random-PFD: 5499 / 3289 • Query processing times (in-memory): Sorted is the best! [He et al., CIKM 2009] 255
  • 256. Two-level Indexing: Roots • Assume we have clusters of documents: C1, C2, C3 over d1..d4 (figure: posting lists for t1 and t2 with per-document term frequencies) • We have queries restricted to clusters: – {t1, t2} in C3 → intersect lists → d2, d4 → d4 256
  • 257. Cluster-skipping index • Skip redundant postings during QP – Reduce decompression cost • Skipping is widely used for QP in practice (figure: posting lists for t1 and t2 augmented with cluster headers C1, C2, C3 that let QP jump over postings of irrelevant clusters) [Altingovde et al., TOIS 2008] 257
  • 258. Two-level indexing • Apply the same idea for versions: – First-level index: document ids – Second-level index: a bitvector for versions No need to index separate ids for each version → implicit from the bitvector. [He et al., CIKM 2009] 258
  • 259. Two-level indexing • In actual implementation, group blocks of doc ids and bitvectors of versions • Bitvectors best compressed by hierarchical Huffman coding! russir <d1, 1> <d2, 2> <d3, 2> russir <d> 1 1 1 0 0/1: First three versions include “russir” russir <d> 1 2 2 0 TFs: Actual term frequencies [He et al., CIKM 2009] 259
  • 260. Two-level indexing: HUFF • Query Processing: Decompress bitvectors of documents that are in the intersection of query terms russir <d1> 1 2 2 0 <d2> 1 1 2 0 <d4> 1 0 1 0 schedule <d3> 1 2 2 1 <d4> 1 2 2 0 <d5> 1 3 2 1 Final result: d4,1 and d4,3 [He et al., CIKM 2009] 260
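A sketch of that intersection step under stated assumptions: each posting is a (doc_id, version-bitvector) pair with bit i−1 set when version i contains the term (the real index stores these bitvectors Huffman-compressed, which is omitted here):

```python
def match_versions(list_a, list_b):
    """Intersect two two-level posting lists: match on doc ids first,
    then AND the per-document version bitvectors and report the
    surviving version numbers (1-based)."""
    b = dict(list_b)
    results = []
    for doc, bits in list_a:
        common = bits & b.get(doc, 0)
        if common:
            versions = [i + 1 for i in range(common.bit_length())
                        if common >> i & 1]
            results.append((doc, versions))
    return results

# As on the slide: "russir" occurs in versions 1 and 3 of d4,
# "schedule" in versions 1-3 of d4 (and elsewhere).
russir = [("d4", 0b101)]
schedule = [("d4", 0b111), ("d5", 0b1)]
print(match_versions(russir, schedule))   # [('d4', [1, 3])]
```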
  • 261. Two-level indexing • The same idea can be applied to MSA and DIFF – For virtual documents, create bitvectors as before – Compress the first level with IPC, the second level with PFD • Even better compression performance [He et al., CIKM 2010] Index sizes for doc ids: HUFF 213 / 577 261
  • 262. Two-level indexing: QP • Queries without temporal constraints [He et al., SIGIR 2011] 262
  • 263. Two-level indexing: QP • Queries with temporal constraints – Post-processing [He et al., SIGIR 2011] 263
  • 264. Two-level indexing: QP • Queries with temporal constraints – Partitioning, again! [He et al., SIGIR 2011] 264
  • 265. QP Performance [He et al., SIGIR 2011] 265
  • 266. Grand Summary Partition-focused – Time information kept in postings – Redundancy solved by partitioning • Process only relevant partitions – Lossy compression of versions (coalescing) Size-focused – Time information kept separately – Redundancy solved by 2-level compression • Process compact blocks • Hierarchical partitions – Lossless compression of versions 266
  • 267. References • [Altingovde et al., TOIS 2008] Ismail Sengör Altingövde, Engin Demir, Fazli Can, Özgür Ulusoy: Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Trans. Inf. Syst. 26(3): (2008) • [Anand et al., CIKM 2010] Avishek Anand, Srikanta J. Bedathur, Klaus Berberich, Ralf Schenkel: Efficient temporal keyword search over versioned text. CIKM 2010: 699-708 • [Anand et al., SIGIR2011] Avishek Anand, Srikanta J. Bedathur, Klaus Berberich, Ralf Schenkel: Temporal index sharding for space-time efficiency in archive search. SIGIR 2011: 545-554 • [Anick et al., SIGIR 1992] Peter G. Anick, Rex A. Flynn: Versioning a Full-text Information Retrieval System. SIGIR 1992: 98-111 • [Berberich et al., SIGIR 2007] Klaus Berberich, Srikanta J. Bedathur, Thomas Neumann, Gerhard Weikum: A time machine for text search. SIGIR 2007: 519-526 • [Beyer et al., SIGMOD 2007] Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multiset operations. SIGMOD Conference 2007: 199-210 • [He et al., CIKM 2009] Jinru He, Hao Yan, Torsten Suel: Compact full-text indexing of versioned document collections. CIKM 2009: 415-424 • [He et al., CIKM 2010] Jinru He, Junyuan Zeng, Torsten Suel: Improved index compression techniques for versioned document collections. CIKM 2010: 1239-1248 • [He et al., SIGIR 2011] Jinru He, Torsten Suel: Faster temporal range queries over versioned text. SIGIR 2011: 565-574 • [Herscovici et al, ECIR 2007] Michael Herscovici, Ronny Lempel, Sivan Yogev: Efficient Indexing of Versioned Document Sequences. ECIR 2007: 76-87 • [Hou U et al., SIGMOD 2010] Leong Hou U, Nikos Mamoulis, Klaus Berberich, Srikanta J. Bedathur: Durable top-k search in document archives. SIGMOD Conference 2010: 555-566 267
  • 269. Retrieval and Ranking Models (1) Searching the past (2) Searching the future
  • 270. RECAP Two time dimensions 1. Publication or modified time 2. Content or event time 270
  • 271. RECAP Two time dimensions 1. Publication or modified time 2. Content or event time (figure: documents on a plane of content time vs. publication time) 271
  • 272. Searching the past • Historical or temporal information needs – A journalist working the historical story of a particular news article – A Wikipedia contributor finding relevant information that has not been written about yet 272
  • 273. Searching the past • Historical or temporal information needs – A journalist working the historical story of a particular news article – A Wikipedia contributor finding relevant information that has not been written about yet 273
  • 274. Challenge • Time must be explicitly modeled in order to increase the effectiveness of ranking – i.e., to order search results so that the most relevant ones come first • Time uncertainty should be taken into account – Two temporal expressions can refer to the same time period even though they are not written identically – E.g., for the query "Independence Day 2011", a retrieval model relying on term matching only will fail to retrieve documents mentioning "July 4, 2011" 274
  • 275. Query- and document model • A temporal query consists of: – Query keywords – Temporal expressions • A document consists of: – Terms, i.e., bag-of-words – Publication time and temporal expressions [Berberich et al., ECIR 2010] 275
  • 276. Query- and document model • A temporal query consists of: – Query keywords – Temporal expressions • A document consists of: – Terms, i.e., bag-of-words – Publication time and temporal expressions [Berberich et al., ECIR 2010] 276
  • 277. Time-aware ranking models • Follow two main approaches 1. Mixture model [Kanhabua et al., ECDL 2010] • Linearly combining textual and temporal similarity 2. Probabilistic model [Berberich et al., ECIR 2010] • Generating a query from the textual part and temporal part of a document independently 277
  • 278. Mixture model • Linearly combine textual and temporal similarity – α indicates the importance of the similarity scores • Both scores are normalized before combining – Textual similarity can be determined using any term-based retrieval model • E.g., tf.idf or a unigram language model 278
  • 279. Mixture model • Linearly combine textual and temporal similarity – α indicates the importance of the similarity scores • Both scores are normalized before combining – Textual similarity can be determined using any term-based retrieval model • E.g., tf.idf or a unigram language model How to determine temporal similarity? 279
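The mixture score can be sketched as below. Min-max normalization is one plausible choice (the slide only says the scores are normalized before combining), and alpha is the interpolation weight:

```python
def minmax(scores):
    # Map raw scores into [0, 1]; one plausible normalization.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def mixture_scores(text_scores, temporal_scores, alpha=0.5):
    """score(d) = (1 - alpha) * text_sim(d) + alpha * temporal_sim(d),
    computed over normalized per-document score lists."""
    return [(1 - alpha) * ts + alpha * tp
            for ts, tp in zip(minmax(text_scores),
                              minmax(temporal_scores))]

print(mixture_scores([2.0, 8.0], [0.9, 0.1], alpha=0.3))
# ≈ [0.3, 0.7]: the textually weaker but temporally closer document
# loses here because alpha favors the textual component
```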
  • 280. Temporal similarity • Assume that temporal expressions in the query are generated independently from a two-step generative model: – P(tq|td) can be estimated based on publication time using an exponential decay function [Kanhabua et al., ECDL 2010] – Linear interpolation smoothing is applied to eliminate zero probabilities • I.e., an unseen temporal expression tq in d 280
  • 281. Temporal similarity • Assume that temporal expressions in the query are generated independently from a two-step generative model: – P(tq|td) can be estimated based on publication time using an exponential decay function [Kanhabua et al., ECDL 2010] – Linear interpolation smoothing is applied to eliminate zero probabilities • I.e., an unseen temporal expression tq in d (figure: similarity scores of d1 and d2 decaying with time distance) 281
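A sketch of an exponential-decay temporal similarity with linear-interpolation smoothing; the decay rate and smoothing weight are illustrative, not the paper's tuned values:

```python
import math

def temporal_similarity(t_query, t_doc, decay=0.5):
    # Drops exponentially with the distance between the query time and
    # the document time (in whatever unit the timeline uses, e.g. years).
    return math.exp(-decay * abs(t_query - t_doc))

def smoothed(p_seen, p_background, lam=0.9):
    # Linear interpolation smoothing: an unseen temporal expression
    # keeps a small background probability instead of zero.
    return lam * p_seen + (1 - lam) * p_background

print(temporal_similarity(2011, 2011))   # 1.0: exact temporal match
print(temporal_similarity(2011, 2008))   # smaller: 3 units away
print(smoothed(0.0, 0.01))               # ≈ 0.001, never exactly zero
```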
  • 282. Comparison of time-aware ranking • Five time-aware ranking models – LMT [Berberich et al., ECIR 2010] – LMTU [Berberich et al., ECIR 2010] – TS [Kanhabua et al., ECDL 2010] – TSU [Kanhabua et al., ECDL 2010] – FuzzySet [Kalczynski et al., Inf. Process. 2005] [Kanhabua et al., SIGIR 2011] 282
  • 283. Discussion • Experiment – New York Times Annotated Corpus – 40 temporal queries [Berberich et al., ECIR 2010] • Result – TSU significantly outperforms the other methods for most metrics • Conclusions – Although TSU performs best, it applies only to collections with time metadata – LMT and LMTU can be applied to any collection without time metadata, but time extraction is needed [Kanhabua et al., SIGIR 2011] 283
  • 284. Searching the future • People are naturally curious about the future – What will happen to EU economies in next 5 years? – What will be potential effects of climate changes? 284
  • 285. Previous work • Searching the future – Extract temporal expressions from news articles – Retrieve future information using a probabilistic model, i.e., multiplying textual similarity and a time confidence • Supporting analysis of future-related information in news and Web – Extract future mentions from news snippets obtained from search engines – Summarize and aggregate results using clustering methods, but no ranking [Baeza-Yates SIGIR Forum 2005; Jatowt et al., JCDL 2009] 285
  • 287. Yahoo! Time Explorer [Matthews et al., HCIR 2010] 287
  • 288. Ranking news predictions • Over 32% of 2.5M documents from Yahoo! News (July’09 – July’10) contain at least one prediction • Retrieve predictions related to a news story in news archives and rank by relevance [Kanhabua et al., SIGIR 2011a] 288
  • 289. Related news predictions [Kanhabua et al., SIGIR 2011a] 289
  • 290. Related news predictions [Kanhabua et al., SIGIR 2011a] 290
  • 291. Related news predictions [Kanhabua et al., SIGIR 2011a] 291
  • 292. Approach • Four classes of features – Term similarity, entity-based similarity, topic similarity, and temporal similarity • Rank results using a learning-to-rank technique [Kanhabua et al., SIGIR 2011a] 292
  • 293. System architecture • Step 1: Document annotation – Extract temporal expressions using time and event recognition – Normalize them to dates so they can be anchored on a timeline – Output: sentences annotated with named entities and dates, i.e., predictions • Step 2: Retrieving predictions – Automatically generate a query from the news article being read – Retrieve predictions that match the query – Rank predictions by relevance (i.e., a prediction is "relevant" if it is about the topics of the article) [Kanhabua et al., SIGIR 2011a] 293
  • 294. Term similarity • Capture the term similarity between q and p 1. TF-IDF scoring function • Problem: keyword matching, short texts • Predictions may not match the query terms 2. Field-aware ranking function, e.g., BM25F • Search the context of a prediction, i.e., surrounding sentences [Kanhabua et al., SIGIR 2011a] 294
  • 295. Entity-based similarity • Measure the similarity between q and p using annotated entities in dp, p, q – Features commonly employed in entity ranking [Kanhabua et al., SIGIR 2011a] 295
  • 296. Topic similarity • Compute the similarity between q and p on topic – Latent Dirichlet allocation [Blei et al., J. Mach. Learn. 2003] for modeling topics 1. Train a topic model 2. Infer topics 3. Compute topic similarity [Kanhabua et al., SIGIR 2011a] 296
  • 297. Topic similarity • Compute the similarity between q and p on topic – Latent Dirichlet allocation [Blei et al., J. Mach. Learn. 2003] for modeling topics 1. Train a topic model 2. Infer topics 3. Compute topic similarity [Kanhabua et al., SIGIR 2011a] 297
  • 298. Temporal similarity • Hypothesis I. Predictions that are temporally closer to the query are more relevant [Kanhabua et al., SIGIR 2011a] 298
  • 299. Temporal similarity • Hypothesis I. Predictions that are temporally closer to the query are more relevant • Hypothesis II. Predictions extracted from more recent documents are more relevant [Kanhabua et al., SIGIR 2011a] 299
  • 300. Ranking method • Learning-to-rank: Given an unseen pair (q, p), p is ranked using a model trained over a set of labeled query/prediction pairs – SVM-MAP [Yue et al., SIGIR 2007] – RankSVM [Joachims, KDD 2002] – SGD-SVM [Zhang, ICML 2004] – PegasosSVM [Shalev-Shwartz et al., ICML 2007] – PA-Perceptron [Crammer et al., J. Mach. Learn. 2006] [Kanhabua et al., SIGIR 2011a] 300
  • 301. Relevance judgments • 42 future-related topics [Kanhabua et al., SIGIR 2011a] 301
  • 302. Experiments • New York Times Annotated Corpus – 1.8 million articles, over 20 years – More than 25% contain at least one prediction • The annotation process uses several language-processing tools – OpenNLP for tokenizing, sentence splitting, part-of-speech tagging, shallow parsing – SuperSense tagger for named entity recognition – TARSQI for extracting temporal expressions • Apache Lucene for indexing and retrieval – 44,335,519 sentences and 548,491 predictions – 939,455 future dates (avg. future dates per prediction: 1.7) [Kanhabua et al., SIGIR 2011a] 302
  • 303. Discussion • Results – Topic features play an important role in ranking – The five features with the lowest weights are entity-based features • Open issues – Extract predictions from other sources, e.g., Wikipedia, blogs, comments, etc. – Sentiment analysis for future-related information [Kanhabua et al., SIGIR 2011a] 303
  • 304. References • [Baeza-Yates SIGIR Forum 2005] Ricardo A. Baeza-Yates: Searching the future. SIGIR workshop MF/IR 2005 • [Berberich et al., ECIR 2010] Klaus Berberich, Srikanta J. Bedathur, Omar Alonso, Gerhard Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25 • [Blei et al., J. Mach. Learn. 2003] David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022 (2003) • [Crammer et al., J. Mach. Learn. 2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer: Online Passive-Aggressive Algorithms. Journal of Machine Learning Research 7: 551-585 (2006) • [Jatowt et al., JCDL 2009] Adam Jatowt, Kensuke Kanazawa, Satoshi Oyama, Katsumi Tanaka: Supporting analysis of future-related information in news archives and the web. JCDL 2009: 115-124 • [Joachims, KDD 2002] Thorsten Joachims: Optimizing search engines using clickthrough data. KDD 2002: 133-142 • [Kalczynski et al., Inf. Process. 2005] Pawel Jan Kalczynski, Amy Chou: Temporal Document Retrieval Model for business news archives. Inf. Process. Manage. 41(3): 635-650 (2005) • [Kanhabua et al., SIGIR 2011] Nattiya Kanhabua, Kjetil Nørvåg: A comparison of time-aware ranking methods. SIGIR 2011: 1257-1258 • [Kanhabua et al., SIGIR 2011a] Nattiya Kanhabua, Roi Blanco, Michael Matthews: Ranking related news predictions. SIGIR 2011: 755-764 • [Matthews et al., HCIR 2010] Michael Matthews, Pancho Tolchinsky, Roi Blanco, Jordi Atserias, Peter Mika, Hugo Zaragoza: Searching through time in the New York Times. HCIR workshop 2010 • [Shalev-Shwartz et al., ICML 2007] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter: Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127(1): 3-30 (2011) • [Yue et al., SIGIR 2007] Yisong Yue, Thomas Finley, Filip Radlinski, Thorsten Joachims: A support vector method for optimizing average precision. SIGIR 2007: 271-278 • [Zhang, ICML 2004] Tong Zhang: Solving large scale linear prediction problems using stochastic gradient descent algorithms. ICML 2004 304

Editor's notes

  1. Google Insights for Search, you can compare search volume patterns across specific regions, categories, time frames and properties
  2. A few tools or algorithms support Web dynamics
  3. A few tools or algorithms support Web dynamics
  4. A few tools or algorithms support Web dynamics
  5. 1) determine the time of a query that is composed of only keywords, where its relevant documents are associated with particular time periods that are not given by the query; 2) leverage the determined time of queries, the so-called temporal profiles of queries, for re-ranking search results in order to increase retrieval effectiveness