Traditional information retrieval approaches deal with retrieving full-text documents in response to a user's query. However, applications that go beyond the "ten blue links" and use additional information to display and interact with search results are becoming increasingly popular and have been adopted by all major search engines. In addition, recent advances in text extraction allow semantic information to be inferred over particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing can convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
Beyond document retrieval using semantic annotations
1. Beyond document retrieval using semantic annotations
Roi Blanco (roi@yahoo-inc.com)
http://labs.yahoo.com/Yahoo_Labs_Barcelona
2. Yahoo! Research Barcelona
• Established in January 2006
• Led by Ricardo Baeza-Yates
• Research areas
• Web Mining
• Social Media
• Distributed Web retrieval
• Semantic Search
• NLP and Semantics
3. Contributions
• Hugo Zaragoza, Michael Matthews, Jordi Atserias, Roi Blanco, Peter Mika
• Sebastiano Vigna (U. Milan), Paolo Boldi - indexing (MG4J)
• “Every time I fire a linguist my performance goes up…” (Fred Jelinek) - a great strategy until you’ve fired them all… but what then?
4. Agenda
• Search: this was then, this is now
• Natural Language processing and search
• Semantic Search
• Search over annotated documents
• Time Explorer
5. Natural Language Retrieval
• How to exploit the structure and meaning of natural language text to improve search
• Current search engines perform only limited NLP (tokenization, stemming)
• Automated tools exist for deeper analysis
• Applications to diversity-aware search
• Source, Location, Time, Language, Opinion, Ranking…
• Search over semi-structured data, semantic search
• Roll out user experiences that use higher layers of the NLP stack
14. Semantic Search
• What kinds of search and applications exist beyond string matching and returning ten blue links?
• Can we have a better understanding of documents and queries?
• New devices open new possibilities and new experiences
• Is current technology in natural language understanding mature enough?
15. Semantic Search (II)
• Matching the user’s query with the Web’s content at a conceptual level, often with the help of world knowledge
– Natural Language Search
• Exploiting the (implicit) structure and semantics of natural language
• Intersection of IR and NLP
– Semantic Web Search
• Exploiting the (explicit) meaning of data
• Intersection of IR and the Semantic Web
• As a field
– ISWC/ESWC/ASWC, WWW, SIGIR, VLDB, CIKM
– Exploring Semantic Annotations in Information Retrieval (ECIR08, WSDM09)
– Semantic Search Workshop (ESWC08, WWW09, WWW10)
– Future of Web Search: Semantic Search (FoWS09)
16. State of search
• “We are at the beginning of search.” (Marissa Mayer)
• Old battles are won
– Marginal returns on investments in crawling, indexing, ranking
– Large classes of queries solved (e.g. navigational)
– Lots of tempting, but high-hanging fruit
• Currently, the biggest bottlenecks in IR are not computational, but in modeling user cognition
– If only we could find a computationally expensive way to solve the problem…
– In particular, solving queries that require a deeper understanding of the query, the content and/or the world at large
– Corollary: go beyond string matching!
17. Some examples…
• Ambiguous searches
– paris hilton
• Multimedia search
– paris hilton sexy
• Imprecise or overly precise searches
– jim hendler
– pictures of strong adventures people
• Searches for descriptions
– 33 year old computer scientist living in barcelona
– reliable digital camera under 300 dollars
• Searches that require aggregation
– height eiffel tower
– harry potter movie review
– world temperature 2020
18. Is NLU that complex?
“A child of five would understand this. Send someone to fetch a child of five.” (Groucho Marx)
20. Paraphrases
• “This parrot is dead”
• “This parrot has kicked the bucket”
• “This parrot has passed away”
• “This parrot is no more”
• “He’s expired and gone to meet his maker”
• “His metabolic processes are now history”
22. Semantics at every step of the IR process
[Diagram: the IR engine mediates between the user’s query and the Web; semantics can enter at every stage: query interpretation, document processing, indexing, ranking θ(q,d), and result presentation]
23. Understanding Queries
• Query logs are a big source of information & knowledge
– To rank results better (what you click)
– To understand queries better
[Example query reformulations: “Paris” → “Paris Flights”; “Paris” → “Paris Hilton”]
25. NLP for IR
• Full NLU is AI-complete and does not scale to web size (parsing the web is really hard).
• BUT… what about other, shallower NLP techniques?
• Hypotheses/Requirements:
• Linear extraction/parsing time
• Error-prone output (e.g. 60-90% accuracy)
• Highly redundant information
• Explore new ways of browsing
• Support your answers
27. Support your answers
Errors happen: choose the right ones!
• Humans need to “verify” unknown facts
• Multiple sources of evidence
• Common sense vs. contradictions
• Are you sure? Is this spam? Interesting!
• Tolerance to errors greatly increases if users can verify things fast
• Importance of snippets, image search
• Often the context is as important as the fact
• E.g. “S discovered the transistor in X”
• There are different kinds of errors
• Ridiculous results (decrease overall confidence in the system)
• Reasonably wrong results (make us feel good)
29. Annotated documents
Barack Obama visited Tokyo this Monday as part of an extended Asian trip.
He is expected to deliver a speech at the ASEAN conference next Tuesday
20 May 2009
28 May 2009
Barack Obama visited Tokyo this Monday as part of an extended Asian trip.
He is expected to deliver a speech at the ASEAN conference next Tuesday
31. How does it work?
[Architecture diagram: a sentence/document-level inverted index combined with an entity-level forward index; example entities: Monty Python, Flying Circus, John Cleese, Brian]
32. Efficient element retrieval
• Goal
– Given an ad-hoc query, return a list of documents and annotations ranked according to their relevance to the query
• Simple solution
– For each document that matches the query, retrieve its annotations and return the ones with the highest counts
• Problems
– If there are many documents in the result set, this will take too long: too many disk seeks, too much data to search through
– What if counting isn’t the best method for ranking elements?
• Solution
– Special compressed data structures designed specifically for annotation retrieval
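The “simple solution” above can be sketched in a few lines; the `forward_index` dictionary and document ids here are hypothetical toy data, not the actual index structures:

```python
from collections import Counter

def rank_annotations(matching_docs, forward_index, k=10):
    """Naive element retrieval: for every document matching the query,
    pull its annotations from the forward index and rank by raw count."""
    counts = Counter()
    for doc_id in matching_docs:
        counts.update(forward_index[doc_id])
    return counts.most_common(k)

# Toy forward index: document id -> annotations found in that document.
forward_index = {
    1: ["Barack Obama", "Tokyo"],
    2: ["Barack Obama", "ASEAN"],
}
print(rank_annotations([1, 2], forward_index, k=1))  # [('Barack Obama', 2)]
```

This is exactly the approach the slide flags as too slow at scale, since it touches every matching document; the compressed structures below exist to avoid that cost.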
33. Forward Index
• Access metadata and document contents
– Length, terms, annotations
• Compressed (in-memory) forward indexes
– Gamma, Delta, Nibble, Zeta codes (power laws)
• Retrieving and scoring annotations
– Sort terms by frequency
• Random access using an extra compressed pointer list (Elias-Fano)
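As a concrete example of the codes listed above, here is a minimal Elias gamma encoder/decoder (a sketch of the idea, not the production implementation):

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code: a unary prefix giving the bit length, followed by
    the number in binary. Small values get short codes, which suits the
    power-law distributions typical of term frequencies and gaps."""
    assert n >= 1
    b = bin(n)[2:]                    # e.g. 9 -> '1001'
    return "0" * (len(b) - 1) + b     # 9 -> '0001001'

def gamma_decode(bits: str) -> int:
    zeros = bits.index("1")           # length of the unary prefix
    return int(bits[zeros:2 * zeros + 1], 2)

for n in (1, 2, 9, 100):
    assert gamma_decode(gamma_encode(n)) == n
print(gamma_encode(9))  # '0001001'
```

Delta, Nibble, and Zeta codes follow the same pattern with different trade-offs between prefix cost and payload cost.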
34. Parallel Indexes
• Standard index contains only tokens
• Parallel indices contain annotations on the tokens; the annotation indices must be aligned with the main token index
• For example, given the sentence “New York has great pizza”, where New York has been annotated as a LOCATION:
– The token index has five entries (“new”, “york”, “has”, “great”, “pizza”)
– The annotation index has five entries (“LOC”, “LOC”, “O”, “O”, “O”)
– Can optionally encode BIO format (e.g. LOC-B, LOC-I)
• To search for the New York location entity, we search for: “token:New ^ entity:LOC token:York ^ entity:LOC”
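The alignment between the two indices can be illustrated with hypothetical in-memory structures (a real implementation would intersect positional posting lists instead of scanning):

```python
# Parallel sequences: the i-th annotation labels the i-th token.
doc_tokens = {1: ["new", "york", "has", "great", "pizza"]}
doc_annots = {1: ["LOC", "LOC", "O", "O", "O"]}

def annotated_phrase_positions(doc_id, phrase, label):
    """Positions where `phrase` occurs with every token tagged `label`,
    i.e. the query  token:new ^ entity:LOC token:york ^ entity:LOC."""
    toks, anns = doc_tokens[doc_id], doc_annots[doc_id]
    n = len(phrase)
    return [i for i in range(len(toks) - n + 1)
            if toks[i:i + n] == phrase
            and all(a == label for a in anns[i:i + n])]

print(annotated_phrase_positions(1, ["new", "york"], "LOC"))  # [0]
```

Because both sequences share positions, conjoining a token constraint with an annotation constraint is just a positional AND.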
35. Parallel Indices (II)
Doc #3: The last time Peter exercised was in the XXth century.
Doc #5: Hope claims that in 1994 she run to Peter Town.
Token postings:
Peter → D3:1, D5:9
Town → D5:10
Hope → D5:1
1994 → D5:5
…
Annotation postings:
WSJ:PERSON → D3:1, D5:1
WSJ:CITY → D5:9
WNS:V_DATE → D5:5
Possible queries:
“Peter AND run”
“Peter AND WNS:N_DATE”
“(WSJ:CITY ^ *) AND run”
“(WSJ:PERSON ^ Hope) AND run”
(Bracketing can also be dealt with.)
38. Time(ly) opportunities
Can we create new user experiences based on a deeper analysis and exploration of the time dimension?
• Goals:
– Build an application that helps users to explore, interact with, and ultimately understand existing information about the past and the future.
– Help the user cope with information overload and eventually find/learn about what she’s looking for.
39. Original Idea
• R. Baeza-Yates, Searching the Future, MF/IR 2005
– On December 1st 2003, Google News contained more than 100K references to 2004 and beyond.
– E.g. for 2034:
• The ownership of Dolphin Square in London must revert to an insurance company.
• Voyager 2 should run out of fuel.
• Long-term care facilities may have to house 2.1 million people in the USA.
• A human base on the Moon would be in operation.
40. Time Explorer
• Public demo since August 2010
• Winner of the HCIR NYT Challenge
• Goal: explore news through time and into the future
• Uses a customized Web crawl from news and blog feeds
• http://fbmya01.barcelonamedia.org:8080/future/
42. Time Explorer - Motivation
Time is important to search
Recency, particularly in news, is highly related to relevance
But what about evolution over time?
How has a topic evolved over time?
How did the entities (people, places, etc.) evolve with respect to the topic over time?
How will this topic continue to evolve in the future?
How do bias and sentiment in blogs and news change over time?
Google Trends, Yahoo! Clues, RecordedFuture, …
A great research playground
Open source!
44. Analysis Pipeline
Tokenization, Sentence Splitting, Part-of-speech
tagging, chunking with OpenNLP
Entity extraction with SuperSense tagger
Time expressions extracted with TimeML
Explicit dates (August 23rd, 2008)
Relative dates (Next year, resolved with Pub Date)
Sentiment Analysis with LivingKnowledge
Ontology matching with Yago
Image Analysis – sentiment and face detection
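Resolving a relative date against the publication date, as the pipeline above does via TimeML, can be illustrated with a toy resolver (the expression set and rules here are hypothetical, not the actual system's):

```python
from datetime import date, timedelta

def resolve_relative(expr: str, pub_date: date) -> date:
    """Toy resolution of a few relative time expressions
    against the article's publication date."""
    expr = expr.lower()
    if expr == "next year":
        return date(pub_date.year + 1, pub_date.month, pub_date.day)
    if expr == "tomorrow":
        return pub_date + timedelta(days=1)
    if expr == "next tuesday":
        days_ahead = (1 - pub_date.weekday()) % 7 or 7  # Tuesday = 1
        return pub_date + timedelta(days=days_ahead)
    raise ValueError(f"unsupported expression: {expr!r}")

# Article published Wednesday 20 May 2009:
print(resolve_relative("next tuesday", date(2009, 5, 20)))  # 2009-05-26
```

The resolved date is what gets indexed as the content date, so a document written in 2009 can still be retrieved for a query about 2010.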
45. Indexing/Search
• Lucene/Solr search platform to index and search
– Sentence level
– Document level
• Facets for annotations (multiple fields for faster entity-type access)
• Index publication date and content date (extracted dates if they exist, otherwise the publication date)
• Solr faceting allows aggregation over query entity ranking and aggregating counts over time
• Content date enables search into the future
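For illustration, a request of the kind described above might combine a text query with a date-range facet over the content date. The field names (`entity_person`, `content_date`) are hypothetical, while `facet`, `facet.field`, and the `facet.range.*` parameters are standard Solr query parameters:

```python
from urllib.parse import urlencode

# Facet counts per year over the extracted content date let the UI
# draw a timeline for the query, including future dates.
params = {
    "q": 'text:"global warming"',
    "rows": 10,
    "facet": "true",
    "facet.field": "entity_person",        # hypothetical annotation field
    "facet.range": "content_date",         # hypothetical date field
    "facet.range.start": "1990-01-01T00:00:00Z",
    "facet.range.end": "2030-01-01T00:00:00Z",
    "facet.range.gap": "+1YEAR",
}
query_string = urlencode(params)
print(query_string)
```

The range end lying in the future is what makes “search into the future” work: documents whose extracted content date is ahead of today still fall inside the faceted timeline.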
54. Other challenges
– Large-scale processing
• Distributed computing
• Shift from batch (Hadoop) to online (S4)
– Efficient extraction/retrieval, algorithms/data structures
• Critical for interactive exploration
– Connection with the user experience
• Measures! User engagement?
– Personalization
– Integration with Knowledge Bases (Semantic Web)
– Multilingual support