Traditional information retrieval approaches deal with retrieving full-text documents in response to a user's query. However, applications that go beyond the "ten blue links" and use additional information to display and interact with search results are becoming increasingly popular and have been adopted by all major search engines. In addition, recent advances in text extraction allow semantic information to be inferred over particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing can convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
Beyond document retrieval using semantic annotations
1. Beyond document retrieval using semantic annotations
Roi Blanco (roi@yahoo-inc.com)
http://labs.yahoo.com/Yahoo_Labs_Barcelona
2. Yahoo! Research Barcelona
• Established in January 2006
• Led by Ricardo Baeza-Yates
• Research areas
• Web Mining
• Social Media
• Distributed Web retrieval
• Semantic Search
• NLP and Semantics
3. Contributions
• Hugo Zaragoza, Michael Matthews, Jordi Atserias, Roi Blanco, Peter Mika
• Sebastiano Vigna (U. Milan), Paolo Boldi - indexing (MG4J)
• “Every time I fire a linguist my performance goes up…” (Fred Jelinek) - a great strategy until you’ve fired them all… but what then?
4. Agenda
• Search: this was then, this is now
• Natural Language processing and search
• Semantic Search
• Search over annotated documents
• Time Explorer
5. Natural Language Retrieval
• How to exploit the structure and meaning of natural language text to improve search
• Current search engines perform only limited NLP (tokenization, stemming)
• Automated tools exist for deeper analysis
• Applications to diversity-aware search
• Source, Location, Time, Language, Opinion, Ranking…
• Search over semi-structured data, semantic search
• Roll out user experiences that use higher layers of the NLP stack
14. Semantic Search
• What kinds of search and applications exist beyond string matching and returning ten blue links?
• Can we have a better understanding of documents and queries?
• New devices open new possibilities and new experiences
• Is current technology in natural language understanding mature enough?
15. Semantic Search (II)
• Matching the user’s query with the Web’s content at a conceptual level, often with the help of world knowledge
– Natural Language Search
• Exploiting the (implicit) structure and semantics of natural language
• Intersection of IR and NLP
– Semantic Web Search
• Exploiting the (explicit) meaning of data
• Intersection of IR and the Semantic Web
• As a field
– ISWC/ESWC/ASWC, WWW, SIGIR, VLDB, CIKM
– Exploring Semantic Annotations in Information Retrieval (ECIR08, WSDM09)
– Semantic Search Workshop (ESWC08, WWW09, WWW10)
– Future of Web Search: Semantic Search (FoWS09)
16. State of search
• “We are at the beginning of search.” (Marissa Mayer)
• Old battles are won
– Marginal returns on investments in crawling, indexing, ranking
– Large classes of queries solved (e.g. navigational)
– Lots of tempting, but high-hanging fruit
• Currently, the biggest bottlenecks in IR are not computational, but in modeling user cognition
– If only we could find a computationally expensive way to solve the problem…
– In particular, solving queries that require a deeper understanding of the query, the content and/or the world at large
– Corollary: go beyond string matching!
17. Some examples…
• Ambiguous searches
– paris hilton
• Multimedia search
– paris hilton sexy
• Imprecise or overly precise searches
– jim hendler
– pictures of strong adventures people
• Searches for descriptions
– 33 year old computer scientist living in barcelona
– reliable digital camera under 300 dollars
• Searches that require aggregation
– height eiffel tower
– harry potter movie review
– world temperature 2020
18. Is NLU that complex?
“A child of five would understand this. Send someone to fetch a child of five.” (Groucho Marx)
20. Paraphrases
• “This parrot is dead”
• “This parrot has kicked the bucket”
• “This parrot has passed away”
• “This parrot is no more”
• “He’s expired and gone to meet his maker”
• “His metabolic processes are now history”
22. Semantics at every step of the IR process
[Diagram: the IR engine mediates between the user’s query and the Web; semantics can enter at every stage: query interpretation, document processing, indexing, ranking θ(q,d), and result presentation]
23. Understanding Queries
• Query logs are a big source of information & knowledge
– To rank results better (what you click)
– To understand queries better
[Example query reformulations: “Paris” → “Paris Flights”; “Paris” → “Paris Hilton”]
25. NLP for IR
• Full NLU is AI-complete and does not scale to web size (parsing the web is really hard).
• BUT… what about other, shallower NLP techniques?
• Hypotheses/Requirements:
• Linear extraction/parsing time
• Error-prone output (e.g. 60-90% accuracy)
• Highly redundant information
• Explore new ways of browsing
• Support your answers
27. Support your answers
Errors happen: choose the right ones!
• Humans need to “verify” unknown facts
• Multiple sources of evidence
• Common sense vs. contradictions
• Are you sure? Is this spam? Interesting!
• Tolerance to errors greatly increases if users can verify things fast
• Importance of snippets, image search
• Often the context is as important as the fact
• E.g. “S discovered the transistor in X”
• There are different kinds of errors
• Ridiculous results (decrease overall confidence in the system)
• Reasonably wrong results (make us feel good)
29. Annotated documents
Barack Obama visited Tokyo this Monday as part of an extended Asian trip.
He is expected to deliver a speech at the ASEAN conference next Tuesday
20 May 2009
28 May 2009
Barack Obama visited Tokyo this Monday as part of an extended Asian trip.
He is expected to deliver a speech at the ASEAN conference next Tuesday
31. How does it work?
[Architecture diagram: a sentence/document-level inverted index combined with an entity-level forward index; example entities: Monty Python, Flying Circus, John Cleese, Brian]
32. Efficient element retrieval
• Goal
– Given an ad-hoc query, return a list of documents and annotations ranked according to their relevance to the query
• Simple solution
– For each document that matches the query, retrieve its annotations and return the ones with the highest counts
• Problems
– If there are many documents in the result set, this will take too long: too many disk seeks, too much data to search through
– What if counting isn’t the best method for ranking elements?
• Solution
– Special compressed data structures designed specifically for annotation retrieval
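The “simple solution” above can be sketched in a few lines; the `forward_index` dictionary and document ids here are hypothetical toy data, not the actual index structures:

```python
from collections import Counter

def rank_annotations(matching_docs, forward_index, k=10):
    """Naive element retrieval: for every document matching the query,
    pull its annotations from the forward index and rank by raw count."""
    counts = Counter()
    for doc_id in matching_docs:
        counts.update(forward_index[doc_id])
    return counts.most_common(k)

# Toy forward index: document id -> annotations found in that document.
forward_index = {
    1: ["Barack Obama", "Tokyo"],
    2: ["Barack Obama", "ASEAN"],
}
print(rank_annotations([1, 2], forward_index, k=1))  # [('Barack Obama', 2)]
```

This is exactly the approach the slide flags as too slow at scale, since it touches every matching document; the compressed structures below exist to avoid that cost.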
33. Forward Index
• Access metadata and document contents
– Length, terms, annotations
• Compressed (in-memory) forward indexes
– Gamma, Delta, Nibble, Zeta codes (power laws)
• Retrieving and scoring annotations
– Sort terms by frequency
• Random access using an extra compressed pointer list (Elias-Fano)
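As a concrete example of the codes listed above, here is a minimal Elias gamma encoder/decoder (a sketch of the idea, not the production implementation):

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code: a unary prefix giving the bit length, followed by
    the number in binary. Small values get short codes, which suits the
    power-law distributions typical of term frequencies and gaps."""
    assert n >= 1
    b = bin(n)[2:]                    # e.g. 9 -> '1001'
    return "0" * (len(b) - 1) + b     # 9 -> '0001001'

def gamma_decode(bits: str) -> int:
    zeros = bits.index("1")           # length of the unary prefix
    return int(bits[zeros:2 * zeros + 1], 2)

for n in (1, 2, 9, 100):
    assert gamma_decode(gamma_encode(n)) == n
print(gamma_encode(9))  # '0001001'
```

Delta, Nibble, and Zeta codes follow the same pattern with different trade-offs between prefix cost and payload cost.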
34. Parallel Indexes
• Standard index contains only tokens
• Parallel indices contain annotations on the tokens; the annotation indices must be aligned with the main token index
• For example, given the sentence “New York has great pizza”, where New York has been annotated as a LOCATION:
– The token index has five entries (“new”, “york”, “has”, “great”, “pizza”)
– The annotation index has five entries (“LOC”, “LOC”, “O”, “O”, “O”)
– Can optionally encode BIO format (e.g. LOC-B, LOC-I)
• To search for the New York location entity, we search for: “token:New ^ entity:LOC token:York ^ entity:LOC”
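The alignment between the two indices can be illustrated with hypothetical in-memory structures (a real implementation would intersect positional posting lists instead of scanning):

```python
# Parallel sequences: the i-th annotation labels the i-th token.
doc_tokens = {1: ["new", "york", "has", "great", "pizza"]}
doc_annots = {1: ["LOC", "LOC", "O", "O", "O"]}

def annotated_phrase_positions(doc_id, phrase, label):
    """Positions where `phrase` occurs with every token tagged `label`,
    i.e. the query  token:new ^ entity:LOC token:york ^ entity:LOC."""
    toks, anns = doc_tokens[doc_id], doc_annots[doc_id]
    n = len(phrase)
    return [i for i in range(len(toks) - n + 1)
            if toks[i:i + n] == phrase
            and all(a == label for a in anns[i:i + n])]

print(annotated_phrase_positions(1, ["new", "york"], "LOC"))  # [0]
```

Because both sequences share positions, conjoining a token constraint with an annotation constraint is just a positional AND.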
35. Parallel Indices (II)
Doc #3: The last time Peter exercised was in the XXth century.
Doc #5: Hope claims that in 1994 she run to Peter Town.
Token postings:
Peter → D3:1, D5:9
Town → D5:10
Hope → D5:1
1994 → D5:5
…
Annotation postings:
WSJ:PERSON → D3:1, D5:1
WSJ:CITY → D5:9
WNS:V_DATE → D5:5
Possible queries:
“Peter AND run”
“Peter AND WNS:N_DATE”
“(WSJ:CITY ^ *) AND run”
“(WSJ:PERSON ^ Hope) AND run”
(Bracketing can also be dealt with.)
38. Time(ly) opportunities
Can we create new user experiences based on a deeper analysis and exploration of the time dimension?
• Goals:
– Build an application that helps users to explore, interact with, and ultimately understand existing information about the past and the future.
– Help the user cope with information overload and eventually find/learn about what she’s looking for.
39. Original Idea
• R. Baeza-Yates, Searching the Future, MF/IR 2005
– On December 1st 2003, Google News contained more than 100K references to 2004 and beyond.
– E.g. for 2034:
• The ownership of Dolphin Square in London must revert to an insurance company.
• Voyager 2 should run out of fuel.
• Long-term care facilities may have to house 2.1 million people in the USA.
• A human base on the Moon would be in operation.
40. Time Explorer
• Public demo since August 2010
• Winner of the HCIR NYT Challenge
• Goal: explore news through time and into the future
• Uses a customized Web crawl from news and blog feeds
• http://fbmya01.barcelonamedia.org:8080/future/
42. Time Explorer - Motivation
Time is important to search
Recency, particularly in news, is highly related to relevance
But what about evolution over time?
How has a topic evolved over time?
How did the entities (people, places, etc.) evolve with respect to the topic over time?
How will this topic continue to evolve in the future?
How do bias and sentiment in blogs and news change over time?
Google Trends, Yahoo! Clues, RecordedFuture, …
A great research playground
Open source!
44. Analysis Pipeline
Tokenization, Sentence Splitting, Part-of-speech
tagging, chunking with OpenNLP
Entity extraction with SuperSense tagger
Time expressions extracted with TimeML
Explicit dates (August 23rd, 2008)
Relative dates (Next year, resolved with Pub Date)
Sentiment Analysis with LivingKnowledge
Ontology matching with Yago
Image Analysis – sentiment and face detection
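Resolving a relative date against the publication date, as the pipeline above does via TimeML, can be illustrated with a toy resolver (the expression set and rules here are hypothetical, not the actual system's):

```python
from datetime import date, timedelta

def resolve_relative(expr: str, pub_date: date) -> date:
    """Toy resolution of a few relative time expressions
    against the article's publication date."""
    expr = expr.lower()
    if expr == "next year":
        return date(pub_date.year + 1, pub_date.month, pub_date.day)
    if expr == "tomorrow":
        return pub_date + timedelta(days=1)
    if expr == "next tuesday":
        days_ahead = (1 - pub_date.weekday()) % 7 or 7  # Tuesday = 1
        return pub_date + timedelta(days=days_ahead)
    raise ValueError(f"unsupported expression: {expr!r}")

# Article published Wednesday 20 May 2009:
print(resolve_relative("next tuesday", date(2009, 5, 20)))  # 2009-05-26
```

The resolved date is what gets indexed as the content date, so a document written in 2009 can still be retrieved for a query about 2010.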
45. Indexing/Search
• Lucene/Solr search platform to index and search
– Sentence level
– Document level
• Facets for annotations (multiple fields for faster entity-type access)
• Index publication date and content date (extracted dates if they exist, otherwise the publication date)
• Solr faceting allows aggregation over query entity ranking and aggregating counts over time
• Content date enables search into the future
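For illustration, a request of the kind described above might combine a text query with a date-range facet over the content date. The field names (`entity_person`, `content_date`) are hypothetical, while `facet`, `facet.field`, and the `facet.range.*` parameters are standard Solr query parameters:

```python
from urllib.parse import urlencode

# Facet counts per year over the extracted content date let the UI
# draw a timeline for the query, including future dates.
params = {
    "q": 'text:"global warming"',
    "rows": 10,
    "facet": "true",
    "facet.field": "entity_person",        # hypothetical annotation field
    "facet.range": "content_date",         # hypothetical date field
    "facet.range.start": "1990-01-01T00:00:00Z",
    "facet.range.end": "2030-01-01T00:00:00Z",
    "facet.range.gap": "+1YEAR",
}
query_string = urlencode(params)
print(query_string)
```

The range end lying in the future is what makes “search into the future” work: documents whose extracted content date is ahead of today still fall inside the faceted timeline.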
54. Other challenges
– Large-scale processing
• Distributed computing
• Shift from batch (Hadoop) to online (S4)
– Efficient extraction/retrieval, algorithms/data structures
• Critical for interactive exploration
– Connection with the user experience
• Measures! User engagement?
– Personalization
– Integration with Knowledge Bases (Semantic Web)
– Multilingual support