Opinion-based Article Ranking for Information Retrieval Systems: Factoids and Facts

How Search Engines Leverage
Opinion-based Articles for Ranking
Rethinking Search: Corroboration of Web Answers
Koray Tuğberk GÜBÜR

Components for Re-ranking based on Opinionated Factoids
• Uncertain Inference
• Knowledge Base
• Corroboration of Web Answers
• Embarrassment Factor
• Open Information Extraction
• External Databases
• Evidence Aggregation
• Information Literacy
• Semantic Role Labeling
• Truth Ranges

Uncertain Inference
• Uncertain Inference was introduced by C. J. van
Rijsbergen of the University of Glasgow.
• Focuses on “Query Inference” with
“Context Understanding”.
• Query Path, and Query Context (Context-
Sensitive Search Elements) are used.
• The query is processed with
probabilities for Question Generation.
• It requires a “Knowledge Base” for
understanding the Factual Needs behind the query.
• “Uncertain facts” have a plausibility
threshold that allows “Opinions” to appear in
the results.
• Extract word sequences in News Titles.
How do Search Engines know facts?
Andrew Hogue
The Structured Search Engine

Uncertain Inference
Andrew Hogue – The Structured Search Engine
• Query Processing and Parsing is
another topic.
• But, to separate “wrong” from
“true” facts, a high level of
confidence and coverage is
needed.
• The Uncertain Inference follows
users’ behaviors in “Adaptive
Search”, or sometimes, it uses
“word-sequences” in a mega corpus.
• Extract Entity-Attribute Pairs and
their synonyms from News Articles.

Knowledge Base
• Different from a Knowledge Graph.
• Stores facts, or factual values for the
same entity-attribute pairs, and
triples.
• It is dynamic.
• A fact from today might be
inaccurate information tomorrow.
• Procedural Part of Knowledge Bases
helps to update the connections
between components.
• Understand which facts are
approved by the search engine.
Browsable Fact Repository
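
The "dynamic" nature of a knowledge base can be pictured with a small, dated fact store. This is only a sketch under stated assumptions: the class name `FactStore`, its methods, and the sample entity "X" are hypothetical and do not describe how any search engine stores facts.

```python
from collections import defaultdict
from datetime import date


class FactStore:
    """Toy knowledge base: (entity, attribute) -> list of dated values."""

    def __init__(self):
        self._facts = defaultdict(list)

    def assert_fact(self, entity, attribute, value, as_of):
        # Keep every assertion, so an older value is never silently lost.
        self._facts[(entity, attribute)].append((as_of, value))

    def current(self, entity, attribute):
        # The most recently dated value is treated as the current fact.
        history = self._facts.get((entity, attribute))
        if not history:
            return None
        return max(history, key=lambda item: item[0])[1]

    def history(self, entity, attribute):
        # All values for the pair, oldest first.
        return sorted(self._facts.get((entity, attribute), []))


kb = FactStore()
kb.assert_fact("X", "lives in", "P", date(2020, 1, 1))
kb.assert_fact("X", "lives in", "Q", date(2023, 6, 1))
print(kb.current("X", "lives in"))
```

Keeping the full history, rather than overwriting, is a design choice made here so that "a fact from today" can be compared with what was stated earlier.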

Corroboration of Web Answers
• One of the 10 best “Opinion Papers” in
Information Retrieval.
• Directly connected to the concept of
“Helpful Content”, or “Information
Responsiveness”.
• “Even when main web sources have
contradicting information for the same
question, which one is the fact?”
• Corroboration of Web Answers focuses on
“Truth Ranges”, and “Answer
Prominence” to choose answers from
certain sources.
• Create your own truth range by
auditing ranking resources.

• Minji Wu and Amélie Marian focus on
numeric values and measurement units to
find real authorities.
• PageRank, Source Authority, First
Answer, Closeness to First Answer and
De-duplication are used to determine a
“Fact Range”, or “Truth Range”.
• The “Truth Range” changes from today
to tomorrow according to the ranking
sources.
• Use numeric values, metrics, dates, and
measurement units to have higher
precision.
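
The signals listed above (source authority, answer prominence, de-duplication) can be combined in many ways. The sketch below is one simple illustration and is not the formula from the paper: the function name `corroborate`, the field names, the `authority / position` weighting and the sample data are all assumptions.

```python
from collections import defaultdict


def corroborate(extractions):
    """Score candidate answers from several pages.

    Each extraction is a dict with hypothetical keys:
      answer       - the extracted answer string
      authority    - source authority in [0, 1] (e.g. a PageRank-like score)
      position     - 1-based position of the answer within its page
      content_hash - fingerprint of the page, used for de-duplication
    """
    seen_pages = set()
    scores = defaultdict(float)
    for item in extractions:
        if item["content_hash"] in seen_pages:
            continue  # a copied page must not vote twice
        seen_pages.add(item["content_hash"])
        prominence = 1.0 / item["position"]  # earlier answers weigh more
        scores[item["answer"]] += item["authority"] * prominence
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()} if total else {}


pages = [
    {"answer": "24", "authority": 0.9, "position": 1, "content_hash": "a"},
    {"answer": "24", "authority": 0.5, "position": 2, "content_hash": "b"},
    {"answer": "30", "authority": 0.4, "position": 1, "content_hash": "c"},
    {"answer": "30", "authority": 0.4, "position": 1, "content_hash": "c"},
]
ranking = corroborate(pages)
print(max(ranking, key=ranking.get))
```

The last two entries share a fingerprint, so the scraped copy is ignored and the answer "30" is counted once.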

• Google cited the research paper
“Corroborating Answers from
Multiple Web Sources” more than
40 times in the “Candidate Answer
Passage” patent series.
• It has been used in Featured Snippets
(Web Answers) since 2018.
• This brings us to “Embarrassment
Factor”.
• Use “safe” and “indirect” answers
for conflicted issues.

Embarrassment Factor
• What is the Embarrassment Factor?
• Can a search engine feel shame?
• Can you make a search engine feel shame
with your bad answer, or opinion?
• What happens if a featured snippet says
“Barack Obama is a communist”, “Global
Warming is a hoax”, or “Vaccines are for
controlling your brain”?
• Let’s remember, “Truth Ranges”.
• Do not play with the patience of search
engine engineers. Do not take advantage
of fundamental NLP understanding.

Truth Ranges
• Fuzzy Logic is used.
• Not every wrong is equal.
• Some facts are more facts.
• Some opinions are accepted as consensus.
• Upper and Bottom Limits are used to
determine “safe opinions”.
• Google created “Content Advisories” to
support “Information Consensus”.
• Stay in the consensus (reports with
descriptive news), unless it is “satiric”
(critiques with questions).
• Use “question-format” as a shield against
algorithms, if you are outside of truth
ranges.
Which one is more factual?
Source: Wesley Chai
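
The idea that "not every wrong is equal" can be illustrated with a triangular fuzzy membership function. This is a minimal sketch, not a description of any search engine's implementation: the function name `truth_degree`, the `peak` parameter (the most accepted value) and the example numbers are assumptions made for illustration.

```python
def truth_degree(value, lower, peak, upper):
    """Return a degree of truth in [0, 1] for a claimed numeric value.

    lower and upper are the bottom and upper limits of the truth range;
    peak is the value treated as most accepted (hypothetical input).
    """
    if not lower < peak < upper:
        raise ValueError("expected lower < peak < upper")
    if value <= lower or value >= upper:
        return 0.0  # outside the range: not a "safe" claim
    if value <= peak:
        return (value - lower) / (peak - lower)
    return (upper - value) / (upper - peak)


# Hypothetical range for a reported quantity: 10 to 30, most accepted 20.
for claim in (20, 15, 29, 45):
    print(claim, truth_degree(claim, 10, 20, 30))
```

A claim of 15 and a claim of 29 are both off the peak, but they receive different degrees of truth, while 45 falls outside the limits entirely.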

Truth Ranges
• There are two different approaches
in Linguistics for a “truth”, or “fact”.
• Words like “will”, “can”, “might”, and
“may” decrease the certainty.
• Numeric Ranges, or Sentiment
Magnitude and Direction are used.
• The middle of the range is called
“Fixpoint”.
• The answers that are outside of
the range are filtered out.
• Find the balance between
“precision” and “coverage” in news
titles, and intros.
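
A numeric range with a fixpoint in the middle, and a filter for answers outside it, can be sketched as follows. Using the mean as the fixpoint and the standard deviation as the half-width of the range is an assumption made for this example; the names `truth_range`, `filter_answers` and `width` are hypothetical.

```python
from statistics import mean, pstdev


def truth_range(values, width=1.0):
    """Return (lower limit, fixpoint, upper limit) for reported values."""
    fixpoint = mean(values)   # middle of the range
    spread = pstdev(values)   # how much the sources disagree
    return fixpoint - width * spread, fixpoint, fixpoint + width * spread


def filter_answers(values, width=1.0):
    """Keep only the answers that fall inside the truth range."""
    low, _, high = truth_range(values, width)
    return [v for v in values if low <= v <= high]


# Hypothetical casualty counts reported by five sources.
reported = [20, 21, 22, 23, 60]
print(truth_range(reported))
print(filter_answers(reported))
```

A smaller `width` raises precision and lowers coverage, which mirrors the balance mentioned in the last bullet above.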

Truth Ranges
• According to Fuzzy Logic:
• 1 > 5 and 1 > 10 are not equally wrong.
• One of them is more wrong than the other.
• For “Disagreeing Views”,
“Corroboration” happens with
inference.
• “Barack Obama was born in Hawaii”,
• “Barack Obama was born in Kenya”.
• A search engine might see “Barack
Obama is a US Citizen” as a safe answer to
give to avoid embarrassment.
• Use the absolute truths to project
a safe answer rather than giving a
possibly wrong factoid.
Journalists share organization’s trustworthiness
Source: Indiatimes
Source: Making Better Informed Trust Decisions with Generalized Fact-Finding

Truth Ranges
• Uncertainty is used as a measurement to filter
factoids.
• Phrases like “I am sure”, or “45% possibility” create
uncertainty.
• Intrinsic ambiguities decrease trust in the
source.
• “Who claims what” is the key point for fact-finding
algorithms.
• Source Reliability, and “Variance” and “Mean”
values are used for “fixpoints”.
• Do not use “I am sure”, “Pretty sure”, “I think…”,
“In my opinion…”, “It might”, or “It may”. State whether
the “bomb exploded”, or not. State “how many
people died”; do not write “With 45% possibility,
over 20 people…”
• Compare your numbers, names, dates and places
for an event with those of your competitors.
“Safe Answers” are better.
Source: Making Better Informed Trust Decisions with Generalized Fact-Finding
CIUV: Collaborating Information Against Unreliable Views
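
An editorial check for the uncertainty phrases listed on this slide can be sketched with regular expressions. The phrase list comes from the slide; the function name `find_hedges` and the percentage pattern are assumptions, and a real fact-finding system would rely on much richer linguistic analysis.

```python
import re

# Phrases taken from the slide above.
HEDGE_PHRASES = (
    "i am sure", "pretty sure", "i think",
    "in my opinion", "it might", "it may",
)
HEDGE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in HEDGE_PHRASES) + r")\b",
    re.IGNORECASE,
)
# Matches "45%" as well as "%45".
PERCENT_PATTERN = re.compile(r"%\s?\d+|\d+\s?%")


def find_hedges(sentence):
    """Return the uncertainty markers found in a sentence."""
    hits = [m.group(0).lower() for m in HEDGE_PATTERN.finditer(sentence)]
    hits += PERCENT_PATTERN.findall(sentence)
    return hits


print(find_hedges("I think it might rain."))
print(find_hedges("With 45% possibility, over 20 people died."))
print(find_hedges("The bomb exploded at 09:40."))
```

An empty result does not prove that a sentence is factual; it only shows that none of the listed markers appear.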

Truth Ranges: Why do we need PageRank?
• Speed.
• Google and other search engines do not have time
to process text of the documents.
• News SEO has to prioritize “indexing”.
• A News Search Engine has to serve everything in
the fastest way.
• Processing the text, checking accuracy is not
possible in seconds, minutes, or hours and days,
when a source publishes 100,000 words a day.
• Thus, Truth Ranges is a “long-term ranking factor”
for news sources.
• Google gets angry when I give PageRank related
suggestions.
• Understand that, some sources are prioritized,
even if they scrape and use your original news
story.
Groundedness - Unanimity
Source: Towards an axiomatic approach to truth discovery

Truth Ranges: Why do we need PageRank?
We guess that this news is quality…
Source: Corroborating Information from Disagreeing Views

Information Extraction (OIE)
An example of OIE
• Open Information Extraction was developed
by WAVII.
• WAVII was bought by Google for $30
million.
• It is used to expand Google’s
Knowledge Graph.
• OIE extracts triples and recognizes
minor entities to structure a semantic
network.
• Extract “predicates” from news
articles. Create tuples from
“predicates, nouns, and subjects”.
• Understand which fact, or factoid is
given first, or later.
Open Information Extraction Example from the researchers.
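
To make the idea of a triple concrete, here is a deliberately naive, pattern-based extractor. It is a toy under stated assumptions: real OIE systems rely on part-of-speech tagging and learned patterns rather than a fixed predicate list, and the names `PREDICATES` and `extract_triple` are hypothetical.

```python
import re

# A tiny, hand-picked predicate list (an assumption for this example).
PREDICATES = ("acquired", "bought", "signed", "lives in", "died in")
TRIPLE_PATTERN = re.compile(
    r"^(?P<subject>.+?)\s+(?P<predicate>"
    + "|".join(PREDICATES)
    + r")\s+(?P<object>.+?)\.?$"
)


def extract_triple(sentence):
    """Return (subject, predicate, object) or None if no pattern matches."""
    match = TRIPLE_PATTERN.match(sentence.strip())
    if not match:
        return None
    return (
        match.group("subject"),
        match.group("predicate"),
        match.group("object"),
    )


print(extract_triple("Google acquired Wavii."))
print(extract_triple("X lives in P"))
print(extract_triple("It rained."))
```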

Information Extraction (OIE): Rel-grams
Precision / Coverage
• Open Information Extraction
extracts opinions, and facts about
certain concepts, and named entities.
• It uses “tuples” as “predicate” and
“noun”.
• Aggregates occurrences, standardizing
the masked sections by comparing the
different OIE iterations.
• Match “prepositions” to
“interrogative” terms.
• Use “uncertain inference” to extract
interrogative terms.

Information Extraction (OIE): Rel-grams
Word Connections and Sense Disambiguation
• OIE is used by Google to recognize and
understand micro entities, and knowledge on
the web.
• OIE is helpful for processing the text in the
news sources to understand latest changes in
real-world, and reflect it on the knowledge
base.
• Open Information Extraction is different from
Information Retrieval.
• The opinions and facts of web sources are
compared to each other to find the ones with
higher groundedness.
• Update outdated facts on your website. An “X
lives in P” declaration might be wrong if “X”
is not alive anymore. How many entities with a
“died in” attribute still “live” in your internal
knowledge base?

External Databases (Data Commons)
Structuring the Web
• Data Commons is an aggregation of
unified databases for nearly every
topic, industry, geography and
entity.
• It is a common fact repository that
is open to the whole web.
• It is supported by Ramanathan V.
Guha.
• It focuses on statistical data.
• Query external databases for
“statistics” to create statistic-rich
news articles.

• Google integrated Data
Commons Project to its own
algorithms.
• The announcement was made by
Prabhakar Raghavan.
• It helps to understand the accuracy
and authority of an information
source.
• A trustworthy news article
propagates its trust to the next news
article.

“As we may think”
“Google is planned to be the third half of your brain”
- Sergey Brin
“Google is designed as a Star Trek Computer to
answer your needs.
It is not created for websites, it is created for users.”
- Larry Page
“They already hate Google, so what is the down-
side?”
- Craig Nevill-Manning

Semantic Role Labeling
Which news source reflected emotions?
• Word order changes, but the sentence’s
meaning stays the same.
• Same opinion can be expressed in
many different ways.
• XYZ corporation bought the stock.
• They sold the stock to XYZ corporation.
• The stock was bought by XYZ corporation.
• The purchase of the stock by XYZ
corporation ...
• The stock purchase by XYZ corporation ...
• OIE provides an aggregation for
tuples, and relational n-grams to
extract factual propositions.
• Semantic Role Labels help with
standardization based on
“predicates”.
• Match “emotions” to “causes” with
shorter declarations, stay away from
“nested declarations”.
Semantic Role Labeling as Dependency Parsing: Exploring
Latent Tree Structures Inside Arguments
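
The standardization described on this slide can be illustrated with two of the example sentences: an active and a passive variant should yield the same agent and theme. This is a toy sketch with hand-written patterns for a single predicate; actual semantic role labelers are trained models, and the name `label_roles` is hypothetical.

```python
import re

# Hand-written patterns for one predicate only (an assumption).
PASSIVE = re.compile(r"^(?P<theme>.+?) was bought by (?P<agent>.+?)\.?$")
ACTIVE = re.compile(r"^(?P<agent>.+?) bought (?P<theme>.+?)\.?$")


def label_roles(sentence):
    """Map a sentence to predicate, agent and theme, or None."""
    # The passive pattern is tried first because it is more specific.
    for pattern in (PASSIVE, ACTIVE):
        match = pattern.match(sentence.strip())
        if match:
            return {
                "predicate": "buy",
                # Lower-cased so that surface variants compare equal.
                "agent": match.group("agent").lower(),
                "theme": match.group("theme").lower(),
            }
    return None


active = label_roles("XYZ corporation bought the stock.")
passive = label_roles("The stock was bought by XYZ corporation.")
print(active)
print(active == passive)
```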

Agent – Predicate - Theme
• Predicates can take multiple
arguments.
• Semantic role labels are descriptions of
the semantic relation between the
predicate and its arguments.
• Semantic Roles are abstract
representations of the role that an
argument plays in the event described
by the predicate.
• Semantic Role Labeling assigns roles to
the constituents of a sentence.
• Selectional restrictions allow
words to place constraints on
the semantic properties of their arguments.
• Understand “patterns of human
mind”. Reflect these patterns in news
articles, according to “macro-
context”.

Predicate is context.
• Let’s say the phrase “George Bush” appeared
500,000 times in the News Titles.
• Google has to categorize them according to the
news contexts.
• “Context-based Person Search” is used for this
task.
• But, News Search Engines have to be fast.
• There is no time for processing the text.
• But, “SRL” is a quick process.
• Check Semantic Role Label of Entity, is it agent? Or, is it
theme?
• Which instrument is used?
• Which goal is mentioned?
• Which propositional structure is used?
• For the sentence “George Bush signed a military
operation”, the “Relational Grams”, “Aggregated
Tuples”, and “Semantic Role Labels” help a
search engine to differentiate entities and contexts from
each other.
• “Grouping entities” is not enough. Group
“contexts”. “X and Love Life”, “X and Career”
have different contexts. Connections should
follow “identity” and context together. Analyze
“News Context”, more than “Entity” that
appears.

How do opinions differ in phrases?
• Beyond Classification:
• It helps to see the factual information.
• It is used to differentiate opinions from
each other.
• It measures the possibility of truth.
• It understands the representation of the
web source according to its connection to
others.
• Semantic Role Labeling is used by
semantic search engines to have
better entity associations.
• The suggested associations, or
graphs are accepted or rejected by
semantic network constructors.
• “Names in the News Title”
should match the Faces in the
News Image.
Source: Marina Santini, Brighton University
Source: Grounded Semantic Role Labeling

Question-Answer Pairs
Which evidence is correct?
• Question Generation and Answer
Pairing are NLP tasks for fact
extraction.
• Question generation involves query
parsing and processing.
• Answer pairing involves dense-context
retrieval and question-answer format
matching.
• But, it is not clear which answer is
more accurate.
• Thus, Question-Answer Coverage,
Entity-oriented search and Semantic-
Syntactic Parsing are used.
• Matching entities, attributes,
queries, or phrases is not good
enough, as long as the information is
not responsive.
Source: Evidence Aggregation for Answer Re-Ranking in Open-Domain Question Answering

Information Literacy - Consensus
Who said it?
• Google started to provide education on
Information Literacy.
• It involves recognizing the information source
before the information on the source.
• Google ranks News Sources for certain
topics, contexts and entities before ranking
the news.
• The need for “fast indexing and serving” will
always be more important than
understanding the “truth” at the first stage.
• Thus, the quality news sources have higher
accuracy with more historical data, and
PageRank.
• Google has to assume that truth comes
from strength of repeated evidence from
the most authoritative sources.
• Audit the “About the source” panels of your
competitors, and run a review and third-
party mention gap analysis.

Author Authority?
https://searchengineland.com/what-social-signals-do-google-bing-really-count-55389
• Danny Sullivan once asked Google
and Bing whether they use social
signals, or author names to
understand who is the real expert on
a topic.
• Both of the search engines said that
they audit “author quality” and
“author expertise” for different
topics.
• Associate authoritative authors
more strongly with your web source, if
they are writing for multiple web
sources.

How do they use Knowledge Base?
Integrating Knowledge Graph and Natural Text for Language
Model Pre-training
• There are hundreds of different
algorithms to understand the
authenticity and “true facts”.
• For a search engine engineer,
there is no “lie” and “fact”.
• There are only “true facts” and “wrong
facts”.
• KELM-like algorithms work
together to differentiate them
from each other.
• Query “Google Knowledge Graph
API” to understand what they
state for the same entity.
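
A lookup against the Google Knowledge Graph Search API can be sketched with the standard library only. The endpoint and the `itemListElement` / `result` / `resultScore` response fields follow the public API; the helper names `build_url`, `summarize` and `lookup`, the placeholder key and the sample response are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_url(query, api_key, limit=3):
    """Build the request URL for an entity search."""
    params = {"query": query, "key": api_key, "limit": limit}
    return ENDPOINT + "?" + urlencode(params)


def summarize(response):
    """Reduce a response to (name, description, score) rows."""
    rows = []
    for element in response.get("itemListElement", []):
        result = element.get("result", {})
        rows.append(
            (result.get("name"), result.get("description"),
             element.get("resultScore"))
        )
    return rows


def lookup(query, api_key, limit=3):
    """Perform the HTTP request (needs a valid API key and network access)."""
    with urlopen(build_url(query, api_key, limit)) as handle:
        return summarize(json.load(handle))


# Offline example with a hand-written response of the expected shape.
sample = {"itemListElement": [
    {"result": {"name": "Barack Obama", "description": "Example description"},
     "resultScore": 100.0},
]}
print(build_url("Barack Obama", "YOUR_API_KEY"))
print(summarize(sample))
```

Comparing the returned name and description with what your own articles state about the same entity is one way to spot disagreements.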

Information Literacy - Cues
What makes you trustworthy?
• The research that Google cites
mentioned that there are “6 Cues for
snap judgments about whom to trust”.
• These involve “images”, “brands”,
“headlines – tonality”, “social cues”,
“sponsors”, and “interactivity”.
• Google works with MediaWise to
perform surveys and integrate the findings
into its own algorithms.
• Create your own “audit templates” for
news articles for these 6 different
verticals. Mark up “MediaWise”.

Information Literacy – About this source
Why does your opinion matter?
• The story of “Web Answers” is
too long.
• Context-terms, Topical Entries,
Candidate Answer Passages,
Context-scoring for Candidate
Answer Passages, and many
more concepts…
• Google Product Manager calls
these “word callouts”.
• Search Engine Engineers call
them “representative answer”.
• Learn NLP.
Scoring Candidate Answer Passages

Some Google Designs
Machine learning to identify opinions in documents
• Identifying opinionated portions in documents
• Relating opinionated portions inside the document
and/or across other documents (e.g., that relate to
the same story)
• To surface opinionated snippets or quotes to users
of a news aggregation.
• To identify portions of a document that convey
opinion.
•Google might rank a source for “report”, but not
for “opinion”. Understand which vertical has a
higher chance for your web source.

Some Google Designs
System and method for supporting editorial opinion in the
ranking of search results
“Editorial opinion” without “distorting facts” helps you rank.
Especially for “first-person” experience stories, or reviews.

Some Google Designs
Embedded communication of link information
“Information in the improved link tags may allow one or more publishers of content and/or
documents to convey opinions about content and/or documents at one or more content
locations and/or one or more document locations. The link tags may also allow one or more
publishers to convey a weighting of the relative importance of one or more content locations
and/or one or more document locations. In some embodiment, at least a portion of the
information in the improved link tags may be encrypted, to allow one or more publishers to
restrict the audience that may view the information in the link tags….. The improved link tags may
allow the publishers to communicate additional information, such as opinions, about the content locations
and/or document locations.”
Categorize boilerplate/main content links according to their context.
“Joe Biden and Congress” might have a different “block-link” than “Joe Biden and Elections”.

Some Google Designs
Aspect-Based Sentiment Summarization
Use “key-points” with “sentiments” to summarize essence of news stories.

Topicality and Context Filters
Long- and Short-Term Solutions for SERP Construction in News Vertical
Short-term Solutions for News Search Engines:
• Classify authoritative sources (PageRank,
Article Count, Unique Sentence Count,
Publication Frequency, Length, Citations,
Search Behaviors).
• Rank authoritative sources for different
topics.
• Classify and rank news web pages according
to their context, and topicality.
• Serve the most relevant news articles based
on trust and confidence.
Long-term Solutions for News Search Engines:
• Process text.
• Understand facts.
• Audit accuracy and comprehensiveness.
• Filter the sources, by re-assigning topical
relevance and authority.

Samples from News SEO with Factoids
NaturalNews

Powerofpositivity

RealClearPolitics

BREITBART

What would you do if you were Google?
Which opinions should rank?
