Relevance, Authority & Topic Modeling in Search


BrightEdge


Slide from LIS 544 / IMT 542 / INSC 544 by Jeff Huang (lazyjeff@uw.edu) and Shawn
Walker (stw3@uw.edu)
The document in which query terms make up the highest proportion of the text is treated as
the most relevant:
• Documents containing more occurrences of the query term(s) score higher
• Longer documents are discounted
• Rare terms are weighted higher
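The three rules above resemble TF-IDF weighting with length normalization. The slide does not name a formula, so the following is only a minimal sketch under that assumption; the function name `score`, the whitespace tokenizer, and the `log(N/df) + 1` weighting are illustrative choices, not the method any particular search engine uses.

```python
import math
from collections import Counter

def score(query, docs):
    """Return one relevance score per document for the query (TF-IDF style sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # term absent from this document
            # Proportion of the document made up of the term:
            # more occurrences raise it, a longer document lowers it.
            tf = counts[term] / len(tokens)
            # Terms found in few documents get a larger weight.
            idf = math.log(n_docs / df[term]) + 1.0
            total += tf * idf
        scores.append(total)
    return scores
```

Called with a query string and a list of document strings, `score` returns a list of floats in the same order as the documents; sorting by that value gives a simple ranking.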

Hilltop was one of the first algorithms to introduce the concept of machine-mediated “authority” to
combat human manipulation of results for commercial gain (using link blast services or viral
distribution of misleading links). It is used by all of the search engines in some way, shape or
form.
Hilltop:
• Is performed on a small subset of the corpus that best represents the nature of the whole
• Defines authorities as pages that have many unaffiliated expert documents on the same subject
pointing to them
• Ranks pages according to the number of non-affiliated “experts” that point to them, i.e. experts
not in the same site or directory
• Treats affiliation as transitive [if A is affiliated with B and B with C, then A is affiliated with C]
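The counting rule in the list above can be sketched as follows. This is a simplified illustration, not the published Hilltop algorithm: it assumes the expert pages relevant to the query have already been selected, treats "same host" as the only automatic affiliation signal, and omits any weighting of expert matches. The names `Affiliations`, `host_of`, `authority_scores`, and the `affiliated_hosts` parameter are hypothetical.

```python
from urllib.parse import urlparse

class Affiliations:
    """Union-find over hosts, so that affiliation is transitive."""

    def __init__(self):
        self.parent = {}

    def find(self, host):
        self.parent.setdefault(host, host)
        while self.parent[host] != host:
            # Path halving keeps later lookups short.
            self.parent[host] = self.parent[self.parent[host]]
            host = self.parent[host]
        return host

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def host_of(url):
    return urlparse(url).netloc.lower()

def authority_scores(expert_links, affiliated_hosts=()):
    """Count, per target page, the distinct non-affiliated expert groups linking to it.

    expert_links: dict mapping an expert page URL to the set of URLs it links to.
    affiliated_hosts: pairs of hosts known to be affiliated (assumed input).
    """
    groups = Affiliations()
    for a, b in affiliated_hosts:
        groups.union(a, b)
    supporters = {}
    for expert, targets in expert_links.items():
        expert_group = groups.find(host_of(expert))
        for target in targets:
            if groups.find(host_of(target)) == expert_group:
                continue  # an expert affiliated with the target does not count
            # Affiliated experts share a group, so together they count once.
            supporters.setdefault(target, set()).add(expert_group)
    return {target: len(g) for target, g in supporters.items()}
```

Using a union-find structure is one straightforward way to honor transitivity: declaring A affiliated with B and B with C places all three in one group, so their links to the same target count as a single vote.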
The beauty of Hilltop is that, unlike PageRank, it is query-specific and reinforces the
relationship between the authority and the user’s query. You don’t have to be big or have a
thousand links from auto parts sites to be an “authority.” Google’s 2003 Florida update,
rumored to contain Hilltop reasoning, caused many sites with extraneous links to fall from
their previously lofty placements.
Google artificially inflates the placement of results from Wikipedia because it perceives
Wikipedia as an authoritative resource due to social mediation and commercial agnosticism.
Wikipedia is not infallible. However, someone finding it among the “most relevant” top results will
certainly see it as such.

Computes PageRank (PR) based on a set of representative topics [augments PR with content analysis]
Topics are derived from the Open Directory Project (ODP)
Uses a set of ranking vectors: pre-query selection of topics + at-query comparison of the
similarity of the query to those topics
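The two phases described above (topic vectors prepared before the query, then blended at query time) can be sketched with a topic-biased PageRank. This is a minimal illustration under several assumptions: every link target also appears as a key in `links`, each topic's seed pages stand in for the directory listings, and the query-to-topic similarities in `topic_weights` are supplied by some other component. All function and parameter names are hypothetical.

```python
def personalized_pagerank(links, teleport, damping=0.85, iterations=50):
    """Power iteration in which random jumps land only on the teleport pages.

    links: dict page -> list of pages it links to (every target must be a key).
    teleport: dict page -> jump probability, summing to 1.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) * teleport.get(p, 0.0) for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for out in outs:
                    new[out] += share
            else:
                # A page without outlinks hands its rank to the teleport pages.
                for p in pages:
                    new[p] += damping * rank[page] * teleport.get(p, 0.0)
        rank = new
    return rank

def topic_vectors(links, topic_pages):
    """Pre-query step: one ranking vector per topic, biased toward its seed pages."""
    vectors = {}
    for topic, seeds in topic_pages.items():
        teleport = {p: 1.0 / len(seeds) for p in seeds}
        vectors[topic] = personalized_pagerank(links, teleport)
    return vectors

def query_time_rank(vectors, topic_weights):
    """At-query step: blend the topic vectors by query-to-topic similarity."""
    pages = next(iter(vectors.values()))
    return {
        p: sum(w * vectors[t][p] for t, w in topic_weights.items())
        for p in pages
    }
```

The expensive power iteration runs once per topic ahead of time; at query time only a weighted sum over the stored vectors is needed, which is what makes the approach practical for interactive search.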

Pew Internet Trust Study of Search engine behavior
http://www.pewinternet.org/Reports/2012/Search-Engine-Use-2012/Summary-of-findings.aspx
Moreover, users report generally good outcomes and relatively high confidence in the capabilities of
search engines:
• 91% of search engine users say they always or most of the time find the information they are
seeking when they use search engines
• 73% of search engine users say that most or all the information they find as they use search
engines is accurate and trustworthy
• 66% of search engine users say search engines are a fair and unbiased source of information
• 55% of search engine users say that, in their experience, the quality of search results is getting
better over time, while just 4% say it has gotten worse
• 52% of search engine users say search engine results have gotten more relevant and useful over
time, while just 7% report that results have gotten less relevant
Using the Internet: Skill Related Problems in User Online Behavior; van Deursen & van Dijk; 2009
• 56% constructed poor queries
• 55% selected irrelevant results 1 or more times
• 38% were overwhelmed by the amount of information in results
• 34% found critical information missing from results
