Phrase Based Indexing
           By
      Bala Abirami
•   Introduction to Phrase Based Indexing
•   What is Phrase Based Indexing?
•   Background of the Invention
•   Summary of the Invention
•   Spam Detection
Introduction
• An information retrieval system uses phrases to
  index, retrieve, organize and describe
  documents.
• It was a patent application filed in the US by
  Google engineer Anna Lynn Patterson.
• Application filed: July, 2004
• Published: January, 2006
Background of Invention
• Information retrieval systems, generally called
  search engines, are now an essential tool for
  finding information in large scale, diverse, and
  growing corpuses such as the Internet.

• A document is retrieved in response to a query
  containing a number of query terms, typically
  based on having some number of query terms
  present in the document.

• The retrieved documents are then ranked
  according to other statistical measures, such as
  frequency of occurrence of the query terms, host
  domain, link analysis, and the like.
Cont…
• Concepts are often expressed in phrases, such
  as "Australian Shepherd," "President of the
  United States," or "Sundance Film Festival".
• Accordingly, there is a need for an information
  retrieval system and methodology that can
  identify phrases, index documents according to
  phrases, and search and rank documents in
  accordance with their phrases.
Summary
  An information retrieval system and
  methodology uses phrases to index, search,
  rank, and describe documents in the document
  collection.

1. Identifying Phrases and Related Phrases
2. Indexing Documents w.r.t Phrases
3. Ranking Documents w.r.t Phrases
4. Creating description for the document
5. Elimination of Duplicate Documents
Identifying Phrases and Related
               Phrases
• Phrases are identified based on their ability to predict the
  presence of other phrases in a document.
• The system looks for phrases that have
  frequent and/or distinguished/unique
  usage.
• A prediction measure is used for identifying related
  phrases.
• The prediction measure relates the actual co-occurrence
  rate of two phrases to the expected co-occurrence
  rate of the two phrases.
• Information gain = actual co-occurrence rate /
  expected co-occurrence rate
Cont…
• Two Phrases are related to each other
  when the prediction measure exceeds the
  prediction threshold.
• Example:
  The phrase “President of the United States”
  predicts related phrases such as “White House”
  and “George Bush”.
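
The prediction measure can be illustrated with a minimal sketch. It assumes the expected co-occurrence rate is the product of the two phrases' individual document rates (an independence assumption) and that co-occurrence is counted per document. The function name, the sample documents, and the threshold value are hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, threshold):
    """Return {(phrase_a, phrase_b): gain} for pairs whose gain exceeds threshold.

    docs is a list of sets, one set of phrases per document.
    Assumption: expected rate = rate(a) * rate(b), i.e. independence.
    """
    n = len(docs)
    phrase_count = Counter()
    pair_count = Counter()
    for phrases in docs:
        phrase_count.update(phrases)
        # Sort so each unordered pair is always counted under the same key.
        pair_count.update(combinations(sorted(phrases), 2))

    related = {}
    for (a, b), together in pair_count.items():
        actual = together / n
        expected = (phrase_count[a] / n) * (phrase_count[b] / n)
        gain = actual / expected
        if gain > threshold:
            related[(a, b)] = gain
    return related


docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"australian shepherd"},
    {"australian shepherd", "white house"},
]
# Hypothetical threshold chosen only for this toy data.
print(related_phrases(docs, threshold=1.2))
```

With this toy data, only the "president of the united states" / "white house" pair exceeds the threshold, because the two phrases co-occur more often than their individual rates would predict.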
Indexing documents based on
           related Phrases
• An information retrieval system indexes
  documents in the document collection by the
  valid or good phrases.
• Posting List = documents that contain the
  phrase
• Second List = used to store data indicating
  which of the related phrases of the given phrase
  are also present in each document containing
  the given phrase
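
The two lists can be sketched as plain dictionaries. This is an illustrative layout only; the function and variable names are hypothetical, and the related-phrase table is assumed to have been computed beforehand.

```python
from collections import defaultdict


def build_index(docs, related):
    """Build a posting list and a related-phrase list for each phrase.

    docs:    {doc_id: set of phrases found in that document}
    related: {phrase: set of phrases related to it} (assumed precomputed)
    """
    posting = defaultdict(list)          # phrase -> doc ids containing it
    related_present = defaultdict(dict)  # phrase -> {doc_id: related phrases in doc}
    for doc_id in sorted(docs):
        phrases = docs[doc_id]
        for phrase in phrases:
            posting[phrase].append(doc_id)
            # Record which related phrases also occur in this document.
            related_present[phrase][doc_id] = phrases & related.get(phrase, set())
    return posting, related_present


docs = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
}
related = {"president of the united states": {"white house", "george bush"}}
posting, related_present = build_index(docs, related)
print(posting["president of the united states"])
print(related_present["president of the united states"])
```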
Ranking

•   Ranking documents is based on two factors:
      1. Ranking documents based on contained
    phrases
      2. Ranking documents based on anchor
    phrases
•   Document Score = Body Hit Score + Anchor Hit
    Score
•   For example: Body Hit Score = 0.30, Anchor
    Hit Score = 0.70
•   Document Score = 0.30 + 0.70 = 1.00
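
As a small worked sketch of the score formula, the snippet below sums the two hit scores and orders documents by the result. The document ids and the second document's scores are made up for illustration; how each hit score is computed is not covered here.

```python
def rank(docs):
    """Rank documents by Document Score = Body Hit Score + Anchor Hit Score.

    docs: {doc_id: (body_hit_score, anchor_hit_score)}
    Returns a list of (doc_id, document_score), highest score first.
    """
    scored = [(doc_id, body + anchor) for doc_id, (body, anchor) in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# "doc_a" uses the scores from the slide; "doc_b" is a hypothetical comparison.
for doc_id, score in rank({"doc_b": (0.20, 0.10), "doc_a": (0.30, 0.70)}):
    print(doc_id, round(score, 2))
```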
Phrase Extension
• The information retrieval system is also adapted
  to use the phrases when searching for
  documents in response to a query.
• A user may enter an incomplete phrase in a
  search query, such as "President of the".
  Incomplete phrases such as these may be
  identified and replaced by a phrase extension,
  such as "President of the United States."
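
One simple way to picture phrase extension is a prefix lookup over known phrases. The sketch below assumes a table of phrase frequencies and picks the most frequent phrase that starts with the incomplete one; the selection rule, the function name, and the counts are hypothetical.

```python
def extend_phrase(incomplete, phrase_counts):
    """Return the most frequent known phrase that extends the incomplete phrase.

    phrase_counts: {phrase: frequency} (hypothetical statistics).
    Returns None when no known phrase starts with the incomplete phrase.
    """
    prefix = incomplete.lower().strip() + " "
    candidates = [p for p in phrase_counts if p.lower().startswith(prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda p: phrase_counts[p])


phrase_counts = {
    "President of the United States": 120,
    "President of the Senate": 15,
    "Sundance Film Festival": 40,
}
print(extend_phrase("President of the", phrase_counts))
```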
Descriptions for Documents
• Phrase information is used to create a description
  of a document.
• The system identifies query phrases, related phrases,
  and phrase extensions in each sentence and keeps
  a count for each sentence.
• It ranks the sentences based on the count.
• It selects some number of top-ranking sentences
  as the description and includes it in the search
  results.
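
The counting and ranking steps can be sketched as follows. The sentence splitter is a naive regular expression and the phrase list is supplied by the caller; both are simplifying assumptions, and the function name is hypothetical.

```python
import re


def describe(text, phrases, top_n=2):
    """Pick the top_n sentences containing the most phrase occurrences.

    phrases: query phrases, related phrases, and phrase extensions to count.
    Ties keep their original document order because sorted() is stable.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def phrase_count(sentence):
        lowered = sentence.lower()
        return sum(lowered.count(p.lower()) for p in phrases)

    ranked = sorted(sentences, key=phrase_count, reverse=True)
    return ranked[:top_n]


text = (
    "The President of the United States lives in the White House. "
    "The weather was mild. "
    "The White House is in Washington."
)
print(describe(text, ["president of the united states", "white house"]))
```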
Eliminating Duplicate documents
• Duplicate documents are identified and eliminated while
  crawling a document or when processing the search
  query.
• The description is stored in association with every
  document in a hash table.
• The system compares the description of a newly crawled page
  with the values stored in the hash table. If it finds a
  match, the current document is a
  duplicate.
• The system keeps the one with the higher page rank
  or greater document significance and removes the duplicate
  document, which will not appear in future search results for
  any query.
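
A minimal sketch of this duplicate check, assuming each document already has a description and a single numeric significance value. SHA-256 is used here only as a convenient hash; the slides do not name a hash function, and the tuple layout is hypothetical.

```python
import hashlib


def deduplicate(docs):
    """Keep one document per distinct description, preferring higher significance.

    docs: iterable of (doc_id, description, significance) tuples.
    Returns the sorted ids of the documents that are kept.
    """
    kept = {}  # description hash -> (doc_id, significance)
    for doc_id, description, significance in docs:
        key = hashlib.sha256(description.encode("utf-8")).hexdigest()
        # A matching key means the description was seen before: a duplicate.
        if key not in kept or significance > kept[key][1]:
            kept[key] = (doc_id, significance)
    return sorted(doc_id for doc_id, _ in kept.values())


docs = [
    ("a", "phrase based indexing overview", 0.4),
    ("b", "phrase based indexing overview", 0.9),
    ("c", "an unrelated description", 0.1),
]
print(deduplicate(docs))
```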
Functions of Indexing system

• Identifies phrases in documents
• Indexes documents according to the
  phrases by accessing various websites.

Functions of Front End Server

• Receives queries from a user
• Provides those queries to the search system
Functions of Searching System

• Searches for documents relevant to the
  search query
• Identifies the phrases in the search query
• Ranks the documents

Functions of Presentation system

• Modifying the search results, including
  removal of duplicate content.
• Generating topical descriptions of
  documents and provides modified
Spam Detection
• “Spam” pages have little meaningful content,
  but may instead be made up of large
  collections of popular words and phrases.
  These are sometimes referred to as “keyword
  stuffed pages”.

• Pages containing specific words and phrases
  that advertisers might be interested in are
  often called “honeypots,” and are created for
  search engines to display along with paid
  advertisements.
Cont…
• A phrase based indexing system knows the
  number of related phrases in a document.

• A normal, non-spam document will generally
  have a relatively limited number of related
  phrases, typically on the order of between 8 and
  20, depending on the document collection.

• A spam document will have an excessive
  number of related phrases, for example on the
  order of between 100 and 1000 related phrases.
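
The check described above can be sketched as a count against a cutoff. The default cutoff of 100 is taken from the lower end of the spam range quoted above and is an assumption; the function names and the sample data are hypothetical.

```python
def count_related_phrases(doc_phrases, related):
    """Count phrases in the document that are related to another phrase in it.

    doc_phrases: set of phrases found in the document.
    related:     {phrase: set of related phrases} (assumed precomputed).
    """
    present = set()
    for phrase in doc_phrases:
        present |= related.get(phrase, set()) & doc_phrases
    return len(present)


def looks_like_spam(doc_phrases, related, threshold=100):
    """Flag a document whose related-phrase count reaches the threshold."""
    return count_related_phrases(doc_phrases, related) >= threshold


related = {"a": {"b", "c"}, "b": {"a"}}
doc = {"a", "b", "c", "d"}
print(count_related_phrases(doc, related))
print(looks_like_spam(doc, related))
```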
Advantages of Phrase Based
            Indexing

• Detecting Duplicate Pages
• Spam Detection
• Saving time
Other Patent Applications
• Phrase identification in an information retrieval system

• Phrase-based searching in an information retrieval system

• Phrase-based generation of document descriptions

• Detecting spam documents in a phrase based information
  retrieval system

• Efficient Phrase Based Document Indexing for Document
  Clustering
According to data collected from users of European Web
 analytics provider OneStat, most people use 2- or 3-word
 queries in search engines


Two-word phrases -- 28.38 percent
Three-word phrases -- 27.15 percent
Four-word phrases -- 16.42 percent
One-word phrases -- 13.48 percent
Five-word phrases -- 8.03 percent
Six-word phrases -- 3.67 percent
Seven-word phrases -- 1.63 percent
Eight-word phrases -- 0.73 percent
Nine-word phrases -- 0.34 percent
Ten-word phrases -- 0.16 percent
Thank you
