The Search Engine Index http://scienceforseo.blogspot.com IR tutorial series: Part 1
What is an index? The word “index” can mean many things in computing, but in the case of search engines it can be defined as: a database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. Cache-based engines store the index along with the corpus (the collection of documents). When something is added to the corpus, the index is updated.
“Index” We call it that because it's exactly what we called it when it was one of these: And that took its name from the index finger Photo from: http://www.homeschoolinthewoods.com
Why use an index? If we didn't have an index, it would take too much time to search through the whole corpus to find documents that matched our query.  Creating an index means that the retrieval process is faster and the accuracy is better. The search engine doesn't need to scan each document to know what it's about – this saves on storage and makes the whole process faster.
Some things we need to think about
Indexing methods
The inverted index It is an index which has terms as keys. Each term maps to the documents it appears in. The index is sorted by its keys and works well with Boolean operators (AND, OR, AND NOT). We find the documents by matching the terms – this is why we say it is inverted. Diagram by http://developer.apple.com/
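The term-to-documents mapping described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the three sample documents, the whitespace tokenizer and the name `build_inverted_index` are invented for the example and do not come from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical three-document corpus.
docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the sun",
    3: "a dog chased the cat",
}
index = build_inverted_index(docs)

# Boolean operators become set operations on the document sets.
dog_and_cat = index["dog"] & index["cat"]      # AND
dog_or_cat = index["dog"] | index["cat"]       # OR
dog_and_not_cat = index["dog"] - index["cat"]  # AND NOT

print(dog_and_cat, dog_or_cat, dog_and_not_cat)  # {3} {1, 2, 3} {1}
```

Using sets keeps the Boolean operators one-liners; a production index would more likely keep sorted posting lists on disk and merge them.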
Limitations In its basic form, it can only tell us if a word occurs in a particular document. It can't tell us how often the word occurs or where it is located in the document, and it can't rank those documents either. That information is very important because it helps the search engine determine how relevant a document is to a query. So... we look at latent semantic indexing (LSI).
LSI “Semantic” = meaning. “Latent” = present but hidden. It is the analysis of the hidden meaning of words and how often they occur in a document. It can infer meaning from words which isn't obvious: Computer – PC – Laptop => connected. It can put together documents that are not obviously related. It can do this because it creates a “latent semantic space”.
How does LSI work? It uses lots of vectors and creates a “term-document matrix” from all the documents it has. Then 3 matrices are created using SVD (“singular value decomposition”). Of these 3 matrices, the 2nd is a diagonal matrix containing the singular values of the original matrix. Sets of documents are represented as d-dimensional vectors. Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents.
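The decomposition above can be written out as a short worked formula. The symbols are assumptions chosen for this sketch, since the slides name none of them: X is the term-document matrix, U and V are the first and third matrices, Σ is the diagonal matrix of singular values, and k is the number of dimensions kept.

```latex
% SVD of the term-document matrix X (terms as rows, documents as columns)
X = U \, \Sigma \, V^{T},
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r),
\quad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Keeping only the k largest singular values gives the reduced "latent semantic space"
X_k = U_k \, \Sigma_k \, V_k^{T}

% Similarity between two vectors a and b in that space
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
```

A cosine close to 1 means the two vectors point in nearly the same direction, so the terms or documents they represent are treated as similar.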
A quick sketch of LSI [Diagram: box of documents → lots of vectors → term-document matrix → Matrix 1, Matrix 2, Matrix 3 → sets of terms and documents = d-dimensional vectors] There are however some big limitations to this method...
The resulting dimensions can be very difficult to interpret, so there are mistakes. It's unclear what the resulting similarities between terms really mean. The input is a bag-of-words, so we don't have any text structure information. A compound term (“bull-headed”) is treated as 2 terms. Ambiguous terms create noise in the vector space. There's no way to define the optimal dimensionality of the vector space. SVD is expensive to compute, which is a problem for dynamic collections that keep changing.
PLSI “Probabilistic latent semantic indexing” is a better choice because: It has a more robust statistical foundation and provides a proper generative data model. It uses the EM algorithm (expectation maximization) in a way that avoids over-fitting (a model too specific to noise in the data); this makes it far more flexible. It can deal with domain-specific synonymy and polysemous words.
What did all that mean? “Generative data model” – it's used for randomly generating observed data from unknown parameters (HMMs are generative data models, for example). “EM algorithm” – it finds the maximum likelihood estimate of parameters in a probabilistic model (where the model depends on unobserved latent variables); good for machine learning and data clustering. Synonymy – the synonym relation between words. A synonym is when 2 different words mean the same thing. Polysemous – a word that has multiple meanings or interpretations.
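The two terms defined above fit together in the standard PLSI "aspect model", sketched here as a worked formula. The notation is an assumption of this sketch and not taken from the slides: d is a document, w a word, z a hidden topic, and n(d, w) the number of times w occurs in d.

```latex
% Generative model: each word in a document is generated through a hidden topic z
P(d, w) = P(d) \sum_{z} P(w \mid z) \, P(z \mid d)

% E-step: probability that topic z explains the occurrence of word w in document d
P(z \mid d, w) = \frac{P(w \mid z) \, P(z \mid d)}{\sum_{z'} P(w \mid z') \, P(z' \mid d)}

% M-step: re-estimate the parameters from the expected counts
P(w \mid z) \propto \sum_{d} n(d, w) \, P(z \mid d, w),
\qquad
P(z \mid d) \propto \sum_{w} n(d, w) \, P(z \mid d, w)
```

EM alternates the two steps until the likelihood of the observed word counts stops improving.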
How does it work?
How is it different to LSI? The order of the words is lost (but results are still good due to word co-occurrence). Documents can be represented by numeric vectors in a space of words. It retrieves topics. Each query uses the cosine similarity metric to find the similarity between vectors.
More indexing difficulties It's easy for us to pick a document and classify it, well, most of the time, but search engines have other difficulties to overcome before even getting to the classification stage.
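The cosine similarity between a query vector and document vectors can be sketched in plain Python. To keep the example small it uses raw word counts as the vectors, not PLSI topic weights, which is a simplifying assumption; the sample texts and function names are invented for the illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts; word order is deliberately lost."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

query = bag_of_words("dog field")
doc_a = bag_of_words("the dog ran in the field")
doc_b = bag_of_words("the cat sat in the sun")

score_a = cosine_similarity(query, doc_a)
score_b = cosine_similarity(query, doc_b)
print(score_a > score_b)  # True: doc_a shares words with the query, doc_b shares none
```

Ranking documents by this score is what lets a vector-based approach order results, something the basic inverted index cannot do on its own.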
Tokenization Machines don't understand sentences in text. They see everything in bytes. Consider: “The dog ran in the field”. We see 6 words. The machine sees 24 characters (chars). The words found in a document are called “tokens”. Information is extracted from documents to be placed in the index. The tokens may be email addresses, words, URLs, and so on. The part of speech, line number, sentence number, size and so on can be stored in the index.
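A minimal tokenizer along these lines might look like the sketch below. The regular expressions are deliberately simplified assumptions (real URL and email grammars are far more involved), and `TOKEN_PATTERN` and `tokenize` are names made up for the example.

```python
import re

# One alternative per token type; earlier alternatives win when several could match.
TOKEN_PATTERN = re.compile(
    r"(?P<url>https?://\S+)"
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<word>\w+)"
)

def tokenize(text):
    """Return (token type, token text, character offset) for each token found."""
    return [(m.lastgroup, m.group(), m.start()) for m in TOKEN_PATTERN.finditer(text)]

sentence = "The dog ran in the field"
tokens = tokenize(sentence)
print(len(sentence), len(tokens))  # 24 6

mixed = tokenize("Mail bob@example.com or see http://example.org/page")
print([kind for kind, _, _ in mixed])  # ['word', 'email', 'word', 'word', 'url']
```

The character offset kept with each token is the kind of extra information (position, line number and so on) that can be stored in the index next to the token itself.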
Section recognition Before tokenization happens, all the major parts of a document are identified. Some documents are newsletters, others have side navigation, some are reports... and the text can be displayed in columns. Machines will read this sequentially though, and index the words sequentially as well. The difficulty is finding which view of the document is informative. Some engines will index an abstract representation of the document instead. Most engines don't though. This is also why using JavaScript, for example, is avoided.
Formats Documents come in all flavours on the web. There are documents in HTML, PDF, Excel, PowerPoint, and so many others. Before documents are analysed, they are stripped down and the formatting is removed. They are "normalised". It's important for the search engine not to mistake "markup" information for content, or the index gets polluted.
To conclude... The indexing process of a search engine is really very important because if this is wrong, everything is wrong. This is why “Spamdexing” is such an issue. There are a lot of very specialised areas of computing that focus their work on making it easier for machines to create an index. Don't let this short presentation fool you, it is a very, very big research issue. Natural language processing is used for rich text analysis, which helps identify what's going on so that the other computational elements can do their job.
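For HTML, the normalisation step can be sketched with Python's standard `html.parser` module. This is an assumption-laden illustration and not how any particular engine works: the `TextExtractor` class, the choice to skip `script` and `style` contents, and the sample page are all invented for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and ignore tags plus script/style contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Dogs</h1><p>The dog ran in the field.</p></body></html>"
)
text = extract_text(page)
print(text)  # Dogs The dog ran in the field.
```

Only the extracted text would then go on to tokenization, so tag names and style rules never reach the index as if they were content.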
Resources The inverted index in detail: http://tinyurl.com/65hbfd; The seminal PLSI paper: http://tinyurl.com/54wd76; The seminal LSI paper: http://tinyurl.com/5e8v36; The semantic indexing project: http://knowledgesearch.org/; Boulder Uni on LSA: http://lsa.colorado.edu/; Apache Lucene: http://lucene.apache.org/java/docs/; Google test data ($150): http://tinyurl.com/62t4la

Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

The search engine index

  • 1. The Search Engine Index http://scienceforseo.blogspot.com IR tutorial series: Part 1
  • 2. What is an index? The word “index” can mean many things in computing, but in the case of search engines, it can be defined as: A database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. Cache-based engines store the index along with the corpus (collection of documents). When something is added to the corpus, the index is updated.
  • 3. “Index” We call it that because it's exactly what we called it when it was one of these: And that took its name from the index finger Photo from: http://www.homeschoolinthewoods.com
  • 4. Why use an index? If we didn't have an index, it would take too much time to search through the whole corpus to find documents that matched our query. Creating an index means that the retrieval process is faster and the accuracy is better. The search engine doesn't need to scan each document to know what it's about – this saves on storage and makes the whole process faster.
  • 5.
  • 6.
  • 7. The inverted index It is an index that has terms as keys. Each key maps to the documents the term appears in. The index is sorted by its keys and works well with Boolean operators (AND, OR, AND NOT). We find the documents by matching the terms – this is why we say it is inverted. Diagram by http://developer.apple.com/
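As a rough illustration of the inverted index described on this slide, here is a minimal Python sketch. The three-document corpus, the whitespace tokenization and the names `build_inverted_index` and `postings` are assumptions made for the example; they do not come from the slides or from any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term (key) to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Keep the keys sorted, as the slide describes.
    return dict(sorted(index.items()))


# Hypothetical corpus, for illustration only.
docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the sun",
    3: "a dog and a cat",
}
index = build_inverted_index(docs)


def postings(term):
    """Return the documents containing a term (empty set if unseen)."""
    return index.get(term, set())


# Boolean operators become plain set operations on the postings.
dog_and_cat = postings("dog") & postings("cat")   # AND
dog_or_cat = postings("dog") | postings("cat")    # OR
dog_not_cat = postings("dog") - postings("cat")   # AND NOT
```

Because each key maps to a set of document ids, a Boolean query never has to open the documents themselves.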
  • 8. Limitations It can only tell us whether a word occurs in a particular document. It can't tell us how often the word occurs or where in the document it appears, and it can't rank the documents either. That information is very important because it helps the search engine determine how relevant a document is to a query. So... we look at latent semantic indexing (LSI).
  • 9. LSI “Semantic” = meaning. “Latent” = present but hidden. It is the analysis of the hidden meaning of words and how often they occur in a document. It can infer relationships between words that aren't obvious: Computer – PC – Laptop => connected. It can group together documents that are not obviously related. It can do this because it creates a “latent semantic space”.
  • 10. How does LSI work? It uses lots of vectors and creates a “term-document matrix” from all the documents it has. Then 3 matrices are created using SVD (“singular value decomposition”). Of these 3 matrices, the 2nd is a diagonal matrix containing the singular values of the original matrix. Sets of documents are represented as d-dimensional vectors. Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents.
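The decomposition on this slide can be written out as follows. The symbols are assumptions chosen for the example: A is the term-document matrix, U and V are the first and third matrices, Σ is the diagonal matrix of singular values, and d is the number of dimensions kept.

```latex
% Full decomposition of the term-document matrix A (rank r)
A = U \Sigma V^{\mathsf{T}}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Keeping only the d largest singular values gives the reduced space
A_d = U_d \, \Sigma_d \, V_d^{\mathsf{T}}, \qquad d \le r

% Similarity between two d-dimensional vectors x and y
\cos\theta(\mathbf{x}, \mathbf{y})
  = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}
```

A cosine close to 1 means the two vectors point in nearly the same direction in the reduced space; a cosine of 0 means they share nothing.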
  • 11. A quick sketch of LSI [Diagram labels: box of documents; lots of vectors; term-document matrix; Matrix 1, Matrix 2, Matrix 3; sets of terms and documents = d-dimensional vectors] There are, however, some big limitations to this method...
  • 12. The resulting dimensions can be very difficult to interpret, so there are mistakes. It's unclear what the resulting similarities between terms really mean. The input is a bag-of-words, so we don't have any text structure information. A compound term (“bull-headed”) is treated as 2 terms. Ambiguous terms create noise in the vector space. There's no way to define the optimal dimensionality of the vector space. SVD is computationally expensive, which is a problem for dynamic collections.
  • 13. PLSI “Probabilistic latent semantic indexing” is a better choice because: It has a more robust statistical foundation and provides a proper generative data model. It uses the EM algorithm (Expectation Maximization) to avoid over-fitting (a model that is too specific to noise) – this makes it far more flexible. It can deal with domain-specific synonymy and polysemous words.
  • 14. What did all that mean? “Generative data model” – it's used for randomly generating observed data from unknown parameters (HMMs are generative data models, for example). “EM algorithm” – it finds the maximum likelihood estimate of parameters in a probabilistic model (where the model depends on unobserved latent variables) – good for machine learning and data clustering. Synonymy – the synonym relation between words; synonyms are 2 different words that mean the same thing. Polysemous – describes a word that has multiple meanings or interpretations.
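To make the two terms above concrete, the standard PLSI formulation can be sketched as below. The notation is an assumption for the example and does not appear on the slides: d is a document, w a word, z an unobserved latent topic, and n(d, w) the number of times w occurs in d.

```latex
% Generative data model: a (document, word) pair is generated via a latent topic z
P(d, w) = \sum_{z} P(z) \, P(d \mid z) \, P(w \mid z)

% EM, E-step: posterior of the latent topic for each observed pair
P(z \mid d, w) =
  \frac{P(z) \, P(d \mid z) \, P(w \mid z)}
       {\sum_{z'} P(z') \, P(d \mid z') \, P(w \mid z')}

% EM, M-step: re-estimate the parameters from the expected counts
P(w \mid z) \propto \sum_{d} n(d, w) \, P(z \mid d, w), \qquad
P(d \mid z) \propto \sum_{w} n(d, w) \, P(z \mid d, w), \qquad
P(z) \propto \sum_{d, w} n(d, w) \, P(z \mid d, w)
```

The two steps are repeated until the likelihood stops improving; each M-step quantity is normalised so that it sums to 1.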
  • 15.
  • 16. How is it different to LSI? The order of the words is lost (but results are still good due to word co-occurrence). Documents can be represented by numeric vectors in a space of words. It retrieves topics. Each query uses the cosine similarity metric to find the similarity between vectors.
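Two of the points on this slide, that word order is lost and that a query is compared to documents by cosine similarity, can be illustrated with a small Python sketch. It only shows plain bag-of-words vectors, not PLSI's topic model, and the example sentences and function names are assumptions made for the illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts; word order is discarded."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


query = bag_of_words("dog field")
doc_a = bag_of_words("the dog ran in the field")
doc_b = bag_of_words("field the in ran dog the")  # same words, shuffled
doc_c = bag_of_words("the cat sat in the sun")

sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
sim_c = cosine(query, doc_c)
```

Here `doc_a` and `doc_b` receive the same score because they contain the same words, while `doc_c` shares no word with the query and scores 0.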
  • 17. More indexing difficulties It's easy for us to pick a document and classify it (well, most of the time), but search engines have other difficulties to overcome before even getting to the classification stage.
  • 18. Tokenization Machines don't understand sentences in text. They see everything in bytes. Consider: “The dog ran in the field”. We see 6 words. The machine sees 24 characters (chars). The words found in a document are called “tokens”. Information is extracted from documents to be placed in the index. The tokens may be email addresses, words, URLs... The part of speech, line number, sentence number, size and so on can be stored in the index.
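The example sentence can be run through a small tokenizer to see both views at once. This is a deliberately simple sketch: the regular expression and the `tokenize` helper are assumptions for illustration and are far cruder than what a real engine would use.

```python
import re

sentence = "The dog ran in the field"

# Simplified pattern (an assumption, not a production tokenizer):
# try URLs first, then email addresses, then plain words.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"                      # URLs
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"     # email addresses
    r"|\w+(?:[-']\w+)*"                  # words, including hyphenated ones
)


def tokenize(text):
    """Return (token, character offset) pairs in reading order."""
    return [(m.group(), m.start()) for m in TOKEN_PATTERN.finditer(text)]


tokens = tokenize(sentence)
num_chars = len(sentence)   # what the machine sees
num_tokens = len(tokens)    # what we see

# Tokens are not always words.
mixed = tokenize("Mail bob@example.com or see http://example.com/docs today")
```

Storing the offset next to each token is one way the index could later record where in the document a token occurred.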
  • 19. Section recognition Before tokenization happens, all the major parts of a document are identified. Some documents are newsletters, others have a side navigation, some are reports... and the text can be displayed in columns. Machines will read this sequentially, though, and index the words sequentially as well. The difficulty is finding which view of the document is informative. Some engines will index an abstract representation of the document instead. Most engines don't, though. This is also why using JavaScript, for example, is avoided.
  • 20. Formats Documents come in all flavours on the web. There are documents in HTML, PDF, Excel, PowerPoint, and so many others. Before documents are analysed, they are stripped down and the formatting is removed. They are "normalised". It's important for the search engine not to mistake "markup" information for content, or the index gets polluted.
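For HTML, the normalisation step described here might look roughly like the following sketch, which uses Python's standard `html.parser` module. The `TextExtractor` class, the `normalise` function and the sample page are hypothetical; other formats such as PDF would need their own extractors.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and <script>/<style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def normalise(html):
    """Strip markup and return the plain text to be indexed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page, for illustration only.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Index</h1><p>The <b>dog</b> ran.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = normalise(page)
```

Without the `SKIP` check, terms such as `color` or `var` would end up in the index as if they were content, which is the pollution the slide warns about.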
  • 21. To conclude... The indexing process of a search engine is very important because if it is wrong, everything is wrong. This is why “Spamdexing” is such an issue. There are a lot of very specialised areas of computing that focus their work on making it easier for machines to create an index. Don't let this short presentation fool you: it is a very big research issue. Natural language processing is used for rich text analysis, which helps identify what's going on so that the other computational elements can do their job.
  • 22. Resources The inverted index in detail (http://tinyurl.com/65hbfd); The seminal PLSI paper (http://tinyurl.com/54wd76); The seminal LSI paper (http://tinyurl.com/5e8v36); The semantic indexing project (http://knowledgesearch.org/); Boulder Uni on LSA (http://lsa.colorado.edu/); Apache Lucene (http://lucene.apache.org/java/docs/); Google test data ($150) (http://tinyurl.com/62t4la)