Full Text Search




     for when a database is not enough...
TOC

● What is "Full text search"?
● How does it work?
● What is it good for?
● What makes it so good?
● Common Characteristics
● Some of the most known solutions
● Who uses them?
● Practical Example
What is full text search?

Wikipedia says: full text search refers to a technique for searching a
computer-stored document or database. In a full text search, the search engine
examines all of the words in every stored document as it tries to match search
words supplied by the user.



I say: Full text search is a technique for searching documents or databases
that allows for a more relevant search (getting the results that we need instead
of the results that just "match" with our query).
How does it work?

In order to do a full text search, we first have to index all the information.

There are several techniques for indexing, but the basic idea behind it is as
follows:

 1. Scan the document
 2. For every word within the document, create an entry in the index with that
    word and its relative position within the document.
 3. Apply specific rules to the terms, such as:
      ○ Ignoring stop words
      ○ Stemming
      ○ etc
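
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine does it: the build_index function, the stop-word list and the sample documents are all made up for the example, and stemming is left out.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption); real analyzers ship longer ones.
STOP_WORDS = {"the", "a", "an", "by", "can"}


def build_index(documents):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Step 1: scan the document and split it into lowercase words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            # Step 3: apply rules to the terms (here, only stop-word removal).
            if word in STOP_WORDS:
                continue
            # Step 2: record the word and its relative position.
            index[word][doc_id].append(position)
    return index


docs = {
    1: "The quick brown fox",
    2: "A quick search by the fox",
}
index = build_index(docs)
print(dict(index["quick"]))  # {1: [1], 2: [1]}
print(dict(index["fox"]))    # {1: [3], 2: [5]}
```

Looking up a word is then a dictionary access instead of a scan over every document, which is where the speed comes from.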
... how? part II

We have the index ready, now what?
Depending on the solution used, we'll have access to a formal query
language. Using that, we can tell our engine what we're looking for.

Something like:
title:"The Right Way" AND text:goorjakarta^4 apache

This tells our search engine to look for documents whose title is "The
Right Way" and whose text also contains the words "goorjakarta" and "apache".
The only difference between the two words is the ^4 boost, which makes
"goorjakarta" 4 times more important than "apache".
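
To get a feel for what a boost does to ranking, here is a toy scorer in Python. It is only a sketch under simple assumptions: the score function and the sample texts are hypothetical, and real engines combine boosts with other signals such as how often a term occurs.

```python
def score(text, boosts):
    """Sum the boost of every query term that appears in the text."""
    words = set(text.lower().split())
    return sum(boost for term, boost in boosts.items() if term in words)


# Hypothetical query: "goorjakarta" boosted to 4, "apache" left at the default 1.
query = {"goorjakarta": 4.0, "apache": 1.0}

texts = {
    "doc_a": "apache is mentioned here",
    "doc_b": "goorjakarta is mentioned here",
    "doc_c": "goorjakarta and apache together",
}
ranked = sorted(texts, key=lambda d: score(texts[d], query), reverse=True)
print(ranked)  # ['doc_c', 'doc_b', 'doc_a']
```

A document that only mentions the boosted word still ranks above one that only mentions the other word.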
What is it good for?

Full text search allows us to search (well duh!) very large amounts of
information in a very short time.


Solutions of this type are generally used when the size of the database to be
searched reaches the gigabytes.


It is normally used for searching inside the content of documents, such as Word
documents, Excel spreadsheets, web pages, etc.
What makes it so good?
Full text search is great! (but why?)

Some of the most important characteristics of all full text search
solutions are:

- Relevant search: The results we get can be sorted by relevance, which
makes it easy for users to find what they are looking for (e.g. if we search for "red"
and "apple" we want to get the fruit and not results about the Apple company).

- Keywords: When indexing, keywords can be assigned to different parts of the
documents, allowing for a more specific type of query.

- Wildcards: A great tool that allows us to search for terms when we don't know
exactly how they are written.

- Fuzzy search: Using these techniques, we can search for terms that are close to
the ones in our query string.
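
As a rough illustration of fuzzy search, Python's standard difflib module can pick the indexed terms closest to a misspelled query word. This is just a sketch: the vocabulary is invented, and the similarity measure difflib uses is not necessarily the one a given search engine applies.

```python
from difflib import get_close_matches

# Hypothetical list of terms taken from an index.
vocabulary = ["search", "stemming", "server", "index"]

# "serch" is a typo; ask for terms whose similarity ratio is at least 0.6.
matches = get_close_matches("serch", vocabulary, n=3, cutoff=0.6)
print(matches)  # ['search']
```

Raising the cutoff makes the match stricter, lowering it lets in more distant terms.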
Common characteristics

Let's talk about some of the most common characteristics
amongst full text search solutions.

 ● Precision vs. Recall
 ● Stopwords
 ● Stemming
 ● Wildcards
Precision vs. recall tradeoff

Precision: Number of relevant results returned divided by the
total of results returned.

Recall: Number of relevant results returned divided by the total
of relevant results.

When choosing a solution, it is important to balance these two
concepts correctly. An increase in precision usually means a
decrease in recall, and vice versa.
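
Both definitions translate directly into code. The snippet below is a small sketch with made-up document ids: four results are returned, two of them are relevant, and one relevant document is missed.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for lists of returned and relevant ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant results that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


returned = ["d1", "d2", "d3", "d4"]  # what the engine gave us
relevant = ["d1", "d2", "d5"]        # what we actually wanted
p, r = precision_recall(returned, relevant)
print(f"{p:.2f} {r:.2f}")  # 0.50 0.67
```

Returning more documents could pick up "d5" and raise recall, but every extra irrelevant result lowers precision.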
Stopwords

Stopwords are terms that are too common in a language and
therefore are not specific enough to be of use when
searching.

Some examples are words like "the", "a", "an", "by",
"can", etc.

They're normally ignored by full text analyzers when indexing
information.
Stemming

Stemming allows us to reduce a word to its root form (or stem)
in order to generalize terms while searching. Note that this is
not the same as synonyms.

For example, a stemmer would generalize words like "catlike",
"catty" and "cats" to their root form: "cat".
W?ldc*ds (A.k.a: Wildcards)

Wildcards are a bit better known and they do what you'd expect
them to do: they are used in place of characters when you don't
know exactly how your search terms are spelled.

Wildcard characters may vary from one solution to the other,
but there are normally two: one that represents a single
character, and one that represents a group of them.

For example: the string 'hel*' would match words like 'hello',
'helium' and others, while the string 'hel?' would only match
words that begin with "hel" and end with one more character,
like "hell" but not "helium".
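
The same behaviour can be tried out with Python's standard fnmatch module, which happens to use '*' and '?' in the same way. The word list is invented, and scanning every word like this is only for illustration; it says nothing about how an engine resolves wildcards against its index.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed words.
words = ["hello", "helium", "hell", "help", "shell"]

# '*' stands for any group of characters (including none).
print([w for w in words if fnmatchcase(w, "hel*")])  # ['hello', 'helium', 'hell', 'help']

# '?' stands for exactly one character.
print([w for w in words if fnmatchcase(w, "hel?")])  # ['hell', 'help']
```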
Some of the most known solutions 

There are different types of solutions: some of them are just
APIs that can be integrated into our projects, whilst others are
servers that provide an entire layer of services between our
application and the information.
Some examples are:

APIs:
 ● Xapian
 ● Lucene

Servers:
 ● Sphinx
 ● Solr
... a bit more about Lucene and Xapian

There are many more, but those are some of the best-known
ones...

Xapian and Lucene are both APIs, but they work differently:
Xapian needs bindings for every language in order to
be compatible.
In the case of Lucene, there are specific implementations of
Lucene for every compatible language.
... and a bit more about Sphinx and Solr

On the other hand, Solr (which is based on Lucene) and
Sphinx are both full text search servers.

They both provide their functionalities through interfaces and
not directly inside the application.

Sphinx is designed to be efficient while indexing database
content.
Who uses them?

These types of solutions are used by many companies, for
example:


- Debian uses Xapian for many tasks, one of them
is searching its archive of software packages
- NASA Planetary Data System (PDS) uses Solr to search for
dataset, mission, instrument, target, and host information
- Digg uses Solr for searching their site
- Craigslist uses Sphinx
- Moove-it! has used Sphinx on some of its projects
- And many more...
Practical Example

Let's take a look at a very original example...
Thanks for reading...

 ... and happy searching!
