2. TOC
● What is "Full text search"?
● How does it work?
● What is it good for?
● What makes it so good?
● Common Characteristics
● Some of the most known solutions
● Who uses them?
● Practical Example
3. What is full text search?
Wikipedia says: full text search refers to a technique for searching a
computer-stored document or database. In a full text search, the search engine
examines all of the words in every stored document as it tries to match search
words supplied by the user.
I say: Full text search is a technique for searching documents or databases
that returns more relevant results (the results that we actually need, instead
of the results that merely "match" our query).
4. How does it work?
In order to do a full text search, we first have to index all the information.
There are several techniques for indexing, but the basic idea behind it is as
follows:
1. Scan the document
2. For every word within the document, create an entry in the index with that
word, and with the relative position within the document.
3. Apply specific rules to the terms, such as:
○ Ignoring stop words
○ Stemming
○ etc
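The three steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular engine does it: the stop-word list, the `naive_stem` rule and the `build_index` helper are all assumptions made up for this example.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"the", "a", "an", "by", "can"}

def naive_stem(word):
    # Toy rule: drop a trailing "s". Real stemmers are far more thorough.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def build_index(documents):
    """Map each term to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # 1. Scan the document and split it into lowercase words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            # 3. Apply the rules: skip stop words, then stem.
            if word in STOP_WORDS:
                continue
            # 2. Record the term with its position within the document.
            index[naive_stem(word)].append((doc_id, position))
    return dict(index)

docs = {
    "doc1": "The cats chase a mouse",
    "doc2": "A mouse can hide by the door",
}
index = build_index(docs)
print(index["mouse"])
```

Looking up a term is now a dictionary access instead of a scan over every document, which is where the speed comes from.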
5. ... how? part II
We have the index ready, now what?
Depending on the solution used, we'll have access to a formal querying
language. Using that, we can query our engine to tell it what we're looking for.
Something like:
title:"The Right Way" AND text:goorjakarta^4 apache
This tells our search engine to look for documents whose title is "The
Right Way" and whose text contains the words "goorjakarta" and "apache".
The only difference is that "goorjakarta" is 4 times more important
than the word "apache".
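To get a feel for what the ^4 boost does to the ranking, here is a minimal sketch. It assumes a made-up scoring rule (term count multiplied by boost) and a hypothetical `score` function; real engines use more elaborate relevance formulas.

```python
def score(text, boosts):
    """Sum of (occurrences of term) * (boost for that term)."""
    words = text.lower().split()
    return sum(words.count(term) * boost for term, boost in boosts.items())

# Mirrors the query above: "goorjakarta" is boosted by 4,
# "apache" keeps the default weight of 1.
boosts = {"goorjakarta": 4, "apache": 1}

docs = {
    "a": "apache apache apache",
    "b": "goorjakarta",
}

# Highest score first.
ranked = sorted(docs, key=lambda d: score(docs[d], boosts), reverse=True)
print(ranked)
```

Under this toy rule, a single occurrence of the boosted term outranks three occurrences of the unboosted one.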
6. What is it good for?
Full text search allows us to search (well duh!) very large amounts of
information in a very short time.
These solutions are generally used when the size of the database to be
searched reaches the gigabytes.
It is normally used for searching inside the content of documents, such as Word
documents, Excel spreadsheets, web pages, etc.
7. What makes it so good?
Full text search is great! (but why?)
Some of the most important characteristics shared by full text search
solutions are:
- Relevant search: The results we get can be sorted by relevance, which
lets users find what they are looking for easily. (e.g. if we search for "red"
and "apple" we want to get the fruit and not results about the Apple company)
- Keywords: When indexing, keywords can be assigned to different parts of the
documents, allowing for a more specific type of query.
- Wildcards: A great tool that allows us to search for terms when we don't know
exactly how they are spelled.
- Fuzzy search: Using this technique, we can search for terms that are close to
the ones in our query string.
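One common way to decide whether two terms are "close" is edit distance. The sketch below assumes Levenshtein distance and a hypothetical `fuzzy_lookup` helper; actual engines may measure similarity differently and use much faster data structures.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]

def fuzzy_lookup(query, terms, max_distance=1):
    """Return the indexed terms within max_distance edits of the query."""
    return [t for t in terms if edit_distance(query, t) <= max_distance]

terms = ["apple", "apply", "maple", "banana"]
matches = fuzzy_lookup("aple", terms)
print(matches)
```

Raising `max_distance` finds more candidates but also lets in more unrelated terms.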
8. Common characteristics
Let's talk about some of the most common characteristics
amongst full text search solutions.
● Precision vs. Recall
● Stopwords
● Stemming
● Wildcards
9. Precision vs. recall tradeoff
Precision: Number of relevant results returned divided by the
total of results returned.
Recall: Number of relevant results returned divided by the total
of relevant results.
When choosing a solution, it is important to balance these two
concepts correctly. An increase in precision usually means a
decrease in recall, and the opposite also applies.
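A quick worked example of both definitions. The document ids and the `precision_recall` helper are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from the returned and the relevant result sets."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant results that were actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

returned = ["d1", "d2", "d3", "d4"]  # what the engine gave us
relevant = ["d1", "d2", "d5"]        # what we actually wanted
precision, recall = precision_recall(returned, relevant)
print(precision, recall)
```

Here 2 of the 4 returned results are relevant (precision 0.5), and 2 of the 3 relevant documents were found (recall of about 0.67). Returning more documents could bring in "d5" and raise recall, at the risk of lowering precision.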
10. Stopwords
Stopwords are terms that are too common in a language and
therefore are not specific enough to be of use when
searching.
Some examples of these are words like "the", "a", "an", "by",
"can", etc.
They're normally ignored by full text analyzers when indexing
information.
11. Stemming
Stemming allows us to reduce a word to its root form (or stem)
in order to generalize terms while searching. Note that this is
not the same as matching synonyms.
For example, a stemmer would generalize words like "catlike",
"catty" and "cats" to their root form: "cat".
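A toy suffix-stripping stemmer shows the idea. The suffix list and the `toy_stem` function are assumptions chosen only to cover the example words above; production stemmers apply far more rules.

```python
# Hypothetical suffix list, picked just for the example words.
SUFFIXES = ("like", "ty", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ["catlike", "catty", "cats", "cat"]}
print(stems)
```

The same stemmer has to run both at indexing time and at query time, so that a search for "cats" hits documents indexed under "cat".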
12. W?ldc*ds (A.k.a: Wildcards)
Wildcards are better known and they do what you'd expect
them to do: they are used in place of characters when you don't
know exactly how your search terms are spelled.
Wildcard characters may vary from one solution to another,
but there are normally two: one that represents a single
character, and one that represents a group of them.
For example: the string 'hel*' would match words like 'hello',
'helium' and others, while the string 'hel?' would only match
words that begin with "hel" and end with one more character,
like "hell" but not "helium".
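Python's standard `fnmatch` module happens to use the same two wildcard characters, so it can demonstrate the behaviour. The word list and the `wildcard_search` helper are made up for this sketch; search engines match wildcards against their index rather than a plain list.

```python
from fnmatch import fnmatchcase

words = ["hello", "helium", "hell", "help", "shell"]

def wildcard_search(pattern, words):
    """Return the words matching a pattern where * is any run and ? is one character."""
    return [w for w in words if fnmatchcase(w, pattern)]

star = wildcard_search("hel*", words)    # "hel" followed by anything
single = wildcard_search("hel?", words)  # "hel" followed by exactly one character
print(star)
print(single)
```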
13. Some of the best known solutions
There are different types of solutions: some of them are just
APIs that can be integrated into our projects, whilst others are
servers that provide an entire layer of services between our
application and the information.
Some examples of these are:
APIs:
● Xapian
● Lucene
Servers:
● Sphinx
● Solr
14. ... a bit more about Lucene and Xapian
There are many more, but those are some of the best known
ones...
Xapian and Lucene are both APIs but they work differently:
Xapian needs bindings for every language in order to
be compatible.
In the case of Lucene, there are specific implementations of
Lucene for every compatible language.
15. ... and a bit more about Sphinx and Solr
On the other hand, Solr (which is based on Lucene) and
Sphinx are both full text search servers.
They both provide their functionality through external interfaces
rather than directly inside the application.
Sphinx is designed to be efficient while indexing database
content.
16. Who uses them?
These types of solutions are used by many companies, for
example:
- Debian uses Xapian for many tasks, one of which
is searching its archive of software packages
- NASA Planetary Data System (PDS) uses Solr to search for
dataset, mission, instrument, target, and host information
- Digg uses Solr for searching their site
- Craigslist uses Sphinx
- Moove-it! has used Sphinx on some of its projects
- And many more...