Full text search

Rahila Syed
Beena Emerson

© 2013 NTT DATA, Inc.

Index
• Full text search and its types
• Full text search in PostgreSQL
• PostgreSQL extension
• Similarity Search

Full Text Search



What is full text search?
• Searching for a group of keywords in a pile of texts
– Document
– Query
– Similarity
• Full text search in database
– Searching for a set of keywords in a text field of a database table
– The data used for full text search can be huge
– Indexing words and associating indexed words with documents



Full Text Search in PostgreSQL



Steps
• Creating Tokens
– Parsing the document into a set of tokens such as numbers, words, complex words and email addresses.
• Creating Lexemes
– Normalization, controlled by dictionaries:
    • Removal of suffixes: converts variants into a single form (worry, worries, worried, etc.)
    • Conversion to lower case
    • Removal of stop words: common words that are useless for searching (the, at, etc.)
• Storing preprocessed documents
– Storing documents and creating indexes over them for faster search
• Relevance ranking
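
The token and lexeme steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list and the suffix rules are hypothetical stand-ins, far cruder than the dictionaries PostgreSQL actually uses.

```python
import re

# Hypothetical stop-word list; real text search configurations ship larger ones.
STOP_WORDS = {"a", "at", "be", "in", "is", "of", "the", "this", "to"}


def stem(word):
    """Toy suffix stripper (an assumption, much simpler than a real stemmer)."""
    if word.endswith(("ies", "ied")):
        return word[:-3] + "i"
    if word.endswith("ed"):
        return word[:-2]
    if word.endswith("y") and len(word) > 2:
        return word[:-1] + "i"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def to_lexemes(document):
    """Map each lexeme to the 1-based positions of the tokens it came from."""
    positions = {}
    tokens = re.findall(r"[A-Za-z0-9]+", document)       # creating tokens
    for pos, token in enumerate(tokens, start=1):
        word = token.lower()                              # lower-casing
        if word in STOP_WORDS:                            # stop words are dropped
            continue                                      # but still use a position
        positions.setdefault(stem(word), []).append(pos)  # suffix removal
    return positions


print(to_lexemes("Glad to be part of this meetup"))
print({stem(w) for w in ("worry", "worries", "worried")})
```

The three variants of "worry" collapse into one lexeme, and stop words leave gaps in the position numbering.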




• Full integration
• 27 built-in configurations for 10 languages
• Support of user-defined FTS configurations
• Pluggable dictionaries (ispell, snowball, thesaurus) and parsers
• Relevance ranking
• GIN and GiST indexes


Morphological Search
• Indexed tokens are words of a language
• Eg. tree, book, rain
• Small index size
• Cannot match orthographical variants
• Search results depend on the division of words
• Used for large documents like theses
• Ex. tsvector

N-gram search
• Indexed tokens are character sequences
• Eg. _t, tr, re, e_ (2-grams)
• Big index size
• Good with orthographical variants
• Results closer to indexed LIKE
• Better suited for a limited set of words
• Ex. pg_bigm, pg_trgm

Why full text search?
• Searches for similar words (LIKE and regular expressions have no linguistic support)
• Ranking of search results
• Searches substrings
– Ordinary (B-tree) indexes do not support substring search
– The LIKE operator cannot use such an index when the pattern starts with %
– LIKE has low performance compared to full text search using GIN and GiST
• Accuracy issue
– Eg. LIKE '%one%' also matches prone, money, lonely
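
The accuracy issue can be reproduced outside the database. The sketch below, which uses an invented word list, contrasts substring matching (what LIKE '%one%' does) with matching on whole tokens.

```python
import re

words = ["one", "prone", "money", "lonely"]

# Substring matching, as LIKE '%one%' does: every word contains "one".
substring_hits = [w for w in words if "one" in w]

# Word-level matching, as a search over tokens does: only the exact word.
text = " ".join(words)
token_hits = [t for t in re.findall(r"\w+", text) if t == "one"]

print(substring_hits)
print(token_hits)
```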



Measurement results
• POSIX Expression
=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc ~ 'postgresql';
                              QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=10.871..390.019 rows=250 loops=1)
   Filter: (doc ~ 'postgresql'::text)
   Rows Removed by Filter: 11397
 Total runtime: 390.060 ms

• LIKE Query
=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc LIKE '%postgresql%';
                              QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=1.342..110.107 rows=250 loops=1)
   Filter: (doc ~~ '%postgresql%'::text)
   Rows Removed by Filter: 11397

Measurement results
• Full Text Search
 Nested Loop (cost=352.83..508.22 rows=107 width=64) (actual time=1.397..1.575 rows=250 loops=1)
   -> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=0.023..0.023 rows=1 loops=1)
   -> Bitmap Heap Scan on full_text_search (cost=352.83..507.14 rows=107 width=32) (actual time=1.371..1.516 rows=250 loops=1)
         Recheck Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
         -> Bitmap Index Scan on full_search_idx (cost=0.00..352.80 rows=107 width=0) (actual time=1.354..1.354 rows=348 loops=1)
               Index Cond: (query.query @@ to_tsvector('english'::regconfig, doc))

Ranking Example
Normal Search:
SELECT * FROM tbl WHERE col1 LIKE 'The tiger is the largest cat species';
                 col1
--------------------------------------
 The tiger is the largest cat species
(1 row)

Full Text Search (using the pg_trgm similarity() function and % operator):
SELECT col1, similarity(col1, 'The tiger is the largest cat species') AS sml
  FROM tbl_t WHERE col1 % 'The tiger is the largest cat species'
  ORDER BY sml DESC, col1;
                  col1                   |   sml
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
 The cheetah is the fastest cat species  | 0.466667
(3 rows)

Indexes Used in Full Text Search
• GIN (Generalized Inverted Index)
• Custom strategies for particular data types
• Inverted indexes
• Interface for custom data types
• Slower to update
• Deterministic
• Appropriate for fixed data sets

KEY      TID
Meetup   100, 140
Pune     100, 150
Here     100
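
The key-to-TID table above is the essence of an inverted index. A minimal Python sketch follows; the row texts are invented so that the resulting index reproduces the table, and the sketch makes no claim about how GIN stores its entries internally.

```python
from collections import defaultdict

# Hypothetical rows keyed by TID, chosen to reproduce the table above.
rows = {
    100: "Meetup Pune Here",
    140: "Meetup",
    150: "Pune",
}


def build_inverted_index(rows):
    """Map every lower-cased word to the sorted list of TIDs containing it."""
    index = defaultdict(list)
    for tid in sorted(rows):
        for word in set(rows[tid].lower().split()):
            index[word].append(tid)
    return dict(index)


def search(index, *words):
    """Return the TIDs that contain all of the given words."""
    postings = [set(index.get(w.lower(), [])) for w in words]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(rows)
print(index["meetup"], index["pune"], index["here"])
print(search(index, "Meetup", "Pune"))
```

A lookup touches only the posting lists of the searched words, which is why reads are fast while every insert has to update one list per distinct word.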



Indexes Used in Full Text Search
• GiST (Generalized Search Tree)
• Interface for data types and access methods
• Document is represented in the index by a fixed-length signature
• Based on hashing (each word is hashed to a bit of the signature)
• Probability of false match
• Table row must be retrieved to see if the match is correct
• Inappropriate for large data sets
• Filtering data at the end of index search to remove false matches
EXPLAIN SELECT * FROM tab WHERE textsearch @@ to_tsquery('Mountain');
                              QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using text_search_idx on tab (cost=0.00..12.29 rows=2 width=1469)
   Index Cond: (textsearch @@ '''Mountain'''::tsquery)
   Filter: (textsearch @@ '''Mountain'''::tsquery)
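
Why a signature index produces false matches, and why the final filter is needed, can be shown with a toy model. In the sketch below the signature width and the hash function are assumptions made for illustration (the hash is deliberately weak so that anagrams collide); it is not the hashing GiST really uses.

```python
SIGNATURE_BITS = 16  # toy signature width


def word_bit(word):
    """Toy hash (an assumption): anagrams map to the same bit on purpose."""
    return 1 << (sum(map(ord, word)) % SIGNATURE_BITS)


def signature(words):
    """OR together one bit per word to get a fixed-length document signature."""
    sig = 0
    for word in words:
        sig |= word_bit(word)
    return sig


docs = {1: "listen to the rain", 2: "silent night"}
signatures = {tid: signature(text.split()) for tid, text in docs.items()}


def search(word):
    bit = word_bit(word)
    # Index step: lossy, may include rows that do not contain the word.
    candidates = [tid for tid, sig in signatures.items() if sig & bit]
    # Filter step: fetch the row and verify the match.
    matches = [tid for tid in candidates if word in docs[tid].split()]
    return candidates, matches


candidates, matches = search("listen")
print(candidates, matches)
```

Row 2 survives the signature test only because "silent" sets the same bit as "listen"; the filter step removes it.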



tsvector
• Representation of document best suited for full text search
• Normalized lexemes formed by pre-processing of the documents
• Functions to convert normal text to tsvector:
• to_tsvector
to_tsvector([ config regconfig, ] document text) returns tsvector
=# SELECT to_tsvector('english', 'Glad to be part of this meetup');
         to_tsvector
------------------------------
 'glad':1 'meetup':7 'part':4
(1 row)

• The query above specifies 'english' as the configuration used to parse and normalize the string. The value of default_text_search_config is used if the configuration argument is omitted.

tsquery
• Representation of search query best suited for full text search
• Normalized lexemes formed by processing the query
• May be combined using the AND (&), OR (|) and NOT (!) operators
• Contains all keywords used for the search

tsquery
• Functions to convert normal text to tsquery:
• to_tsquery
to_tsquery([ config regconfig, ] querytext text) returns tsquery
=# SELECT to_tsquery('meetups & in & ! Pune');
     to_tsquery
--------------------
 'meetup' & !'pune'
(1 row)

• plainto_tsquery
plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
=# SELECT plainto_tsquery('english', 'meetups in Pune');
  plainto_tsquery
-------------------
 'meetup' & 'pune'
(1 row)

Match operator @@
• Matches a tsvector (document) against a tsquery (search words)
• Returns true if the tsvector satisfies the tsquery; for an AND query, all tsquery lexemes must be present in the tsvector of the document
=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('PostgreSQL Meetups');
 ?column?
----------
 t
(1 row)
=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('Pune meetup');
 ?column?
----------
 f
(1 row)
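
For the AND-only queries that plainto_tsquery builds, the match test reduces to a subset check on normalized lexemes. The Python sketch below reproduces the two results above; its normalizer (a two-word stop list and plural stripping) is a hypothetical simplification, not the behaviour of a real text search configuration.

```python
import re

STOP_WORDS = {"this", "to"}  # hypothetical, minimal stop-word list


def lexemes(text):
    """Lower-case, drop stop words and strip a plural 's' (toy normalizer)."""
    result = set()
    for token in re.findall(r"[A-Za-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        result.add(token)
    return result


def matches(document, query):
    """True when every query lexeme occurs in the document (AND semantics)."""
    return lexemes(query) <= lexemes(document)


doc = "Welcome to this postgresql meetup"
print(matches(doc, "PostgreSQL Meetups"))
print(matches(doc, "Pune meetup"))
```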



Full text search without index
SELECT * FROM <table> WHERE to_tsvector('<config>', <colname>) @@ to_tsquery('<config>', '<search word>');

The configuration argument of to_tsvector and to_tsquery should be the same.
Example:
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ to_tsquery('english', 'enjoy');
              col
--------------------------------
 He enjoyed the party
 He enjoys the classical music.
(2 rows)

Full text search using index
• Creating the index
CREATE INDEX <index_name> ON <table> USING gin(to_tsvector('<config>', <col>));

• Performing search using the index:
SELECT * FROM <table> WHERE to_tsvector('<config>', <col>) @@ plainto_tsquery('<config>', '<search word>');

Example:
=# CREATE INDEX idx ON tbl USING gin(to_tsvector('english', col));
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ plainto_tsquery('english', 'enjoy');
 col
(2 rows)

Full text search using separate column
• Procedure
– Create a column of tsvector type
– Define a trigger which automatically updates the tsvector column
– Perform the search on the tsvector column

• Advantages:
– No need to specify the text search configuration in every query in order to make use of the index
– Faster searches, as the to_tsvector function is not called for each search query

Example:
=# CREATE TABLE tbl (col text, tsv_col tsvector);

=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
   ON tbl FOR EACH ROW EXECUTE PROCEDURE
   tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
=# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');
=# SELECT * FROM tbl;
              col               |             tsv_col
--------------------------------+---------------------------------
 He enjoyed the party           | 'enjoy':2 'parti':4
 He enjoys the classical music. | 'classic':4 'enjoy':2 'music':5
 The moon winked at him         | 'moon':2 'wink':3
(3 rows)

Example (continued): searching the tsvector column of the table created above
=# SELECT col FROM tbl WHERE tsv_col @@ to_tsquery('enjoys');
 col
(2 rows)

Ranking
• ts_rank
– Lexical ranking
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
=# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
  ts_rank
-----------
 0.0607927

• ts_rank_cd
– Proximity ranking
=# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
 ts_rank_cd
------------
        0.1

Ranking
• Structural ranking
– Query
select ts_rank(array[0.1, 0.1, 0.9, 0.1],
       setweight(to_tsvector('All about search'), 'B') ||
       setweight(to_tsvector('Free text seaRCh is a wonderfulThing'), 'A'),
       to_tsquery('wonderful & search'));
– Result
 ts_rank
 0.328337

PostgreSQL Extension



pg_trgm
• Uses an index built from trigrams: groups of 3 consecutive characters taken from a string.
• Finds string similarity by comparing trigrams.
• Provides GiST and GIN index operator classes for creating indexes.
CREATE INDEX <idx> ON <tbl> USING gist (<col> gist_trgm_ops);
CREATE INDEX <idx> ON <tbl> USING gin (<col> gin_trgm_ops);

• Problems:
− No partial match algorithm
− Slow when the search key has fewer than 3 characters, because GIN_SEARCH_MODE_ALL (a scan of the whole index) is used

pg_bigm
• PostgreSQL module which provides full text search capability using a 2-gram index.
• Based on pg_trgm
• First released in April 2013. Version 1.1 to be released soon.
• Developed by NTT DATA
• Site: http://sourceforge.jp/projects/pgbigm/

Difference

Feature                        pg_trgm                          pg_bigm
Method of full text search     3-gram                           2-gram
Example ('abcd')               "  a", " ab", abc, bcd, "cd "    " a", ab, bc, cd, "d "
Available index                GIN and GiST                     GIN only
1-2 character keyword search   Slow                             Fast
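
The last row of the comparison follows from how many n-grams a short keyword yields. The sketch below assumes that a pattern such as '%ab%' contributes only n-grams lying entirely inside the keyword (no space padding, since a wildcard sits on both sides); the helper name is hypothetical.

```python
def pattern_ngrams(keyword, n):
    """n-grams fully contained in the keyword of a LIKE '%keyword%' pattern."""
    return [keyword[i:i + n] for i in range(len(keyword) - n + 1)]


for keyword in ("ab", "abc"):
    print(keyword, pattern_ngrams(keyword, 3), pattern_ngrams(keyword, 2))
```

A 2-character keyword yields no trigram at all, so a 3-gram index has nothing to look up and must consider every entry, while a 2-gram index still has one key to probe.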


Install pg_bigm
• Download tar.gz file from the site
• Install pg_bigm
$ make USE_PGXS=1
$ su
# make USE_PGXS=1 install

• Register: set the postgresql.conf variables:
– shared_preload_libraries = 'pg_bigm'
– custom_variable_classes = 'pg_bigm' (only in 9.1)

• Load into the required database
=# CREATE EXTENSION pg_bigm;

Function – show_bigm
Argument: Search string
Return value: Array of all 2-gram strings contained in the search string

Procedure:
• For each word perform the following:
    • Add a space character before and after the word
    • Moving from left to right, extract strings in units of 2 characters
=# SELECT show_bigm('ab');
   show_bigm
----------------
 {" a",ab,"b "}
(1 row)
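
The procedure above translates almost directly into Python. This is a sketch of the described steps, not pg_bigm's implementation; it assumes that words are separated by whitespace and returns the 2-grams sorted and without duplicates.

```python
def show_bigm(text):
    """Return the sorted, de-duplicated 2-grams of every space-padded word."""
    grams = set()
    for word in text.split():
        padded = f" {word} "                      # add a space before and after
        grams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return sorted(grams)


print(show_bigm("ab"))
print(show_bigm("cat"))
```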



Function - likequery
Argument: Search string
Return value: The search string converted into a pattern that LIKE can use for full text search

Procedure:
• Add % to the beginning and the end of the search string.
• Add a backslash (\) before every underscore (_), percent (%) and backslash (\) present in the search string.
=# SELECT likequery('pg_bigm ppt');
   likequery
----------------
 %pg\_bigm ppt%
(1 row)
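
The two steps of the procedure can be sketched as a small Python function. It mirrors the description above and is not taken from the pg_bigm source.

```python
def likequery(keyword):
    """Escape LIKE metacharacters and wrap the keyword in % wildcards."""
    escaped = "".join("\\" + ch if ch in "\\%_" else ch for ch in keyword)
    return f"%{escaped}%"


print(likequery("pg_bigm ppt"))
print(likequery("100%"))
```

Escaping matters because an unescaped underscore would otherwise match any single character, and an unescaped percent sign any run of characters.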



Creation of Index
• Only GIN is supported
• Create the index on a text column of a table
CREATE INDEX <index_name> ON <table> USING gin (<column> gin_bigm_ops);

Table
TID   Data
1     cat
5     mat

Generate bigrams
cat - " c", at, ca, "t "
mat - " m", at, ma, "t "

Index
Key    TID
" c"   1
" m"   5
at     1, 5
ca     1
ma     5
"t "   1, 5

Full text search Query
SELECT * FROM <tbl> WHERE <col> LIKE likequery('<word>');
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('cat');
                              QUERY PLAN
------------------------------------------------------------------
 Bitmap Heap Scan on tbl (cost=12.00..16.01 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
   Recheck Cond: (col ~~ '%cat%'::text)
   -> Bitmap Index Scan on idx (cost=0.00..12.00 rows=1 width=0)
         Index Cond: (col ~~ '%cat%'::text)
(5 rows)

Full text search Query
• Search key: cat
• Generate bigrams from the search key
• Index lookup

Key    TID
" c"   1
" m"   5
at     1, 5
ca     1
ma     5
"t "   1, 5

• Result Candidates

TID   Data
1     cat

• Perform Recheck
• Final Result

TID   Data
1     cat

Why Recheck?
• Removes wrong results from the result candidates of the index scan.
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('trial');
                              QUERY PLAN
---------------------------------------------------------------------
 Bitmap Heap Scan on tbl (cost=24.00..28.01 rows=1 width=5)
   Recheck Cond: (col ~~ '%trial%'::text)
   Rows Removed by Index Recheck: 1
   -> Bitmap Index Scan on idx (cost=0.00..24.00 rows=1 width=0)
         Index Cond: (col ~~ '%trial%'::text)
(6 rows)

Why Recheck?

Table
TID   Data
1     trial
2     trivial

Bigrams
trial   - " t", al, ia, "l ", ri, tr
trivial - " t", al, ia, iv, "l ", ri, tr, vi

Index
Key    TID
" t"   1, 2
al     1, 2
ia     1, 2
iv     2
"l "   1, 2
ri     1, 2
tr     1, 2
vi     2

Search 'trial'
• Index scan returns the candidates

TID   Data
1     trial
2     trivial

• Recheck keeps only the rows that really match

TID   Data
1     trial
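
The trial/trivial case can be replayed in Python. The sketch below builds the bigram index shown above, looks up the keyword's bigrams and then rechecks each candidate against the real data. It assumes single-word rows and that the keyword of a '%trial%' pattern is split into bigrams without space padding.

```python
from collections import defaultdict

rows = {1: "trial", 2: "trivial"}


def bigrams(text, pad=True):
    """2-grams of the text, optionally padded with one space on each side."""
    s = f" {text} " if pad else text
    return {s[i:i + 2] for i in range(len(s) - 1)}


# Index step: bigram -> set of TIDs whose data contains that bigram.
index = defaultdict(set)
for tid, data in rows.items():
    for gram in bigrams(data):
        index[gram].add(tid)


def search(keyword):
    candidates = set(rows)
    for gram in bigrams(keyword, pad=False):
        candidates &= index.get(gram, set())
    # Recheck step: keep only rows that really contain the keyword.
    final = {tid for tid in candidates if keyword in rows[tid]}
    return sorted(candidates), sorted(final)


print(search("trial"))
```

Both rows contain tr, ri, ia and al, so the index alone cannot tell them apart; only the recheck sees that "trivial" does not contain "trial".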


Disabling Recheck
Parameter - enable_recheck
• Set to off to disable Recheck and get all the results retrieved by the index scan
• Values: on/off
=# SET pg_bigm.enable_recheck = on;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
         doc
----------------------
 He is awaiting trial
(1 row)
=# SET pg_bigm.enable_recheck = off;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
           doc
--------------------------
 He is awaiting trial
 It was a trivial mistake
(2 rows)

pg_bigm Full Text Search Sample
=# CREATE TABLE tbl (col text);
=# CREATE INDEX tbl_idx ON tbl USING gin (col gin_bigm_ops);
=# INSERT INTO tbl VALUES
   ('He is awaiting trial'),
   ('Those orchids are very special to her '),
   ('pg_bigm performs full text search using 2 gram index'),
   ('pg_trgm performs full text search using 3 gram index');

=# SELECT * FROM tbl WHERE col LIKE likequery('full text search');
                         col
------------------------------------------------------
 pg_bigm performs full text search using 2 gram index
 pg_trgm performs full text search using 3 gram index
(2 rows)

Similarity Search



Function – bigm_similarity
Arguments: The two strings whose similarity is to be checked
Return value: The similarity of the two arguments (0 to 1)

• Measures the similarity of two strings by counting the number of 2-grams they share.
=# SELECT bigm_similarity('test', 'text');
 bigm_similarity
-----------------
             0.6
(1 row)
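
Counting shared 2-grams can be sketched as follows. The normalization is an assumption: dividing the number of shared 2-grams by the size of the larger 2-gram set reproduces the 0.6 shown above ('test' and 'text' each have five 2-grams and share three), but the slides do not spell the formula out.

```python
def bigrams(text):
    """Distinct 2-grams of every space-padded word in the text."""
    grams = set()
    for word in text.split():
        padded = f" {word} "
        grams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams


def bigm_similarity(a, b):
    """Shared 2-grams divided by the size of the larger 2-gram set (assumed)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


print(bigm_similarity("test", "text"))
```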



Parameter - similarity_limit
• Specifies the threshold used for the similarity search
• Search returns rows with similarity value >= similarity_limit
• Default: 0.3
• The SET command can be used to modify the value.
=# SHOW pg_bigm.similarity_limit;
 pg_bigm.similarity_limit
--------------------------
 0.3
(1 row)
=# SET pg_bigm.similarity_limit = 0.5;

Similarity Operator - =%
• Used to perform similarity search
• Uses the full text search index
• Returns rows whose similarity is higher than or equal to the value of pg_bigm.similarity_limit
SELECT * FROM <tbl> WHERE <col> =% '<key>';

References
• PostgreSQL documents
• wiki.postgresql.org

• Understanding Full Text Search
• http://linuxgazette.net/164/sephton.html
• http://www.slideshare.net/billkarwin/full-text-search-in-postgresql
• Understanding pg_bigm
• pgbigm.sourceforge.jp
• www.slideshare.net/masahikosawada98/pg-bigm

