Full text Search

Rahila Syed
Beena Emerson

© 2013 NTT DATA, Inc.
Index
• Full text search and its types
• Full text search in PostgreSQL
• PostgreSQL extension
• Similarity Search

Full Text Search

What is full text search?
• Searching for a group of keywords in a pile of texts
– Document
– Query
– Similarity
• Full text search in database
– Searching for a set of keywords in a text field of a database table
– The data used for full text search can be huge
– Indexing words and associating indexed words with documents

Full Text Search in PostgreSQL

Steps
• Creating tokens
– Parsing the document into a set of tokens such as numbers, words, complex words and email addresses
• Creating lexemes
– Normalization, controlled by the dictionary:
   • Removal of suffixes: converts variants into a single form (worry, worries, worried, etc.)
   • Conversion to lower case
   • Removal of stop words: common words that are useless for searching (the, at, etc.)
• Storing preprocessed documents
– Storing documents and creating indexes over them for faster search
• Relevance ranking
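
The token and lexeme steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the stop-word list and the plural-stripping `stem` helper are made up for the example and are far cruder than PostgreSQL's parser and Snowball dictionaries. It shows one point worth noticing: stop words are dropped, but the remaining lexemes keep their original positions.

```python
import re

# Assumption: a tiny illustrative stop-word list, not PostgreSQL's 'english' list.
STOP_WORDS = {"a", "at", "be", "in", "is", "of", "the", "this", "to"}


def stem(word):
    """Toy stemmer (hypothetical): strip a single plural 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def to_lexemes(document):
    """Return {lexeme: [positions]}; positions are 1-based and count every token."""
    lexemes = {}
    tokens = re.findall(r"[A-Za-z0-9]+", document)  # step 1: create tokens
    for position, token in enumerate(tokens, start=1):
        word = token.lower()                         # step 2: lower-case
        if word in STOP_WORDS:
            continue                                 # stop word: dropped, position still consumed
        lexemes.setdefault(stem(word), []).append(position)
    return lexemes


print(to_lexemes("Glad to be part of this meetup"))
```

A real dictionary chain does much more (for example, it reduces "party" to "parti"), so treat the helper names and rules here as placeholders.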

Full text search in PostgreSQL

• Full integration
• 27 built-in configurations for 10 languages
• Support of user-defined FTS configurations
• Pluggable dictionaries ( ispell, snowball, thesaurus ), parsers
• Relevance ranking
• GIN and GiST index

Full text search in PostgreSQL

Morphological search
• Indexed tokens are words of a language
• E.g. tree, book, rain
• Small index size
• Good with orthographical variants
• Search results depend on the division of words
• Used for large documents such as theses
• Ex. tsvector

N-gram search
• Indexed tokens are character sequences
• E.g. _t, tr, re, e_ (2-grams)
• Big index size
• Cannot match orthographical variants
• Results closer to indexed LIKE
• Better suited for a limited set of words
• Ex. pg_bigm, pg_trgm

Why full text search?
• Search similar words (no linguistic support)
• Ranking of search results
• Searches substrings
– Ordinary indexes do not support substring search
– The LIKE operator cannot use an index when the pattern begins with %
– Low performance compared to full text search using GIN and GiST
• Accuracy issue
E.g. LIKE '%one%' matches prone, money, lonely

Measurement results
• POSIX regular expression
=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc ~ 'postgresql';
                               QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=10.871..390.019 rows=250 loops=1)
   Filter: (doc ~ 'postgresql'::text)
   Rows Removed by Filter: 11397
 Total runtime: 390.060 ms

• LIKE query
=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc LIKE '%postgresql%';
                               QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=1.342..110.107 rows=250 loops=1)
   Filter: (doc ~~ '%postgresql%'::text)
   Rows Removed by Filter: 11397
 Total runtime: 110.134 ms

Measurement results
• Full text search
 Nested Loop (cost=352.83..508.22 rows=107 width=64) (actual time=1.397..1.575 rows=250 loops=1)
   -> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=0.023..0.023 rows=1 loops=1)
   -> Bitmap Heap Scan on full_text_search (cost=352.83..507.14 rows=107 width=32) (actual time=1.371..1.516 rows=250 loops=1)
        Recheck Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
        -> Bitmap Index Scan on full_search_idx (cost=0.00..352.80 rows=107 width=0) (actual time=1.354..1.354 rows=348 loops=1)
             Index Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
 Total runtime: 1.619 ms

Ranking Example
Normal Search:
SELECT * FROM tbl WHERE col1 LIKE 'The tiger is the largest cat species';
                 col1
--------------------------------------
 The tiger is the largest cat species
(1 row)

Full Text Search:
SELECT col1, similarity(col1, 'The tiger is the largest cat species') AS sml
FROM tbl_t WHERE col1 % 'The tiger is the largest cat species'
ORDER BY sml DESC, col1;
                  col1                   |   sml
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
 The cheetah is the fastest cat species  | 0.466667
(3 rows)

Indexes Used in Full Text Search
• GIN (Generalized Inverted Index)
• Custom strategies for particular data types
• Inverted indexes
• Interface for custom data types
• Slower to update
• Deterministic
• Appropriate for fixed data sets

 Key    | TID
--------+----------
 Meetup | 100, 140
 Pune   | 100, 150
 Here   | 100

Indexes Used in Full Text Search
• GiST (Generalized Search Tree)
• Interface for data types and access methods
• Document is represented in the index by a fixed-length signature
• Based on hash tables
• Probability of false match
• Table row must be retrieved to see if the match is correct
• Inappropriate for large data sets
• Filtering data at the end of index search to remove false matches
EXPLAIN SELECT * FROM tab WHERE textsearch @@ to_tsquery('Mountain');
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using text_search_idx on tab (cost=0.00..12.29 rows=2 width=1469)
   Index Cond: (textsearch @@ '''Mountain'''::tsquery)
   Filter: (textsearch @@ '''Mountain'''::tsquery)

tsvector
• Representation of a document best suited for full text search
• Normalized lexemes formed by pre-processing the document
• Function to convert normal text to tsvector:
• to_tsvector
to_tsvector([ config regconfig, ] document text) returns tsvector
=# SELECT to_tsvector('english', 'Glad to be part of this meetup');
         to_tsvector
------------------------------
 'glad':1 'meetup':7 'part':4
(1 row)

• The query above specifies 'english' as the configuration used to parse and normalize the string. The default_text_search_config value is used if the configuration parameter is omitted.

tsquery
• Representation of a search query best suited for full text search
• Normalized lexemes formed by processing the query
• Lexemes may be combined using the AND, OR, or NOT operators
• All keywords used for search

tsquery
• Functions to convert normal text to tsquery:
• to_tsquery
to_tsquery([ config regconfig, ] querytext text) returns tsquery
=# SELECT to_tsquery('meetups & in & ! Pune');
     to_tsquery
--------------------
 'meetup' & !'pune'
(1 row)

• plainto_tsquery
plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
=# SELECT plainto_tsquery ('english','meetups in Pune');
  plainto_tsquery
-------------------
 'meetup' & 'pune'
(1 row)

Match operator @@
• Checks a tsvector (document) against a tsquery (search words)
• Returns true if the tsvector satisfies the tsquery; for an AND query such as the one plainto_tsquery builds, all tsquery elements must be present in the tsvector of the document
=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('PostgreSQL Meetups');
 ?column?
----------
 t
(1 row)
=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('Pune meetup');
 ?column?
----------
 f
(1 row)

Full text search without index
SELECT * FROM <table> WHERE
to_tsvector('<config>', <colname>) @@ to_tsquery('<config>', '<search word>');

The configuration parameter of the functions to_tsvector and to_tsquery should be the same.
Example:
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ to_tsquery('english', 'enjoy');
              col
--------------------------------
 He enjoyed the party
 He enjoys the classical music.
(2 rows)

Full text search using index
• Creating the index
CREATE INDEX <index_name> ON <table> USING gin(to_tsvector('<config>', <col>));

• Performing search using the index:
SELECT * FROM <table> WHERE to_tsvector('<config>', <col>) @@ plainto_tsquery('<config>','<search word>');

Example:
=# CREATE INDEX idx ON tbl USING gin(to_tsvector('english', col));
=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ plainto_tsquery('english','enjoy');
              col
--------------------------------
 He enjoyed the party
 He enjoys the classical music.
(2 rows)

Full text search using separate column
• Procedure
– Create a column of tsvector type
– Define a trigger which automatically updates the tsvector column
– Perform the search on the tsvector column

• Advantages:
– No need to specify the text search configuration in every query in order to make use of the index
– Faster searches, as the to_tsvector function is not called for each search query

Full text search using separate column
Example:
=# CREATE TABLE tbl (col text, tsv_col tsvector);
=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON tbl FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
=# INSERT INTO tbl VALUES ('He enjoyed the party'),('He enjoys the classical music.'),('The moon winked at him');
=# SELECT * FROM tbl;
              col               |             tsv_col
--------------------------------+---------------------------------
 He enjoyed the party           | 'enjoy':2 'parti':4
 He enjoys the classical music. | 'classic':4 'enjoy':2 'music':5
 The moon winked at him         | 'moon':2 'wink':3
(3 rows)

Full text search using separate column
Example:
=# CREATE TABLE tbl (col text, tsv_col tsvector);
=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON tbl FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
=# INSERT INTO tbl VALUES ('He enjoyed the party'),('He enjoys the classical music.'),('The moon winked at him');
=# SELECT col FROM tbl WHERE tsv_col @@ to_tsquery('enjoys');
              col
--------------------------------
 He enjoyed the party
 He enjoys the classical music.
(2 rows)

Ranking
• ts_rank
– Lexical ranking
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
=# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
  ts_rank
-----------
 0.0607927

• ts_rank_cd
– Proximity ranking
=# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
 ts_rank_cd
------------
        0.1

Ranking
• Structural ranking
– Query
select ts_rank( array[0.1,0.1,0.9,0.1],
  setweight(to_tsvector('All about search'), 'B') ||
  setweight(to_tsvector('Free text seaRCh is a wonderful Thing'),'A'),
  to_tsquery('wonderful & search'));
– Result
 ts_rank
 0.328337

PostgreSQL Extension

pg_trgm
• Uses an index built from trigrams: 3 consecutive characters from a string
• Finds string similarity by comparing the trigrams
• Provides GiST and GIN index operator classes to create the index
CREATE INDEX <idx> ON <tbl> USING gist(<col> gist_trgm_ops);
CREATE INDEX <idx> ON <tbl> USING gin (<col> gin_trgm_ops);

• Problems:
− No partial match algorithm
− Slow when the search key is shorter than 3 characters (GIN_SEARCH_MODE_ALL is used)

pg_bigm
• PostgreSQL module which provides full text search capability using a 2-gram index
• Based on pg_trgm
• First released in April 2013. Version 1.1 to be released soon.
• Developed by NTT DATA
• Site: http://sourceforge.jp/projects/pgbigm/

Difference

 Feature                      | pg_trgm                | pg_bigm
------------------------------+------------------------+------------------------
 Method of full text search   | 3-gram                 | 2-gram
                              | "  a", " ab", abc, bcd | " a", ab, bc, cd, "d "
 Available index              | GIN and GiST           | GIN only
 1-2 character keyword search | Slow                   | Fast

Install pg_bigm
• Download the tar.gz file from the site
• Install pg_bigm
$ make USE_PGXS=1
$ su
# make USE_PGXS=1 install

• Register: set the postgresql.conf variables
– shared_preload_libraries = 'pg_bigm'
– custom_variable_classes = 'pg_bigm' (only in 9.1)

• Load into the required database
=# CREATE EXTENSION pg_bigm;

Function – show_bigm
Argument: Search string
Return value: Array of all possible 2-gram character strings

Procedure:
• For each word perform the following:
• Add a space character before and after the text
• Moving from left to right, extract strings in units of 2 characters
=# SELECT show_bigm('ab');
   show_bigm
----------------
 {" a",ab,"b "}
(1 row)
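
The procedure above is easy to mimic in Python. The sketch below is not pg_bigm's code: it pads the whole input with one space on each side and slides a 2-character window over it. Returning the result sorted and de-duplicated is an assumption, chosen because it matches the ordering of the outputs shown in this deck.

```python
def show_bigm(text):
    """Return the sorted, de-duplicated 2-grams of `text` padded with a space on each side."""
    padded = " " + text + " "
    # Slide a 2-character window from left to right over the padded string.
    grams = {padded[i:i + 2] for i in range(len(padded) - 1)}
    return sorted(grams)


print(show_bigm("ab"))     # [' a', 'ab', 'b ']
print(show_bigm("trial"))  # [' t', 'al', 'ia', 'l ', 'ri', 'tr']
```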

Function - likequery
Argument: Search string
Return value: String in a pattern to be used in LIKE for full-text search

Procedure:
• Add % to the beginning and the end of the search string.
• Add a backslash (\) before every underscore (_), percent (%) and backslash (\) present in the search string.
=# SELECT likequery ('pg_bigm ppt');
    likequery
-----------------
 %pg\_bigm ppt%
(1 row)
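
The same escaping can be written as a small Python helper, which may be handy when building LIKE patterns on the client side. This is a sketch of the two steps listed above, not pg_bigm's own implementation; the only subtlety is that backslashes must be escaped first so that the backslashes added for % and _ are not doubled.

```python
def likequery(search):
    """Escape LIKE metacharacters in `search` and wrap the result in % wildcards."""
    escaped = (
        search.replace("\\", "\\\\")  # escape backslashes first
        .replace("%", "\\%")          # then literal percent signs
        .replace("_", "\\_")          # then literal underscores
    )
    return "%" + escaped + "%"


print(likequery("pg_bigm ppt"))  # %pg\_bigm ppt%
```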

Creation of Index
• Only GIN is supported
• Create the index on the text column of a table
CREATE INDEX <index_name> ON <table> USING gin (<column> gin_bigm_ops);

Table
 TID | Data
-----+------
 1   | cat
 5   | mat

Generate bigrams
cat - " c", at, ca, "t "
mat - " m", at, ma, "t "

Index
 Key  | TID
------+------
 " c" | 1
 " m" | 5
 at   | 1, 5
 ca   | 1
 ma   | 5
 "t " | 1, 5

Full text search Query
SELECT * FROM <tbl> WHERE <col> LIKE likequery('<word>');
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('cat');
                            QUERY PLAN
-------------------------------------------------------------------
 Bitmap Heap Scan on tbl (cost=12.00..16.01 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
   Recheck Cond: (col ~~ '%cat%'::text)
   -> Bitmap Index Scan on idx (cost=0.00..12.00 rows=1 width=0) (actual time=0.025..0.025 rows=1 loops=1)
        Index Cond: (col ~~ '%cat%'::text)
 Total runtime: 0.093 ms
(5 rows)

Full text search Query
Search key: cat

1. Generate bigrams from the search key
2. Index lookup
 Key  | TID
------+------
 " c" | 1
 " m" | 5
 at   | 1, 5
 ca   | 1
 ma   | 5
 "t " | 1, 5
3. Result Candidates
 TID | Data
-----+------
 1   | cat
4. Perform Recheck
5. Final Result
 TID | Data
-----+------
 1   | cat

Why Recheck?
• Removes wrong results from the result candidates of the index scan.
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('trial');
                            QUERY PLAN
-------------------------------------------------------------------
 Bitmap Heap Scan on tbl (cost=24.00..28.01 rows=1 width=5) (actual time=0.060..0.060 rows=1 loops=1)
   Recheck Cond: (col ~~ '%trial%'::text)
   Rows Removed by Index Recheck: 1
   -> Bitmap Index Scan on idx (cost=0.00..24.00 rows=1 width=0) (actual time=0.043..0.043 rows=2 loops=1)
        Index Cond: (col ~~ '%trial%'::text)
 Total runtime: 0.117 ms
(6 rows)

Why Recheck?
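Table
 TID | Data
-----+---------
 1   | trial
 2   | trivial

Generate bigrams
trial   - " t",al,ia,"l ",ri,tr
trivial - " t",al,ia,iv,"l ",ri,tr,vi

Index
 Key  | TID
------+------
 " t" | 1, 2
 al   | 1, 2
 ia   | 1, 2
 iv   | 2
 "l " | 1, 2
 ri   | 1, 2
 tr   | 1, 2
 vi   | 2

Search 'trial'

Index scan (result candidates)
 TID | Data
-----+---------
 1   | trial
 2   | trivial

Recheck (final result)
 TID | Data
-----+-------
 1   | trial
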
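
The index scan followed by a recheck can be imitated with a toy inverted index in Python. This is a sketch under stated assumptions, not pg_bigm's implementation: rows are indexed by their space-padded bigrams, the search key of a '%key%' pattern is assumed to be split without padding, and the candidate set is the intersection of the posting lists. The recheck is then a plain substring test on the stored text.

```python
from collections import defaultdict


def bigrams(text, pad=True):
    """Set of 2-grams of `text`; padded with spaces when indexing a stored value."""
    s = " " + text + " " if pad else text
    return {s[i:i + 2] for i in range(len(s) - 1)}


def build_index(rows):
    """rows: {tid: text}. Returns {bigram: set of tids}, a toy inverted index."""
    index = defaultdict(set)
    for tid, text in rows.items():
        for gram in bigrams(text):
            index[gram].add(tid)
    return index


def search(rows, index, key):
    """Return (candidates from the index scan, rows that survive the recheck)."""
    candidates = set(rows)
    for gram in bigrams(key, pad=False):  # assumption: no padding for '%key%'
        candidates &= index.get(gram, set())
    matches = [tid for tid in sorted(candidates) if key in rows[tid]]  # recheck
    return sorted(candidates), matches


rows = {1: "trial", 2: "trivial"}
index = build_index(rows)
print(search(rows, index, "trial"))  # ([1, 2], [1])
```

Both rows contain every bigram of the key (tr, ri, ia, al), so the index alone cannot tell them apart; only the recheck notices that "trivial" does not contain "trial" as a substring.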
Disabling Recheck
Parameter - enable_recheck
• Set to off to disable Recheck and get all the results retrieved by the index scan
• Values: on/off
=# SET pg_bigm.enable_recheck = on;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
           doc
--------------------------
 He is awaiting trial
(1 row)
=# SET pg_bigm.enable_recheck = off;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
           doc
--------------------------
 He is awaiting trial
 It was a trivial mistake
(2 rows)

pg_bigm Full Text Search Sample
=# CREATE TABLE tbl (col text);
=# CREATE INDEX tbl_idx ON tbl USING gin (col gin_bigm_ops);
=# INSERT INTO tbl VALUES
('He is awaiting trial'),
('Those orchids are very special to her '),
('pg_bigm performs full text search using 2 gram index'),
('pg_trgm performs full text search using 3 gram index');

=# SELECT * FROM tbl WHERE col LIKE likequery('full text search');
                         col
------------------------------------------------------
 pg_bigm performs full text search using 2 gram index
 pg_trgm performs full text search using 3 gram index
(2 rows)

Similarity Search

Function – bigm_similarity
Arguments: The 2 strings whose similarity is to be checked
Return value: The similarity value of the two arguments (0 - 1)

• Measures the similarity of two strings by counting the number of 2-grams they share.
=# SELECT bigm_similarity ('test','text');
 bigm_similarity
-----------------
             0.6
(1 row)
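
A worked version of that count in Python makes the 0.6 less mysterious. The deck does not state the exact formula, so the normalization here is an assumption: the number of shared bigrams is divided by the size of the larger bigram set, which reproduces the values shown in this deck (0.6 for 'test' and 'text', 0.333333 for 'test' and 'treat').

```python
def bigrams(text):
    """Distinct 2-grams of `text` padded with one space on each side."""
    padded = " " + text + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigm_similarity(a, b):
    """Shared bigrams divided by the larger bigram count (assumed normalization)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


# 'test' -> {" t", te, es, st, "t "} and 'text' -> {" t", te, ex, xt, "t "}
# share 3 of 5 bigrams.
print(bigm_similarity("test", "text"))             # 0.6
print(round(bigm_similarity("test", "treat"), 6))  # 0.333333
```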

Parameter - similarity_limit
• Specifies the threshold used for the similarity search
• Search returns rows with similarity value >= similarity_limit
• Default: 0.3
• The SET command can be used to modify the value.
=# SHOW pg_bigm.similarity_limit;
 pg_bigm.similarity_limit
--------------------------
 0.3
(1 row)
=# SET pg_bigm.similarity_limit = 0.5;

Similarity Operator - =%
• Used to perform similarity search
• Uses the full text search index
• Returns rows whose similarity is higher than or equal to the value of pg_bigm.similarity_limit
SELECT * FROM <tbl> WHERE <col> =% '<key>';

Similarity Search Sample
=# SET pg_bigm.similarity_limit = 0.2;
=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
  col  | bigm_similarity
-------+-----------------
 test  |               1
 text  |             0.6
 treat |        0.333333
(3 rows)
=# SET pg_bigm.similarity_limit = 0.5;
=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
 col  | bigm_similarity
------+-----------------
 test |               1
 text |             0.6
(2 rows)

References
• PostgreSQL documents
• wiki.postgresql.org

• Understanding Full Text Search
• http://linuxgazette.net/164/sephton.html
• http://www.slideshare.net/billkarwin/full-text-search-in-postgresql
• Understanding pg_bigm
• pgbigm.sourceforge.jp
• www.slideshare.net/masahikosawada98/pg-bigm

© 2013 NTT DATA, Inc.

45
© 2013 NTT DATA, Inc.

More Related Content

Viewers also liked

PostgreSQLレプリケーション徹底紹介
PostgreSQLレプリケーション徹底紹介PostgreSQLレプリケーション徹底紹介
PostgreSQLレプリケーション徹底紹介Masao Fujii
 
データサイエンティスト協会 木曜勉強会 #09 『意志の力が拓くシステム~最適化の適用事例から見たデータ活用システムの現在と未来~』
データサイエンティスト協会 木曜勉強会 #09 『意志の力が拓くシステム~最適化の適用事例から見たデータ活用システムの現在と未来~』データサイエンティスト協会 木曜勉強会 #09 『意志の力が拓くシステム~最適化の適用事例から見たデータ活用システムの現在と未来~』
データサイエンティスト協会 木曜勉強会 #09 『意志の力が拓くシステム~最適化の適用事例から見たデータ活用システムの現在と未来~』The Japan DataScientist Society
 
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)Uptime Technologies LLC (JP)
 

Viewers also liked (15)

10大ニュースで振り返るPGCon2015
10大ニュースで振り返るPGCon201510大ニュースで振り返るPGCon2015
10大ニュースで振り返るPGCon2015
 
perfを使ったPostgreSQLの解析(前編)
perfを使ったPostgreSQLの解析(前編)perfを使ったPostgreSQLの解析(前編)
perfを使ったPostgreSQLの解析(前編)
 
PostreSQL監査
PostreSQL監査PostreSQL監査
PostreSQL監査
 
使ってみませんか?pg_hint_plan
使ってみませんか?pg_hint_plan使ってみませんか?pg_hint_plan
使ってみませんか?pg_hint_plan
 
PostgreSQL: XID周回問題に潜む別の問題
PostgreSQL: XID周回問題に潜む別の問題PostgreSQL: XID周回問題に潜む別の問題
PostgreSQL: XID周回問題に潜む別の問題
 
PostgreSQL9.3新機能紹介
PostgreSQL9.3新機能紹介PostgreSQL9.3新機能紹介
PostgreSQL9.3新機能紹介
 
PostgreSQL replication
PostgreSQL replicationPostgreSQL replication
PostgreSQL replication
 
JSONBはPostgreSQL9.5でいかに改善されたのか
JSONBはPostgreSQL9.5でいかに改善されたのかJSONBはPostgreSQL9.5でいかに改善されたのか
JSONBはPostgreSQL9.5でいかに改善されたのか
 
PostgreSQLコミュニティに飛び込もう
PostgreSQLコミュニティに飛び込もうPostgreSQLコミュニティに飛び込もう
PostgreSQLコミュニティに飛び込もう
 
PostgreSQLレプリケーション徹底紹介
PostgreSQLレプリケーション徹底紹介PostgreSQLレプリケーション徹底紹介
PostgreSQLレプリケーション徹底紹介
 
PostgreSQL 9.5 新機能紹介
PostgreSQL 9.5 新機能紹介PostgreSQL 9.5 新機能紹介
PostgreSQL 9.5 新機能紹介
 
PostgreSQLの運用・監視にまつわるエトセトラ
PostgreSQLの運用・監視にまつわるエトセトラPostgreSQLの運用・監視にまつわるエトセトラ
PostgreSQLの運用・監視にまつわるエトセトラ
 
PostgreSQLレプリケーション徹底紹介
PostgreSQLレプリケーション徹底紹介PostgreSQLレプリケーション徹底紹介
PostgreSQLレプリケーション徹底紹介
 
データサイエンティスト協会 木曜勉強会 #09 『意志の力が拓くシステム~最適化の適用事例から見たデータ活用システムの現在と未来~』
データサイエンティスト協会 木曜勉強会 #09 『意志の力が拓くシステム~最適化の適用事例から見たデータ活用システムの現在と未来~』データサイエンティスト協会 木曜勉強会 #09 『意志の力が拓くシステム~最適化の適用事例から見たデータ活用システムの現在と未来~』
データサイエンティスト協会 木曜勉強会 #09 『意志の力が拓くシステム~最適化の適用事例から見たデータ活用システムの現在と未来~』
 
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
PostgreSQLアーキテクチャ入門(PostgreSQL Conference 2012)
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 

Recently uploaded (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Full text search

  • 1. Full text Search Rahila Syed Beena Emerson © 2013 NTT DATA, Inc.
  • 2. Index • Full text search and its types • Full text search in PostgreSQL • PostgreSQL extension • Similarity Search © 2013 NTT DATA, Inc. 2
  • 3. Full Text Search © 2013 NTT DATA, Inc. 3
  • 4. What is full text search? • Searching for a group of keywords in a pile of texts – Document – Query – Similarity • Full text search in database – Searching for a set of keywords in a text field of a database table – The data used for full text search can be huge – Indexing words and associating indexed words with documents © 2013 NTT DATA, Inc. 4
  • 5. Full Text Search in PostgreSQL © 2013 NTT DATA, Inc. 5
  • 6. Steps • Creating Tokens – Parsing document into set of tokens like numbers, words, complex words, email addresses. • Creating Lexemes – Normalization: Dictionary controls this. • Removal of suffixes – converts variants into a single form (worry, worries, worried, etc.) • Conversion to lower case • Remove stop words – common words useless for searching (the, at etc.) • Storing preprocessed documents – Storing documents and creating indexes over them for faster search • Relevance ranking © 2013 NTT DATA, Inc. 6
  • 7. Full text search in PostgreSQL • Full integration • 27 built-in configurations for 10 languages • Support of user-defined FTS configurations • Pluggable dictionaries ( ispell, snowball, thesaurus ), parsers • Relevance ranking • GIN and GiST index © 2013 NTT DATA, Inc. 7
  • 8. Full text search in PostgreSQL
      Morphological search
      • Indexed tokens are words of a language
      • Eg. tree, book, rain
      • Small index size
      • Good in orthographical variants
      • Search results depend on division of words
      • Used for large documents like thesis
      • Ex. tsvector
      N-gram search
      • Indexed tokens are characters
      • Eg. _t, tr, re, e_ (2-grams)
      • Big index size
      • Cannot match orthographical variants
      • Results closer to indexed LIKE
      • Better suited for a limited set of words
      • Ex. pg_bigm, pg_trgm
  • 9. Why full text search? • Search similar words (no linguistic support) • Ranking of search results • Searches substrings – Indexes do not support substring search – The LIKE operator doesn't use an index when the pattern starts with % – Low performance when compared to full text search using GIN and GiST • Accuracy issue, eg. LIKE '%one%' matches prone, money, lonely
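The accuracy issue in the last bullet can be reproduced outside the database. The sketch below compares a plain substring test (roughly what `LIKE '%one%'` does) with a whole-word test; the sample documents and both function names are made up for the illustration.

```python
import re

# Hypothetical sample documents.
DOCS = ["prone to errors", "money matters", "a lonely road", "one more time"]


def like_match(doc, key):
    """Rough equivalent of: doc LIKE '%key%' (case-sensitive substring test)."""
    return key in doc


def word_match(doc, key):
    """Match only when key is a whole word of the document."""
    return key.lower() in re.findall(r"[a-z0-9]+", doc.lower())


like_hits = [d for d in DOCS if like_match(d, "one")]
word_hits = [d for d in DOCS if word_match(d, "one")]

print(like_hits)  # every document, because each one contains the letters "one"
print(word_hits)  # only the document that contains the word "one"
```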
  • 10. Measurement results
      • POSIX expression
        =# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc ~ 'postgresql';
        QUERY PLAN
        --------------------------------------------------------------------------
        Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=10.871..390.019 rows=250 loops=1)
          Filter: (doc ~ 'postgresql'::text)
          Rows Removed by Filter: 11397
        Total runtime: 390.060 ms
      • LIKE query
        =# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc LIKE '%postgresql%';
        QUERY PLAN
        --------------------------------------------------------------------------
        Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=1.342..110.107 rows=250 loops=1)
          Filter: (doc ~~ '%postgresql%'::text)
          Rows Removed by Filter: 11397
        Total runtime: 110.134 ms
  • 11. Measurement results
      • Full text search
        Nested Loop (cost=352.83..508.22 rows=107 width=64) (actual time=1.397..1.575 rows=250 loops=1)
          -> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=0.023..0.023 rows=1 loops=1)
          -> Bitmap Heap Scan on full_text_search (cost=352.83..507.14 rows=107 width=32) (actual time=1.371..1.516 rows=250 loops=1)
               Recheck Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
               -> Bitmap Index Scan on full_search_idx (cost=0.00..352.80 rows=107 width=0) (actual time=1.354..1.354 rows=348 loops=1)
                    Index Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
        Total runtime: 1.619 ms
  • 12. Ranking Example
      Normal search:
        SELECT * FROM tbl WHERE col1 LIKE 'The tiger is the largest cat species';
        col1
        --------------------------------------
        The tiger is the largest cat species
        (1 row)
      Full text search:
        SELECT col1, similarity(col1, 'The tiger is the largest cat species') AS sml FROM tbl_t WHERE col1 % 'The tiger is the largest cat species' ORDER BY sml DESC, col1;
        col1                                    | sml
        ----------------------------------------+----------
        The tiger is the largest cat species    | 1
        The peacock is the largest bird species | 0.511111
        The cheetah is the fastest cat species  | 0.466667
        (3 rows)
  • 13. Indexes Used in Full Text Search
      • GIN (Generalized Inverted Index)
      • Custom strategies for particular data types
      • Inverted indexes
      • Interface for custom data types
      • Slower to update
      • Deterministic
      • Appropriate for fixed data sets
        KEY    | TID
        -------+----------
        Meetup | 100, 140
        Pune   | 100, 150
        Here   | 100
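The KEY/TID table on this slide is an inverted index: each key points at the rows that contain it. A minimal Python sketch of the idea follows. The row contents are assumptions picked so that the resulting index matches the table on the slide, and the real GIN access method is far more elaborate than a dictionary of sets.

```python
from collections import defaultdict

# Hypothetical table rows (TID -> text), chosen to reproduce the slide's table.
ROWS = {100: "Meetup Pune Here", 140: "Meetup", 150: "Pune"}


def build_inverted_index(rows):
    """Map every lower-cased word to the set of TIDs whose text contains it."""
    index = defaultdict(set)
    for tid, text in rows.items():
        for word in text.lower().split():
            index[word].add(tid)
    return index


def search(index, words):
    """Return the TIDs containing all words, by intersecting posting lists."""
    postings = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(ROWS)
print(sorted(index["meetup"]))            # TIDs for the key "meetup"
print(search(index, ["Meetup", "Pune"]))  # TIDs that contain both words
```

Because a lookup only touches the posting lists of the searched keys, an exact answer comes straight from the index, which is the sense in which the slide calls GIN deterministic.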
  • 14. Indexes Used in Full Text Search
      • GiST (Generalized Search Tree)
      • Interface for data types and access methods
      • Document is represented in the index by a fixed-length signature
      • Based on hash tables
      • Probability of false match
      • Table row must be retrieved to see if the match is correct
      • Inappropriate for large data sets
      • Filtering data at the end of index search to remove false matches
        EXPLAIN SELECT * FROM tab WHERE textsearch @@ to_tsquery('Mountain');
        QUERY PLAN
        --------------------------------------------------------------------------
        Index Scan using text_search_idx on tab (cost=0.00..12.29 rows=2 width=1469)
          Index Cond: (textsearch @@ '''Mountain'''::tsquery)
          Filter: (textsearch @@ '''Mountain'''::tsquery)
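The signature idea, and why the table row has to be fetched afterwards, can be shown with a small bitmask sketch. This is not the GiST implementation: the signature width, the use of `zlib.crc32` as the hash and all function names are assumptions for the illustration.

```python
import zlib

SIGNATURE_BITS = 16  # hypothetical signature width, kept small on purpose

# Hypothetical documents.
DOCS = ["the mountain is high", "a river runs deep", "mountain river trail"]


def signature(words):
    """Fold a set of words into one fixed-length bitmask."""
    sig = 0
    for word in words:
        sig |= 1 << (zlib.crc32(word.encode("utf-8")) % SIGNATURE_BITS)
    return sig


def may_contain(doc_sig, query_sig):
    """True if every query bit is set. Different words can share a bit,
    so True means 'possibly', while False means 'definitely not'."""
    return (doc_sig & query_sig) == query_sig


def search(docs, query_words):
    query_sig = signature(query_words)
    # Index step: cheap signature test, may let false matches through.
    candidates = [d for d in docs if may_contain(signature(d.split()), query_sig)]
    # Filter step: look at the real row to throw away false matches.
    return [d for d in candidates if set(query_words) <= set(d.split())]


print(search(DOCS, ["mountain"]))
print(search(DOCS, ["mountain", "river"]))
```

The signature test never misses a true match, so the final filter only has to remove rows, never add them.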
  • 15. tsvector
      • Representation of document best suited for full text search
      • Normalized lexemes formed by pre-processing of the documents
      • Functions to convert normal text to tsvector:
      • to_tsvector
        to_tsvector([ config regconfig, ] document text) returns tsvector
        =# SELECT to_tsvector('english', 'Glad to be part of this meetup');
        to_tsvector
        ------------------------------
        'glad':1 'meetup':7 'part':4
        (1 row)
      • The query above specifies 'english' as the configuration to be used to parse and normalize the strings. The default_text_search_config value will be used if the configuration parameter is omitted.
  • 16. tsquery • Representation of search query best suited for full text search • Normalized lexemes formed by processing the query • Lexemes may be combined using the AND, OR, or NOT operators • All keywords used for search
  • 17. tsquery
      • Functions to convert normal text to tsquery:
      • to_tsquery
        to_tsquery([ config regconfig, ] querytext text) returns tsquery
        =# SELECT to_tsquery('meetups & in & ! Pune');
        to_tsquery
        --------------------
        'meetup' & !'pune'
        (1 row)
      • plainto_tsquery
        plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
        =# SELECT plainto_tsquery('english', 'meetups in Pune');
        plainto_tsquery
        -------------------
        'meetup' & 'pune'
        (1 row)
  • 18. Match operator @@
      • Checks a tsvector (document) against a tsquery (search word)
      • Returns true if all tsquery elements are present in the tsvector of the document
        =# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('PostgreSQL Meetups');
        ?column?
        ----------
        t
        (1 row)
        =# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('Pune meetup');
        ?column?
        ----------
        f
        (1 row)
  • 19. Full text search without index
        SELECT * FROM <table> WHERE to_tsvector('<config>', <colname>) @@ to_tsquery('<config>', '<search word>');
      The configuration parameter of the functions to_tsvector and to_tsquery should be the same.
      Example:
        =# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ to_tsquery('english', 'enjoy');
        col
        --------------------------------
        He enjoyed the party
        He enjoys the classical music.
        (2 rows)
  • 20. Full text search using index
      • Creating the index
        CREATE INDEX <index_name> ON <table> USING gin(to_tsvector('<config>', <col>));
      • Performing search using the index:
        SELECT * FROM <table> WHERE to_tsvector('<config>', <col>) @@ plainto_tsquery('<config>', '<search word>');
      Example:
        =# CREATE INDEX idx ON tbl USING gin(to_tsvector('english', col));
        =# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ plainto_tsquery('english', 'enjoy');
        col
        --------------------------------
        He enjoyed the party
        He enjoys the classical music.
        (2 rows)
  • 21. Full text search using separate column • Procedure – Create a column of tsvector type – Define a trigger which will automatically update the tsvector column – Perform Search on the tsvector column • Advantages: – No need to specify the text search configuration in every query in order to make use of the index – Faster searches as the to_tsvector function will not be called for each search query.
  • 22. Full text search using separate column
      Example:
        =# CREATE TABLE tbl (col text, tsv_col tsvector);
        =# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON tbl FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
        =# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');
        =# SELECT * FROM tbl;
        col                            | tsv_col
        -------------------------------+---------------------------------
        He enjoyed the party           | 'enjoy':2 'parti':4
        He enjoys the classical music. | 'classic':4 'enjoy':2 'music':5
        The moon winked at him         | 'moon':2 'wink':3
        (3 rows)
  • 23. Full text search using separate column
      Example:
        =# CREATE TABLE tbl (col text, tsv_col tsvector);
        =# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON tbl FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);
        =# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');
        =# SELECT col FROM tbl WHERE tsv_col @@ to_tsquery('enjoys');
        col
        --------------------------------
        He enjoyed the party
        He enjoys the classical music.
        (2 rows)
  • 24. Ranking
      • ts_rank – Lexical ranking
        ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
        =# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
        ts_rank
        -----------
        0.0607927
      • ts_rank_cd – Proximity ranking
        =# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
        ts_rank_cd
        ------------
        0.1
  • 25. Ranking
      • Structural ranking
      – Query
        select ts_rank(array[0.1,0.1,0.9,0.1], setweight(to_tsvector('All about search'), 'B') || setweight(to_tsvector('Free text seaRCh is a wonderful Thing'), 'A'), to_tsquery('wonderful & search'));
      – Result
        ts_rank
        ----------
        0.328337
  • 26. PostgreSQL Extension
  • 27. pg_trgm
      • Uses an index made from trigrams – 3 consecutive characters from a string.
      • Finds string similarity by comparing the trigrams.
      • Provides GiST and GIN index operator classes to create an index.
        CREATE INDEX <idx> ON <tbl> USING gist (<col> gist_trgm_ops);
        CREATE INDEX <idx> ON <tbl> USING gin (<col> gin_trgm_ops);
      • Problems:
        − No partial match algorithm
        − Slow when the search key is shorter than 3 characters (GIN_SEARCH_MODE_ALL is used)
  • 28. pg_bigm • PostgreSQL module which provides full text search capability using a 2-gram index. • Based on pg_trgm • First released in April 2013. Version 1.1 to be released soon. • Developed by NTT Data • Site: http://sourceforge.jp/projects/pgbigm/
  • 29. Difference
        Feature                      | pg_trgm                          | pg_bigm
        -----------------------------+----------------------------------+-----------------------------------
        Method of full text search   | 3-gram: "  a", " ab", abc, bcd   | 2-gram: " a", ab, bc, cd, "d "
        Available index              | GIN and GiST                     | GIN only
        1-2 character keyword search | Slow                             | Fast
  • 30. Install pg_bigm
      • Download the tar.gz file from the site
      • Install pg_bigm
        $ make USE_PGXS=1
        $ su
        # make USE_PGXS=1 install
      • Register: set the postgresql.conf variables:
        – shared_preload_libraries = 'pg_bigm'
        – custom_variable_classes = 'pg_bigm' (only in 9.1)
      • Load into the required database
        =# CREATE EXTENSION pg_bigm;
  • 31. Function – show_bigm
      Argument: Search string
      Return value: Array of all possible 2-gram character strings
      Procedure:
      • For each word perform the following:
      • Add a space character before and after the text
      • Moving from left to right, extract strings in units of 2 characters.
        =# SELECT show_bigm('ab');
        show_bigm
        ----------------
        {" a",ab,"b "}
        (1 row)
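The procedure on this slide translates almost line for line into Python. The sketch below is only a re-implementation of the described steps for illustration, not pg_bigm's C code; it reuses the name `show_bigm` for readability and returns a sorted list rather than a PostgreSQL array.

```python
def show_bigm(text):
    """Return the distinct 2-grams of text, following the slide's procedure."""
    bigrams = set()
    for word in text.split():
        padded = " " + word + " "           # add a space before and after
        for i in range(len(padded) - 1):    # slide a 2-character window
            bigrams.add(padded[i:i + 2])
    return sorted(bigrams)


print(show_bigm("ab"))
print(show_bigm("cat"))
```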
  • 32. Function - likequery
      Argument: Search string
      Return value: String in a pattern to be used in LIKE for full-text search
      Procedure:
      • Add % to the beginning and the end of the retrieval string.
      • Add a backslash (\) before every underscore (_), percent (%) and backslash (\) present in the retrieval string.
        =# SELECT likequery('pg_bigm ppt');
        likequery
        ----------------
        %pg\_bigm ppt%
        (1 row)
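The escaping rule is easy to get wrong by hand, so here is a small Python sketch of the two steps described on the slide. It is an illustration of the procedure only; the function merely borrows the name `likequery`.

```python
def likequery(key):
    """Escape LIKE metacharacters in key and wrap the result in % ... %."""
    escaped = (
        key.replace("\\", "\\\\")  # escape backslashes first, so that the
        .replace("%", "\\%")       # backslashes added for % and _ are not
        .replace("_", "\\_")       # escaped a second time
    )
    return "%" + escaped + "%"


print(likequery("pg_bigm ppt"))
```

The order matters: if `%` and `_` were handled before the backslash, the backslashes inserted for them would themselves be doubled.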
  • 33. Creation of Index
      • Only GIN support
      • Create an index on the text column of a table
        CREATE INDEX <index_name> ON <table> USING gin (<column> gin_bigm_ops);
      Table:
        TID | Data
        ----+------
        1   | cat
        5   | mat
      Generate bigrams:
        cat - " c", at, ca, "t "
        mat - " m", at, ma, "t "
      Index:
        Key  | TID
        -----+------
        " c" | 1
        " m" | 5
        at   | 1, 5
        ca   | 1
        ma   | 5
        "t " | 1, 5
  • 34. Full text search Query
        SELECT * FROM <tbl> WHERE <col> LIKE likequery('<word>');
        =# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('cat');
        QUERY PLAN
        -------------------------------------------------------------------
        Bitmap Heap Scan on tbl (cost=12.00..16.01 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
          Recheck Cond: (col ~~ '%cat%'::text)
          -> Bitmap Index Scan on idx (cost=0.00..12.00 rows=1 width=0) (actual time=0.025..0.025 rows=1 loops=1)
               Index Cond: (col ~~ '%cat%'::text)
        Total runtime: 0.093 ms
        (5 rows)
  • 35. Full text search Query
      [Diagram: search key → generate bigrams → index lookup (keys " c", " m", at, ca, ma, "t " with their TIDs) → result candidates → perform recheck → final result (TID 1, Data cat)]
  • 36. Why Recheck?
      • Removes wrong results from the result candidates of the index scan.
        =# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('trial');
        QUERY PLAN
        -------------------------------------------------------------------
        Bitmap Heap Scan on tbl (cost=24.00..28.01 rows=1 width=5) (actual time=0.060..0.060 rows=1 loops=1)
          Recheck Cond: (col ~~ '%trial%'::text)
          Rows Removed by Index Recheck: 1
          -> Bitmap Index Scan on idx (cost=0.00..24.00 rows=1 width=0) (actual time=0.043..0.043 rows=2 loops=1)
               Index Cond: (col ~~ '%trial%'::text)
        Total runtime: 0.117 ms
        (6 rows)
  • 37. Why Recheck?
      Table:
        TID | Data
        ----+---------
        1   | trial
        2   | trivial
      Bigrams:
        trial   - " t", al, ia, "l ", ri, tr
        trivial - " t", al, ia, iv, "l ", ri, tr, vi
      Index:
        Key  | TID
        -----+------
        " t" | 1, 2
        al   | 1, 2
        ia   | 1, 2
        iv   | 2
        "l " | 1, 2
        ri   | 1, 2
        tr   | 1, 2
        vi   | 2
      Search 'trial': the index scan returns TID 1 (trial) and TID 2 (trivial); the recheck keeps only TID 1 (trial).
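The trial/trivial case can be replayed with a toy bigram index in Python. This is a sketch under simplifying assumptions, not pg_bigm's algorithm: rows are indexed with padded bigrams as on the slide, while the search key is split without padding because a `LIKE '%key%'` match may begin in the middle of a word. All function names are made up for the example.

```python
from collections import defaultdict

ROWS = {1: "trial", 2: "trivial"}  # TID -> data, as on the slide


def row_bigrams(text):
    """Padded 2-grams of every word in a stored row."""
    grams = set()
    for word in text.split():
        padded = " " + word + " "
        grams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams


def key_bigrams(key):
    """Unpadded 2-grams of the search key (assumption of this sketch)."""
    return {key[i:i + 2] for i in range(len(key) - 1)}


def build_index(rows):
    index = defaultdict(set)
    for tid, text in rows.items():
        for gram in row_bigrams(text):
            index[gram].add(tid)
    return index


def search(rows, index, key, recheck=True):
    """Index scan (intersect posting lists), then optional recheck."""
    postings = [index.get(gram, set()) for gram in key_bigrams(key)]
    tids = set.intersection(*postings) if postings else set(rows)
    if recheck:
        tids = {tid for tid in tids if key in rows[tid]}  # the real test
    return sorted(tids)


index = build_index(ROWS)
print(search(ROWS, index, "trial", recheck=False))  # candidates from the index
print(search(ROWS, index, "trial", recheck=True))   # after the recheck
```

'trivial' contains every bigram of 'trial' (tr, ri, ia, al), just not next to each other, so only a look at the row itself can reject it.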
  • 38. Disabling Recheck
      Parameter - enable_recheck
      • Set to off to disable the recheck and get all the results retrieved by the index scan
      • Values: on/off
        =# SET pg_bigm.enable_recheck = on;
        =# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
        doc
        ----------------------
        He is awaiting trial
        (1 row)
        =# SET pg_bigm.enable_recheck = off;
        =# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
        doc
        --------------------------
        He is awaiting trial
        It was a trivial mistake
        (2 rows)
  • 39. pg_bigm Full Text Search Sample
        =# CREATE TABLE tbl (col text);
        =# CREATE INDEX tbl_idx ON tbl USING gin (col gin_bigm_ops);
        =# INSERT INTO tbl VALUES ('He is awaiting trial'), ('Those orchids are very special to her'), ('pg_bigm performs full text search using 2 gram index'), ('pg_trgm performs full text search using 3 gram index');
        =# SELECT * FROM tbl WHERE col LIKE likequery('full text search');
        col
        ------------------------------------------------------
        pg_bigm performs full text search using 2 gram index
        pg_trgm performs full text search using 3 gram index
        (2 rows)
  • 40. Similarity Search
  • 41. Function – bigm_similarity
      Argument: The 2 strings whose similarity is to be checked
      Return value: The similarity value of the two arguments (0 - 1)
      • Measures the similarity of two strings by counting the number of 2-grams they share.
        =# SELECT bigm_similarity('test', 'text');
        bigm_similarity
        -----------------
        0.6
        (1 row)
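The slide does not state the exact formula. One formula that reproduces the 0.6 shown here is "shared 2-grams divided by the 2-gram count of the larger set", and the Python sketch below assumes that formula; treat it as an illustration rather than a statement of how pg_bigm computes the value. 'test' and 'text' each have five 2-grams and share three of them (" t", te, "t ").

```python
def bigrams(text):
    """Distinct padded 2-grams of every word in text."""
    grams = set()
    for word in text.split():
        padded = " " + word + " "
        grams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams


def bigm_similarity(a, b):
    """Shared 2-grams divided by the size of the larger 2-gram set
    (assumed formula, chosen because it matches the slide's example)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


print(bigm_similarity("test", "text"))
```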
  • 42. Parameter - similarity_limit
      • Specifies the threshold used for the similarity search
      • Search returns rows with similarity value >= similarity_limit
      • Default: 0.3
      • The SET command can be used to modify the value.
        =# SHOW pg_bigm.similarity_limit;
        pg_bigm.similarity_limit
        --------------------------
        0.3
        (1 row)
        =# SET pg_bigm.similarity_limit = 0.5;
  • 43. Similarity Operator - =%
      • Used to perform similarity search
      • Uses the full text search index.
      • Returns rows whose similarity is higher than or equal to the value of pg_bigm.similarity_limit
        SELECT * FROM <tbl> WHERE <col> =% '<key>';
  • 44. Similarity Search Sample
        =# SET pg_bigm.similarity_limit = 0.2;
        =# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
        col   | bigm_similarity
        ------+-----------------
        test  | 1
        text  | 0.6
        treat | 0.333333
        (3 rows)
        =# SET pg_bigm.similarity_limit = 0.5;
        =# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
        col  | bigm_similarity
        -----+-----------------
        test | 1
        text | 0.6
        (2 rows)
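The effect of the threshold in this sample can be replayed in Python. The sketch assumes the similarity formula "shared 2-grams divided by the size of the larger 2-gram set", which is consistent with the three values shown on the slide, and a table holding just the three rows from the output. `similarity_search` is a made-up helper standing in for the `=%` operator.

```python
def bigrams(text):
    """Distinct padded 2-grams of every word in text."""
    grams = set()
    for word in text.split():
        padded = " " + word + " "
        grams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams


def bigm_similarity(a, b):
    """Assumed formula: shared 2-grams / size of the larger 2-gram set."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


def similarity_search(values, key, similarity_limit=0.3):
    """Keep the values whose similarity to key is >= similarity_limit."""
    scored = [(value, bigm_similarity(value, key)) for value in values]
    return [(value, score) for value, score in scored if score >= similarity_limit]


TABLE = ["test", "text", "treat"]  # hypothetical table contents

for limit in (0.2, 0.5):
    print(limit, similarity_search(TABLE, "test", similarity_limit=limit))
```

Raising the limit from 0.2 to 0.5 drops 'treat', which shares only two of its six 2-grams with 'test'.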
  • 45. References • PostgreSQL documents • wiki.postgresql.org • Understanding Full Text Search • http://linuxgazette.net/164/sephton.html • http://www.slideshare.net/billkarwin/full-text-search-in-postgresql • Understanding pg_bigm • pgbigm.sourceforge.jp • www.slideshare.net/masahikosawada98/pg-bigm