A comparison of different solutions for full-text search in web applications using PostgreSQL and other technologies. Presented at the PostgreSQL Conference West, in Seattle, October 2009.
7. Naive Searching
Some people, when confronted with a problem,
think “I know, I’ll use regular expressions.”
Now they have two problems.
— Jamie Zawinski
8. Performance issue
• LIKE with wildcards (time: 91 sec):
SELECT * FROM Posts
WHERE body LIKE '%postgresql%';
• POSIX regular expressions (time: 105 sec):
SELECT * FROM Posts
WHERE body ~ 'postgresql';
9. Why so slow?
CREATE TABLE telephone_book (
  full_name VARCHAR(50)
);
CREATE INDEX name_idx ON telephone_book (full_name);
INSERT INTO telephone_book VALUES
  ('Riddle, Thomas'),
  ('Thomas, Dean');
10. Why so slow?
• Search for all with last name “Thomas” (uses index):
SELECT * FROM telephone_book
WHERE full_name LIKE 'Thomas%';
• Search for all with first name “Thomas” (doesn’t use index):
SELECT * FROM telephone_book
WHERE full_name LIKE '%Thomas';
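Why the leading wildcard defeats the index can be sketched outside the database. The snippet below is a minimal Python illustration, not PostgreSQL's B-tree: it assumes an index behaves like a sorted list of keys, so a prefix maps to one contiguous range found by binary search, while a suffix can sit anywhere and forces a look at every entry. The function names and the two extra sample names are hypothetical.

```python
from bisect import bisect_left

# Stand-in for an index on full_name: the keys kept in sorted order.
# "Granger, Hermione" and "Thomas, Alice" are made-up extra rows.
index = sorted(["Riddle, Thomas", "Thomas, Dean", "Granger, Hermione", "Thomas, Alice"])

def prefix_search(keys, prefix):
    """Binary-search to the first key >= prefix, then read while keys still match."""
    start = bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches

def suffix_search(keys, suffix):
    """Sorted order does not help here: every key has to be examined."""
    return [key for key in keys if key.endswith(suffix)]

last_name_thomas = prefix_search(index, "Thomas")
first_name_thomas = suffix_search(index, "Thomas")
```

The prefix search stops as soon as a key no longer matches; the suffix search has no such shortcut, which is the same reason the second query falls back to scanning the table.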
12. Accuracy issue
• Irrelevant or false matching words
‘one’, ‘money’, ‘prone’, etc.:
body LIKE '%one%'
• Regular expressions in PostgreSQL
support escapes for word boundaries:
body ~ '\yone\y'
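The difference between substring matching and whole-word matching can be tried with a few lines of Python. This is only an illustration with hypothetical sample sentences; note that Python's `re` module writes the word boundary as `\b`, whereas PostgreSQL regular expressions write it as `\y`.

```python
import re

# Hypothetical post bodies.
bodies = [
    "Save money on hosting",
    "Systems prone to failure",
    "Only one index is needed",
]

# Substring match, like body LIKE '%one%': also hits 'money' and 'prone'.
substring_hits = [b for b in bodies if "one" in b]

# Whole-word match: \b in Python plays the role of \y in PostgreSQL.
word = re.compile(r"\bone\b")
word_hits = [b for b in bodies if word.search(b)]
```

Here the substring test accepts all three sentences, while the word-boundary pattern accepts only the last one.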
15. PostgreSQL Text-Search
• Since PostgreSQL 8.3
• TSVECTOR to represent text data
• TSQUERY to represent search predicates
• Special indexes
16. PostgreSQL Text-Search:
Basic Querying
SELECT * FROM Posts
WHERE to_tsvector(title || ' ' || body || ' ' || tags)
@@ to_tsquery('postgresql & performance');
-- @@ is the text-search matching operator
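As a rough mental model of what the query above does, the following Python sketch reduces each document to a set of words and tests an AND-only query against it. It is a toy under stated assumptions: the real `to_tsvector` also stems words, drops stop words, and records positions, and the real `to_tsquery` supports more operators than `&`. The function names `to_vector` and `matches_all` and the sample posts are hypothetical.

```python
import re

def to_vector(text):
    """Toy stand-in for to_tsvector: the set of lowercased word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_all(vector, query):
    """Toy stand-in for @@ with an AND-only query such as 'a & b'."""
    terms = [t.strip().lower() for t in query.split("&")]
    return all(term in vector for term in terms)

# Hypothetical posts keyed by id.
posts = {
    1: "PostgreSQL performance tuning basics",
    2: "Getting started with PostgreSQL",
    3: "Performance of regular expressions",
}
hits = [post_id for post_id, text in posts.items()
        if matches_all(to_vector(text), "postgresql & performance")]
```

Only post 1 contains both terms, so it is the only hit.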
17. PostgreSQL Text-Search:
Basic Querying
SELECT * FROM Posts
WHERE title || ' ' || body || ' ' || tags
@@ 'postgresql & performance';
time with no index: 8 min 2 sec
18. PostgreSQL Text-Search:
Add TSVECTOR column
ALTER TABLE Posts ADD COLUMN
PostText TSVECTOR;
UPDATE Posts SET PostText =
to_tsvector('english', title || ' ' || body || ' ' || tags);
19. Special index types
• GIN (generalized inverted index)
• GiST (generalized search tree)
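The idea behind an inverted index such as GIN can be sketched in a few lines: each word maps to the set of rows that contain it, and an AND query intersects those sets instead of scanning every row. This Python toy is an assumption-laden illustration, not how PostgreSQL stores GIN pages; the names `build_inverted_index` and `search_all` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_all(index, words):
    """Intersect the id sets of every query word (AND semantics)."""
    postings = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical documents keyed by id.
docs = {
    1: "postgresql performance tuning",
    2: "postgresql full text search",
    3: "lucene performance notes",
}
gin_like = build_inverted_index(docs)
result = search_all(gin_like, ["postgresql", "performance"])
```

The lookup cost depends on the number of query words and the size of their id sets, not on the total number of documents.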
20. PostgreSQL Text-Search:
Indexing
CREATE INDEX PostText_GIN ON Posts
USING GIN(PostText);
time: 39 min 36 sec
21. PostgreSQL Text-Search:
Querying
SELECT * FROM Posts
WHERE PostText @@ 'postgresql & performance';
time with index: 20 milliseconds
22. PostgreSQL Text-Search:
Keep TSVECTOR in sync
CREATE TRIGGER TS_PostText
BEFORE INSERT OR UPDATE ON Posts
FOR EACH ROW
EXECUTE PROCEDURE
tsvector_update_trigger(
PostText,
-- the configuration name must be schema-qualified
'pg_catalog.english', title, body, tags);
24. Lucene
• Full-text indexing and search engine
• Apache Project since 2001
• Apache License
• Java implementation
• Ports exist for C, Perl, Ruby, Python, PHP, etc.
25. Lucene:
How to use
1. Add documents to index
2. Parse query
3. Execute query
26. Lucene:
Creating an index
• Programmatic solution in Java...
time: 8 minutes 55 seconds
27. Lucene:
Indexing
String url = "jdbc:postgresql:stackoverflow";
Properties props = new Properties();
props.setProperty("user", "postgres");
// run any SQL query
Class.forName("org.postgresql.Driver");
Connection con = DriverManager.getConnection(url, props);
Statement stmt = con.createStatement();
String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
ResultSet rs = stmt.executeQuery(sql);
// open Lucene index writer
Date start = new Date();
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT),
    true, IndexWriter.MaxFieldLength.LIMITED);
28. Lucene:
Indexing
// loop over SQL result: each row is a Document with four Fields
while (rs.next()) {
    Document doc = new Document();
    doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO));
    doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
}
// finish and close index
writer.optimize();
writer.close();
29. Lucene:
Querying
• Parse a Lucene query
// define fields; Lucene field names are case-sensitive, so they must
// match the names used at indexing time ("Title", "Body", "Tags")
String[] fields = new String[3];
fields[0] = "Title"; fields[1] = "Body"; fields[2] = "Tags";
// parse search query
Query q = new MultiFieldQueryParser(fields,
    new StandardAnalyzer()).parse("performance");
• Execute the query
Searcher s = new IndexSearcher(indexName);
Hits h = s.search(q);
time: 80 milliseconds
37. Sphinx Search:
Issues
• Index updates are as expensive as
rebuilding the index from scratch
• Maintain “main” index plus “delta” index for
recent changes
• Merge indexes periodically
• Not all data fits into this model
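The main-plus-delta arrangement described above can be sketched as follows. This is a toy Python model of the pattern, not Sphinx's API or configuration: it assumes new documents go only into a small delta index, queries consult both indexes, and a periodic merge folds the delta into the main index. The class and method names are hypothetical.

```python
class MainDeltaIndex:
    """Toy word -> doc-id index split into a large 'main' and a small 'delta'."""

    def __init__(self):
        self.main = {}   # rebuilt rarely, assumed expensive
        self.delta = {}  # rebuilt often, holds only recent documents

    def add_recent(self, doc_id, text):
        # New documents touch only the small delta index.
        for word in text.lower().split():
            self.delta.setdefault(word, set()).add(doc_id)

    def search(self, word):
        # A query has to consult both indexes and combine the results.
        word = word.lower()
        return sorted(self.main.get(word, set()) | self.delta.get(word, set()))

    def merge(self):
        # Periodic maintenance: fold the delta into main, then empty it.
        for word, ids in self.delta.items():
            self.main.setdefault(word, set()).update(ids)
        self.delta = {}

idx = MainDeltaIndex()
idx.add_recent(1, "sphinx search engine")
idx.merge()                        # document 1 now lives in main
idx.add_recent(2, "sphinx delta index")
before_merge = idx.search("sphinx")
idx.merge()
after_merge = idx.search("sphinx")
```

Search results are the same before and after the merge; the merge only moves entries so the delta stays small. The sketch handles additions only, which hints at why not all data fits this model.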
41. Inverted index:
Data definition
CREATE TABLE TagTypes (
  TagId SERIAL PRIMARY KEY,
  Tag VARCHAR(50) NOT NULL
);
CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag);
CREATE TABLE Tags (
  PostId INT NOT NULL,
  TagId INT NOT NULL,
  PRIMARY KEY (PostId, TagId),
  FOREIGN KEY (PostId) REFERENCES Posts (PostId),
  FOREIGN KEY (TagId) REFERENCES TagTypes (TagId)
);
CREATE INDEX Tags_PostId_index ON Tags(PostId);
CREATE INDEX Tags_TagId_index ON Tags(TagId);
42. Inverted index:
Indexing
INSERT INTO Tags (PostId, TagId)
SELECT p.PostId, t.TagId
FROM Posts p JOIN TagTypes t
ON (p.Tags LIKE '%<' || t.Tag || '>%');
time: 90 seconds per tag!!
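One possible way around the per-tag LIKE scan, sketched here as an alternative and not as the method used in this talk, is to split each post's tag string once and emit the (PostId, TagId) pairs directly. The sketch assumes tags are stored as angle-bracket-delimited names such as `<postgresql><performance>`, which the LIKE pattern above implies; the function name and sample data are hypothetical.

```python
import re

def tag_rows(posts, tag_ids):
    """Yield (post_id, tag_id) pairs by splitting each '<a><b>' tag string once."""
    for post_id, tags in posts:
        for name in re.findall(r"<([^<>]+)>", tags):
            if name in tag_ids:
                yield (post_id, tag_ids[name])

# Hypothetical sample data standing in for Posts(PostId, Tags) and TagTypes.
posts = [(1, "<postgresql><performance>"), (2, "<lucene>"), (3, "")]
tag_ids = {"postgresql": 10, "performance": 11, "lucene": 12}

rows = list(tag_rows(posts, tag_ids))
```

Each post is read once regardless of how many tags exist, and the resulting rows could then be bulk-loaded into the Tags table.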
43. Inverted index:
Querying
SELECT p.* FROM Posts p
JOIN Tags t USING (PostId)
JOIN TagTypes tt USING (TagId)
WHERE tt.Tag = 'performance';
time: 40 milliseconds
45. Search engine services:
Google Custom Search Engine
• http://www.google.com/cse/
• DEMO ➪ http://www.karwin.com/demo/gcse-demo.html
(even big web sites use this solution)
46. Search engine services:
Is it right for you?
• Your site is public and allows external index
• Search is a non-critical feature for you
• Search results are satisfactory
• You need to offload search processing
47. Comparison: Time to Build Index
LIKE predicate none
PostgreSQL / GIN 40 min
Sphinx Search 6 min
Apache Lucene 9 min
Inverted index high
Google / Yahoo! offline
48. Comparison: Index Storage
LIKE predicate none
PostgreSQL / GIN 532 MB
Sphinx Search 533 MB
Apache Lucene 1071 MB
Inverted index 101 MB
Google / Yahoo! offline
49. Comparison: Query Speed
LIKE predicate 90+ sec
PostgreSQL / GIN 20 ms
Sphinx Search 8 ms
Apache Lucene 80 ms
Inverted index 40 ms
Google / Yahoo! *
50. Comparison: Bottom-Line
                   indexing   storage   query     solution
LIKE predicate     none       none      11,250x   SQL
PostgreSQL / GIN   7x         5.3x      2.5x      RDBMS
Sphinx Search      1x *       5.3x      1x        3rd party
Apache Lucene      1.5x       10x       10x       3rd party
Inverted index     high       1x        5x        SQL
Google / Yahoo!    offline    offline   *         Service
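The multipliers in this table appear to be each measured figure divided by the best figure in its column. The short Python check below recomputes them from the numbers on the three preceding comparison slides, under that assumption and taking the LIKE query time as 90 seconds; the helper name `relative` is hypothetical.

```python
# Figures from the three comparison slides above.
build_min = {"PostgreSQL / GIN": 40, "Sphinx Search": 6, "Apache Lucene": 9}
storage_mb = {"PostgreSQL / GIN": 532, "Sphinx Search": 533,
              "Apache Lucene": 1071, "Inverted index": 101}
query_ms = {"LIKE predicate": 90_000, "PostgreSQL / GIN": 20, "Sphinx Search": 8,
            "Apache Lucene": 80, "Inverted index": 40}

def relative(values):
    """Express every value as a multiple of the smallest one."""
    best = min(values.values())
    return {name: value / best for name, value in values.items()}

build_x = relative(build_min)
storage_x = relative(storage_mb)
query_x = relative(query_ms)
```

For example, 90,000 ms / 8 ms gives the 11,250x shown for LIKE, and 40 min / 6 min is about 6.7, which the table rounds to 7x. Lucene's storage ratio computes to about 10.6, so the table's 10x is a looser rounding.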
51. Copyright 2009 Bill Karwin
www.slideshare.net/billkarwin
Released under a Creative Commons 3.0 License:
http://creativecommons.org/licenses/by-nc-nd/3.0/
You are free to share - to copy, distribute and
transmit this work, under the following conditions:
Attribution. You must attribute this work to Bill Karwin.
Noncommercial. You may not use this work for commercial purposes.
No Derivative Works. You may not alter, transform, or build upon this work.