PgConf EU 2014 presents 
Javier Ramirez 
* in * 
PostgreSQL 
Full-text search 
demystified 
@supercoco9 
https://teowaki.com
The problem
our architecture
One does not simply 
SELECT * from stuff where 
content ilike '%postgresql%'
Basic search features 
* stemmers (run, runner, running) 
* unaccented (josé, jose) 
* results highlighting 
* rank results by relevance
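A small illustration of the first two features. This is only a sketch: it assumes the built-in `english` configuration and that the `unaccent` contrib extension can be installed on your server.

```sql
-- Stemming: inflected forms are reduced to a shared lexeme where the stemmer allows it.
SELECT to_tsvector('english', 'run running runs');

-- Accent removal: provided by the unaccent contrib extension.
CREATE EXTENSION IF NOT EXISTS unaccent;
SELECT unaccent('josé') = 'jose' AS same_without_accents;
```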
Nice to have features 
* partial searches 
* search operators (OR, AND...) 
* synonyms (postgres, postgresql, pgsql) 
* thesaurus (OS=Operating System) 
* fast, and space-efficient 
* debugging
Good News: 
PostgreSQL supports all 
the requested features
Bad News: 
unless you already know about search 
engines, the official docs are not obvious
How a search engine works 
* An indexing phase 
* A search phase
The indexing phase 
Convert the input text to tokens
The search phase 
Match the search terms to 
the indexed tokens
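The two phases can be tried in a single statement. A minimal sketch, assuming the built-in `english` configuration; the sample sentence is made up for illustration.

```sql
-- Indexing phase: text -> tsvector (normalized tokens).
-- Search phase: search terms -> tsquery, then match both with @@.
SELECT to_tsvector('english', 'PostgreSQL supports full-text searching')
       @@ to_tsquery('english', 'search') AS matches;
```

Both sides go through the same analysis, so "searching" in the text and "search" in the query are expected to meet at the same lexeme.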
indexing in depth 
* choose an index format 
* tokenize the words 
* apply token analysis/filters 
* discard unwanted tokens
the index format 
* r-tree (GIST in PostgreSQL) 
* inverted indexes (GIN in PostgreSQL) 
* dynamic/distributed indexes
dynamic indexes: segmentation 
* sometimes the token index is 
segmented to allow faster updates 
* consolidate segments to speed up 
search and account for deletions
tokenizing 
* parse/strip/convert format 
* normalize terms (unaccent, ascii, 
charsets, case folding, number precision..)
token analysis/filters 
* find synonyms 
* expand thesaurus 
* stem (maybe in different languages)
more token analysis/filters 
* eliminate stopwords 
* store word distance/frequency 
* store the full contents of some fields 
* store some fields as attributes/facets
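Stemming and stopword removal can be observed one word at a time with `ts_lexize`. The two calls below are the examples used in the PostgreSQL documentation, with the results it lists, and assume the built-in `english_stem` dictionary.

```sql
SELECT ts_lexize('english_stem', 'stars');  -- documented result: {star}
SELECT ts_lexize('english_stem', 'a');      -- documented result: {} (stop word, discarded)
```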
“the index file” is really 
* a token file, probably segmented/distributed 
* some dictionary files: synonyms, thesaurus, 
stopwords, stems/lexemes (in different languages) 
* word distance/frequency info 
* attributes/original field files 
* optional geospatial index 
* auxiliary files: word/sentence boundaries, meta-info, 
parser definitions, datasource definitions...
the hardest 
part is now 
over
searching in depth 
* tokenize/analyse 
* prepare operators 
* retrieve information 
* rank the results 
* highlight the matched parts
searching in depth: tokenize 
normalize, tokenize, and analyse 
the original search term 
the result would be a tokenized, stemmed, 
“synonymised” term, without stopwords
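In PostgreSQL this step is done by `to_tsquery` (operator syntax) or `plainto_tsquery` (plain text). The examples below come from the PostgreSQL documentation and assume the built-in `english` configuration.

```sql
-- Stop words are dropped and the remaining terms are stemmed.
SELECT to_tsquery('english', 'The & Fat & Rats');    -- documented result: 'fat' & 'rat'

-- Plain text input: terms are combined with AND.
SELECT plainto_tsquery('english', 'The Fat Rats');   -- documented result: 'fat' & 'rat'
```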
searching in depth: operators 
* partial search 
* logical/geospatial/range operators 
* in-sentence/in-paragraph/word distance 
* faceting/grouping
searching in depth: retrieval 
Go through the token index files, use the 
attributes and geospatial files if necessary 
for operators and/or grouping 
You might need to do this in a distributed way
searching in depth: ranking 
algorithm to sort the most relevant results: 
* field weights 
* word frequency/density 
* geospatial or timestamp ranking 
* ad-hoc ranking strategies
searching in depth: highlighting 
Mark the matching parts of the results 
It can be tricky/slow if you are not storing the full contents 
in your indexes
PostgreSQL as a 
full-text 
search engine
search features 
* index format configuration 
* partial search 
* word boundaries parser (not configurable) 
* stemmers/synonyms/thesaurus/stopwords 
* full-text logical operators 
* attributes/geo/timestamp/range (using SQL) 
* ranking strategies 
* highlighting 
* debugging/testing commands
indexing in postgresql 
you don't actually need an index to use full-text search in PostgreSQL 
but unless your db is very small, you want to have one 
Choose GiST or GIN (GIN: faster search, slower indexing, 
larger index size) 
CREATE INDEX pgweb_idx ON pgweb USING 
gin(to_tsvector(config_name, body));
Two new things 
CREATE INDEX ... USING gin(to_tsvector (config_name, body)); 
* to_tsvector: postgresql way of saying “tokenize” 
* config_name: tokenizing/analysis rule set
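One consequence worth sketching: an expression index is only considered when the query repeats the same expression, including the configuration name. The table definition below is hypothetical and only mirrors the `pgweb` example. `IF NOT EXISTS` on `CREATE INDEX` assumes PostgreSQL 9.5 or later. On a tiny table the planner may still prefer a sequential scan.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE IF NOT EXISTS pgweb (id serial PRIMARY KEY, title text, body text);

CREATE INDEX IF NOT EXISTS pgweb_idx
    ON pgweb USING gin (to_tsvector('english', body));

-- Candidate for the index: same expression, same configuration.
EXPLAIN SELECT title FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

-- Not a candidate: the one-argument form depends on default_text_search_config,
-- so it does not match the indexed expression.
EXPLAIN SELECT title FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
```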
Configuration 
CREATE TEXT SEARCH CONFIGURATION 
public.teowaki ( COPY = pg_catalog.english );
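To check what is available and what is used when no configuration name is passed, something like the following should work. It is a sketch that assumes the `public.teowaki` configuration above has already been created.

```sql
-- List the text search configurations known to this database.
SELECT cfgname FROM pg_catalog.pg_ts_config ORDER BY cfgname;

-- Configuration used by the one-argument forms of to_tsvector / to_tsquery.
SHOW default_text_search_config;

-- Optionally make the new configuration the session default.
SET default_text_search_config = 'public.teowaki';
```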
Configuration 
CREATE TEXT SEARCH DICTIONARY english_ispell ( 
TEMPLATE = ispell, 
DictFile = en_us, 
AffFile = en_us, 
StopWords = spanglish 
); 
CREATE TEXT SEARCH DICTIONARY spanish_ispell ( 
TEMPLATE = ispell, 
DictFile = es_any, 
AffFile = es_any, 
StopWords = spanish 
);
Configuration 
CREATE TEXT SEARCH DICTIONARY english_stem ( 
TEMPLATE = snowball, 
Language = english, 
StopWords = english 
); 
CREATE TEXT SEARCH DICTIONARY spanish_stem ( 
TEMPLATE= snowball, 
Language = spanish, 
Stopwords = spanish 
);
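Synonyms (postgres, postgresql, pgsql) use a separate dictionary template. A minimal sketch follows; the dictionary name and the file `teowaki_synonyms.syn` are hypothetical, and the file is assumed to live in `$SHAREDIR/tsearch_data/` on the server.

```sql
-- Assumed contents of $SHAREDIR/tsearch_data/teowaki_synonyms.syn (hypothetical file):
--   postgresql postgres
--   pgsql      postgres
CREATE TEXT SEARCH DICTIONARY teowaki_synonyms (
    TEMPLATE = synonym,
    SYNONYMS = teowaki_synonyms
);

-- Check what the dictionary does with a single word.
SELECT ts_lexize('teowaki_synonyms', 'pgsql');
```

To take effect, the dictionary still has to be added to the configuration mapping, ahead of the stemmers.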
Configuration 
Parser. 
Word boundaries
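The parser itself is not configurable, but its output can be inspected. A short sketch using the built-in `default` parser; the sample text is made up.

```sql
-- Token types the default parser can emit (asciiword, word, email, url, ...).
SELECT alias, description FROM ts_token_type('default');

-- How a piece of text is split into tokens, before any dictionary is applied.
SELECT tokid, token
FROM ts_parse('default', 'hello@example.com visits https://teowaki.com 3.14');
```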
Configuration 
Assign dictionaries (from most specific to most generic) 
ALTER TEXT SEARCH CONFIGURATION teowaki 
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, 
hword_part 
WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem; 
ALTER TEXT SEARCH CONFIGURATION teowaki 
DROP MAPPING FOR email, url, url_path, sfloat, float;
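Two details from the PostgreSQL documentation are worth keeping in mind when ordering dictionaries: `unaccent` is a filtering dictionary that passes its output on to the next one, so it is normally listed first; and a Snowball stemmer recognizes every word, so dictionaries listed after it are not consulted. The variant below is a hedged sketch, not the mapping used in the talk: `teowaki_es` is a hypothetical configuration, created separately so the mapping above stays untouched.

```sql
-- The unaccent dictionary ships with the unaccent contrib extension.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Hypothetical variant: filter accents first, stem last.
CREATE TEXT SEARCH CONFIGURATION public.teowaki_es ( COPY = pg_catalog.spanish );
ALTER TEXT SEARCH CONFIGURATION public.teowaki_es
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, spanish_stem;

-- Inspect the result for an accented word.
SELECT to_tsvector('public.teowaki_es', 'búsquedas');
```

Running `ts_debug` against each configuration shows which dictionary actually handled each token.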
debugging 
select * from ts_debug('teowaki', 'I am searching unas 
búsquedas con postgresql database'); 
also ts_lexize and ts_parse
tokenizing 
tokens + position (stopwords are removed, tokens are folded)
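The example below is taken from the PostgreSQL documentation and assumes the built-in `english` configuration. Stop words disappear from the output but still consume a position.

```sql
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
-- documented result:
--   'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```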
searching 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres');
searching 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres:*');
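A caveat with prefix matching: the prefix is stemmed first, so it can match more than expected. The example follows the one in the PostgreSQL documentation, written here with an explicit `english` configuration.

```sql
-- 'postgres' is stemmed to 'postgr' before the prefix match is applied.
SELECT to_tsquery('english', 'postgres:*');

-- The documentation notes that this matches, because 'postgraduate' starts with 'postgr'.
SELECT to_tsvector('english', 'postgraduate')
       @@ to_tsquery('english', 'postgres:*') AS matches;
```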
operators 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres | mysql');
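The other logical operators are `&` (AND) and `!` (NOT), and parentheses group terms. A sketch against the same `wakis` table; the search terms are made up. The second statement assumes PostgreSQL 11 or later, where `websearch_to_tsquery` accepts search-engine style input.

```sql
-- AND, OR and NOT combined.
SELECT guid, description FROM wakis
WHERE to_tsvector('teowaki', description)
      @@ to_tsquery('teowaki', 'postgres & (index | search) & !mysql');

-- PostgreSQL 11+: quoted phrases, "or", and a leading minus for exclusion.
SELECT websearch_to_tsquery('english', '"full text" search -mysql');
```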
ranking weights 
SELECT setweight(to_tsvector(coalesce(name,'')),'A') || 
setweight(to_tsvector(coalesce(description,'')),'B') 
from wakis limit 1;
search by weight
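A query term can be restricted to lexemes carrying a given weight by appending the weight letter. A minimal sketch, assuming the same `wakis` table and the `teowaki` configuration used elsewhere in these slides:

```sql
-- Match 'postgres' only where it appears in the name (weight A), not in the description (weight B).
SELECT guid, name FROM wakis
WHERE ( setweight(to_tsvector('teowaki', coalesce(name, '')), 'A') ||
        setweight(to_tsvector('teowaki', coalesce(description, '')), 'B') )
      @@ to_tsquery('teowaki', 'postgres:A');
```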
ranking 
SELECT name, ts_rank(to_tsvector(name), query) rank 
from wakis, to_tsquery('postgres | indexes') query 
where to_tsvector(name) @@ query order by rank DESC; 
also ts_rank_cd
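`ts_rank` also accepts an optional weights array, ordered {D, C, B, A}, and a normalization flag. The sketch below assumes the `wakis` table and the `teowaki` configuration; the weight values are the defaults quoted in the PostgreSQL documentation, and normalization 32 scales the rank as rank / (rank + 1).

```sql
SELECT name,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', doc, query, 32) AS rank
FROM (SELECT name,
             setweight(to_tsvector('teowaki', coalesce(name, '')), 'A') ||
             setweight(to_tsvector('teowaki', coalesce(description, '')), 'B') AS doc
      FROM wakis) AS w,
     to_tsquery('teowaki', 'postgres | indexes') AS query
WHERE doc @@ query
ORDER BY rank DESC;
```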
highlighting 
SELECT ts_headline('teowaki', name, query) from wakis, 
to_tsquery('teowaki', 'game|play') query 
where to_tsvector('teowaki', name) @@ query;
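`ts_headline` takes an options string to control the markers and the size of the excerpt. A sketch with the same table and query; the `<mark>` tags are an arbitrary choice for illustration.

```sql
SELECT ts_headline('teowaki', name, query,
                   'StartSel=<mark>, StopSel=</mark>, MaxWords=35, MinWords=15')
FROM wakis,
     to_tsquery('teowaki', 'game | play') AS query
WHERE to_tsvector('teowaki', name) @@ query;
```

`ts_headline` works on the original text rather than the tsvector, so it is usually applied only to the rows that will actually be displayed.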
USE POSTGRESQL 
FOR EVERYTHING
When PostgreSQL is not good 
* You need to index files (PDF, Odx...) 
* Your index is very big (slow reindex) 
* You need a distributed index 
* You need complex tokenizers 
* You need advanced rankers
When PostgreSQL is not good 
* You want a REST API 
* You want sentence/proximity/range/ 
more complex operators 
* You want search auto completion 
* You want advanced features (alerts...)
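On the proximity point: releases newer than this talk (PostgreSQL 9.6 and later) add a phrase operator, `<->`, and `phraseto_tsquery`. The examples below follow the PostgreSQL documentation and assume the `english` configuration; they do not apply to the versions available in 2014.

```sql
-- Adjacent words, in order.
SELECT to_tsvector('english', 'fatal error')
       @@ to_tsquery('english', 'fatal <-> error') AS matches;

-- Builds a phrase query from plain text.
SELECT phraseto_tsquery('english', 'The Fat Rats');  -- documented result: 'fat' <-> 'rat'
```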
But it has been 
perfect for us so far. 
Our users don't care 
which search engine 
we use, as long as 
it works.
