SlideShare une entreprise Scribd logo
1  sur  57
Télécharger pour lire hors ligne
What is the best
full text search engine
for Python?
Andrii Soldatenko
23-24 April 2016
@a_soldatenko
Agenda:
• Who am I?
• What is full text search?
• PostgreSQL FTS / Elastic / Whoosh
• Pros and Cons
• What’s next?
Andrii Soldatenko
• Backend Python Developer at
• CTO in Persollo.com
• Speaker at many PyCons and
Python meetups
• blogger at https://asoldatenko.com
Preface
Text Search
➜ cpython time ack OrderedDict
ack OrderedDict 2.53s user 0.22s system 94% cpu 2.915 total
➜ cpython time pt OrderedDict
pt OrderedDict 0.14s user 0.12s system 406% cpu 0.064 total
➜ cpython time pss OrderedDict
pss OrderedDict 1.08s user 0.14s system 88% cpu 1.370 total
➜ cpython time grep -r -i 'OrderedDict' .
grep -r -i 'OrderedDict' 2.70s user 0.13s system 94% cpu 2.998 total
Full text search
Search index
Simple sentences
1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer
Inverted index
Inverted index
Inverted index:
normalization
Term Doc_1 Doc_2
-------------------------
brown | X | X
dog | X | X
fox | X | X
in | | X
jump | X | X
lazy | X | X
over | X | X
quick | X | X
summer | | X
the | X | X
------------------------
Term Doc_1 Doc_2
-------------------------
Quick | | X
The | X |
brown | X | X
dog | X |
dogs | | X
fox | X |
foxes | | X
in | | X
jumped | X |
lazy | X | X
leap | | X
over | X | X
quick | X |
summer | | X
the | X |
------------------------
Search Engines
PostgreSQL
Full Text Search
support from version 8.3
PostgreSQL
Full Text Search
SELECT to_tsvector('text') @@
to_tsquery('query');
Simple is better than complex. - by import this
SELECT 'python conference ukraine 2016'::tsvector @@
'python & ukraine'::tsquery;
?column?
----------
t
(1 row)
Do PostgreSQL FTS
without index
Do PostgreSQL FTS
with index
CREATE INDEX name ON table USING GIN
(column);
CREATE INDEX name ON table USING
GIST (column);
PostgreSQL FTS:

Ranking Search Results
ts_rank() -> float4 - based on the
frequency of their matching lexemes
ts_rank_cd() -> float4 - cover
density ranking for the given
document vector and query
PostgresSQL FTS
Highlighting Results
SELECT ts_headline('english',
'python conference ukraine 2016',
to_tsquery('python & 2016'));
ts_headline
----------------------------------------------
<b>python</b> conference ukraine <b>2016</b>
Stop Words
postgresql/9.5.2/share/postgresql/tsearch_data/english.stop
PostgresSQL FTS
Stop Words
SELECT to_tsvector('in the list of stop words');
to_tsvector
----------------------------
'list':3 'stop':5 'word':6
PG FTS

and Python
• Django 1.10 django.contrib.postgres.search
(36 hours ago)
• djorm-ext-pgfulltext
• sqlalchemy
PostgreSQL FTS
integration with django orm
https://github.com/linuxlewis/djorm-ext-pgfulltext
from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField
from django.db import models
class Page(models.Model):
name = models.CharField(max_length=200)
description = models.TextField()
search_index = VectorField()
objects = SearchManager(
fields = ('name', 'description'),
config = 'pg_catalog.english', # this is default
search_field = 'search_index', # this is default
auto_update_search_field = True
)
For search just use search
method of the manager
https://github.com/linuxlewis/djorm-ext-pgfulltext
>>> Page.objects.search("documentation & about")
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True)
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
Django 1.10
>>> Entry.objects.filter(body_text__search='recipe')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza
recipes>]
>>> Entry.objects.annotate(
... search=SearchVector('blog__tagline',
'body_text'),
... ).filter(search='cheese')
[
<Entry: Cheese on Toast recipes>,
<Entry: Pizza Recipes>,
<Entry: Dairy farming in Argentina>,
]
https://github.com/django/django/commit/2d877da
Pros and Cons
Pros:
• Quick implementation
• No dependency
Cons:
• Need manually manage indexes
• depend on PostgreSQL
• no analytics data
• no DSL only `&` and `|` queries
• difficult to manage stop words
ElasticSearch
Who uses ElasticSearch?
ElasticSearch:
Quick Intro
Relational DB Databases TablesRows Columns
ElasticSearch Indices FieldsTypes Documents
ElasticSearch:
Locks
•Pessimistic concurrency control
•Optimistic concurrency control
ElasticSearch and
Python
• elasticsearch-py
• elasticsearch-dsl-py by Honza Kral
• elasticsearch-py-async by Honza Kral
ElasticSearch:
FTS
$ curl -XGET 'http://localhost:9200/
pyconua/talk/_search' -d '
{
    "query": {
        "match": {
            "user": "Andrii"
        }
    }
}'
ES: Create Index
$ curl -XPUT 'http://localhost:9200/
twitter/' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    }
}'
ES: Add json to Index
$ curl -XPUT 'http://localhost:9200/
pyconua/talk/1' -d '{
    "user" : "andrii",
    "description" : "Full text search"
}'
ES: Stopwords
$ curl -XPUT 'http://localhost:9200/pyconua' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "stopwords/english.txt"
        }
      }
    }
  }
}'
ES: Highlight
$ curl -XGET 'http://localhost:9200/pyconua/talk/
_search' -d '{
    "query" : {...},
    "highlight" : {
        "pre_tags" : ["<tag1>"],
        "post_tags" : ["</tag1>"],
        "fields" : {
            "_all" : {}
        }
    }
}'
ES: Relevance
$ curl -XGET 'http://localhost:9200/_search?explain -d
'
{
"query" : { "match" : { "user" : "andrii" }}
}'
"_explanation": {
  "description": "weight(tweet:honeymoon in 0)
                  [PerFieldSimilarity], result of:",
  "value": 0.076713204,
  "details": [...]
}
Whoosh
• Pure-Python
• Whoosh was created by Matt Chaput.
• Pluggable scoring algorithm (including BM25F)
• more info at video from PyCon US 2013
Whoosh: Stop words
import os.path
import textwrap
names = os.listdir("stopwords")
for name in names:
f = open("stopwords/" + name)
wordls = [line.strip() for line in f]
words = " ".join(wordls)
print '"%s": frozenset(u"""' % name
print textwrap.fill(words, 72)
print '""".split())'
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/
snowball/stopwords/
Whoosh: 

Highlight
results = pycon.search(myquery)
for hit in results:
print(hit["title"])
# Assume "content" field is stored
print(hit.highlights("content"))
Whoosh: 

Ranking search results
• Pluggable scoring algorithm
• including BM25F
Haystack
Adding search functionality
to Simple Model
$ cat myapp/models.py
from django.db import models
from django.contrib.auth.models import User
class Page(models.Model):
user = models.ForeignKey(User)
name = models.CharField(max_length=200)
description = models.TextField()
def __unicode__(self):
return self.name
Haystack: Installation
$ pip install django-haystack
$ cat settings.py
INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.sites',
# Added.
'haystack',
# Then your usual apps...
'blog',
]
Haystack: Installation
$ pip install elasticsearch
$ cat settings.py
...
HAYSTACK_CONNECTIONS = {
'default': {
'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
'URL': 'http://127.0.0.1:9200/',
'INDEX_NAME': 'haystack',
},
}
...
Haystack:
Creating SearchIndexes
$ cat myapp/search_indexes.py
import datetime
from haystack import indexes
from myapp.models import Note
class PageIndex(indexes.SearchIndex, indexes.Indexable):
text = indexes.CharField(document=True, use_template=True)
author = indexes.CharField(model_attr='user')
pub_date = indexes.DateTimeField(model_attr='pub_date')
def get_model(self):
return Note
def index_queryset(self, using=None):
"""Used when the entire index for model is updated."""
return self.get_model().objects. 
filter(pub_date__lte=datetime.datetime.now())
Haystack:
SearchQuerySet API
from haystack.query import SearchQuerySet
from haystack.inputs import Raw
all_results = SearchQuerySet().all()
hello_results = SearchQuerySet().filter(content='hello')
unfriendly_results = SearchQuerySet().
exclude(content=‘hello’).
filter(content=‘world’)
# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
Keeping data in sync
# Update everything.
./manage.py update_index --settings=settings.prod
# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2
# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod
# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod
# Update everything between Dec. 1, 2011 & Dec 31, 2011
./manage.py update_index --start='2011-12-01T00:00:00' --end='2011-12-31T23:59:59' --
settings=settings.prod
Signals
class RealtimeSignalProcessor(BaseSignalProcessor):
"""
Allows for observing when saves/deletes fire & automatically updates the
search engine appropriately.
"""
def setup(self):
# Naive (listen to all model saves).
models.signals.post_save.connect(self.handle_save)
models.signals.post_delete.connect(self.handle_delete)
# Efficient would be going through all backends & collecting all models
# being used, then hooking up signals only for those.
def teardown(self):
# Naive (listen to all model saves).
models.signals.post_save.disconnect(self.handle_save)
models.signals.post_delete.disconnect(self.handle_delete)
# Efficient would be going through all backends & collecting all models
# being used, then disconnecting signals only for those.
Haystack:
Pros and Cons
Pros:
• easy to setup
• looks like Django ORM but for searches
• search engine independent
• support 4 engines (Elastic, Solr, Xapian, Whoosh)
Cons:
• poor SearchQuerySet API
• difficult to manage stop words
• loose performance, because extra layer
• Model - based
Results
Python 

clients
Python 3
Django

support
elasticsearch-py

elasticsearch-dsl-
py

elasticsearch-py-
async
YES
haystack +

elasticstack

psycopg2 YES
djorm-ext-
pgfulltext

django.contrib.po
stgres
Whoosh YES
support using
haystack
ResultsIndexes Without indexes
PUT /index/ No support
GIN/GIST to_tsvector()
index folder No support
Results
ranking /
relevance
Configure

Stopwords
highlight
search
results
TF/IDF YES YES
cd_rank YES YES
Okapi BM25 YES YES
ResultsSynonyms Scale
YES YES
NO SUPPORT I’m not sure
NO SUPPORT NO
Final Thoughts
Questions
?
Thank You
andrii.soldatenko@toptal.com
@a_soldatenko
https://asoldatenko.com
We are hiring
https://www.toptal.com/#connect-
fantastic-computer-engineers

Contenu connexe

Plus de Andrii Soldatenko

Plus de Andrii Soldatenko (6)

What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?
 
Kyiv.py #16 october 2015
Kyiv.py #16 october 2015Kyiv.py #16 october 2015
Kyiv.py #16 october 2015
 
PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.PyCon Russian 2015 - Dive into full text search with python.
PyCon Russian 2015 - Dive into full text search with python.
 
PyCon 2015 Belarus Andrii Soldatenko
PyCon 2015 Belarus Andrii SoldatenkoPyCon 2015 Belarus Andrii Soldatenko
PyCon 2015 Belarus Andrii Soldatenko
 
PyCon Ukraine 2014
PyCon Ukraine 2014PyCon Ukraine 2014
PyCon Ukraine 2014
 
SeleniumCamp 2015 Andrii Soldatenko
SeleniumCamp 2015 Andrii SoldatenkoSeleniumCamp 2015 Andrii Soldatenko
SeleniumCamp 2015 Andrii Soldatenko
 

Dernier

pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
JOHNBEBONYAP1
 
75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx
Asmae Rabhi
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
ayvbos
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
ydyuyu
 

Dernier (20)

Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx
 
Power point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria IuzzolinoPower point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria Iuzzolino
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 

PyCon UA 2016

  • 1. What is the best full text search engine for Python? Andrii Soldatenko 23-24 April 2016 @a_soldatenko
  • 2. Agenda: • Who am I? • What is full text search? • PostgreSQL FTS / Elastic / Whoosh • Pros and Cons • What’s next?
  • 3. Andrii Soldatenko • Backend Python Developer at • CTO in Persollo.com • Speaker at many PyCons and Python meetups • blogger at https://asoldatenko.com
  • 5. Text Search ➜ cpython time ack OrderedDict ack OrderedDict 2.53s user 0.22s system 94% cpu 2.915 total ➜ cpython time pt OrderedDict pt OrderedDict 0.14s user 0.12s system 406% cpu 0.064 total ➜ cpython time pss OrderedDict pss OrderedDict 1.08s user 0.14s system 88% cpu 1.370 total ➜ cpython time grep -r -i 'OrderedDict' . grep -r -i 'OrderedDict' 2.70s user 0.13s system 94% cpu 2.998 total
  • 8. Simple sentences 1. The quick brown fox jumped over the lazy dog 2. Quick brown foxes leap over lazy dogs in summer
  • 11. Inverted index: normalization Term Doc_1 Doc_2 ------------------------- brown | X | X dog | X | X fox | X | X in | | X jump | X | X lazy | X | X over | X | X quick | X | X summer | | X the | X | X ------------------------ Term Doc_1 Doc_2 ------------------------- Quick | | X The | X | brown | X | X dog | X | dogs | | X fox | X | foxes | | X in | | X jumped | X | lazy | X | X leap | | X over | X | X quick | X | summer | | X the | X | ------------------------
  • 14. PostgreSQL Full Text Search SELECT to_tsvector('text') @@ to_tsquery('query'); Simple is better than complex. - by import this
  • 15. SELECT 'python conference ukraine 2016'::tsvector @@ 'python & ukraine'::tsquery; ?column? ---------- t (1 row) Do PostgreSQL FTS without index
  • 16. Do PostgreSQL FTS with index CREATE INDEX name ON table USING GIN (column); CREATE INDEX name ON table USING GIST (column);
  • 17. PostgreSQL FTS:
 Ranking Search Results ts_rank() -> float4 - based on the frequency of their matching lexemes ts_rank_cd() -> float4 - cover density ranking for the given document vector and query
  • 18. PostgresSQL FTS Highlighting Results SELECT ts_headline('english', 'python conference ukraine 2016', to_tsquery('python & 2016')); ts_headline ---------------------------------------------- <b>python</b> conference ukraine <b>2016</b>
  • 20. PostgresSQL FTS Stop Words SELECT to_tsvector('in the list of stop words'); to_tsvector ---------------------------- 'list':3 'stop':5 'word':6
  • 21. PG FTS
 and Python • Django 1.10 django.contrib.postgres.search (36 hours ago) • djorm-ext-pgfulltext • sqlalchemy
  • 22. PostgreSQL FTS integration with django orm https://github.com/linuxlewis/djorm-ext-pgfulltext from djorm_pgfulltext.models import SearchManager from djorm_pgfulltext.fields import VectorField from django.db import models class Page(models.Model): name = models.CharField(max_length=200) description = models.TextField() search_index = VectorField() objects = SearchManager( fields = ('name', 'description'), config = 'pg_catalog.english', # this is default search_field = 'search_index', # this is default auto_update_search_field = True )
  • 23. For search just use search method of the manager https://github.com/linuxlewis/djorm-ext-pgfulltext >>> Page.objects.search("documentation & about") [<Page: Page: Home page>] >>> Page.objects.search("about | documentation | django | home", raw=True) [<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
  • 24. Django 1.10 >>> Entry.objects.filter(body_text__search='recipe') [<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>] >>> Entry.objects.annotate( ... search=SearchVector('blog__tagline', 'body_text'), ... ).filter(search='cheese') [ <Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>, <Entry: Dairy farming in Argentina>, ] https://github.com/django/django/commit/2d877da
  • 25. Pros and Cons Pros: • Quick implementation • No dependency Cons: • Need manually manage indexes • depend on PostgreSQL • no analytics data • no DSL only `&` and `|` queries • difficult to manage stop words
  • 28. ElasticSearch: Quick Intro Relational DB Databases TablesRows Columns ElasticSearch Indices FieldsTypes Documents
  • 30. ElasticSearch and Python • elasticsearch-py • elasticsearch-dsl-py by Honza Kral • elasticsearch-py-async by Honza Kral
  • 31. ElasticSearch: FTS $ curl -XGET 'http://localhost:9200/ pyconua/talk/_search' -d ' {     "query": {         "match": {             "user": "Andrii"         }     } }'
  • 32. ES: Create Index $ curl -XPUT 'http://localhost:9200/ twitter/' -d '{     "settings" : {         "index" : {             "number_of_shards" : 3,             "number_of_replicas" : 2         }     } }'
  • 33. ES: Add json to Index $ curl -XPUT 'http://localhost:9200/ pyconua/talk/1' -d '{     "user" : "andrii",     "description" : "Full text search" }'
  • 34. ES: Stopwords $ curl -XPUT 'http://localhost:9200/pyconua' -d '{   "settings": {     "analysis": {       "analyzer": {         "my_english": {           "type": "english",           "stopwords_path": "stopwords/english.txt"         }       }     }   } }'
  • 35. ES: Highlight $ curl -XGET 'http://localhost:9200/pyconua/talk/ _search' -d '{     "query" : {...},     "highlight" : {         "pre_tags" : ["<tag1>"],         "post_tags" : ["</tag1>"],         "fields" : {             "_all" : {}         }     } }'
  • 36. ES: Relevance $ curl -XGET 'http://localhost:9200/_search?explain -d ' { "query" : { "match" : { "user" : "andrii" }} }' "_explanation": {   "description": "weight(tweet:honeymoon in 0)                   [PerFieldSimilarity], result of:",   "value": 0.076713204,   "details": [...] }
  • 37. Whoosh • Pure-Python • Whoosh was created by Matt Chaput. • Pluggable scoring algorithm (including BM25F) • more info at video from PyCon US 2013
  • 38. Whoosh: Stop words import os.path import textwrap names = os.listdir("stopwords") for name in names: f = open("stopwords/" + name) wordls = [line.strip() for line in f] words = " ".join(wordls) print '"%s": frozenset(u"""' % name print textwrap.fill(words, 72) print '""".split())' http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/ snowball/stopwords/
  • 39. Whoosh: 
 Highlight results = pycon.search(myquery) for hit in results: print(hit["title"]) # Assume "content" field is stored print(hit.highlights("content"))
  • 40. Whoosh: 
 Ranking search results • Pluggable scoring algorithm • including BM25F
  • 42. Adding search functionality to Simple Model $ cat myapp/models.py from django.db import models from django.contrib.auth.models import User class Page(models.Model): user = models.ForeignKey(User) name = models.CharField(max_length=200) description = models.TextField() def __unicode__(self): return self.name
  • 43. Haystack: Installation $ pip install django-haystack $ cat settings.py INSTALLED_APPS = [ 'django.contrib.admin', 'django.contrib.auth', 'django.contrib.contenttypes', 'django.contrib.sessions', 'django.contrib.sites', # Added. 'haystack', # Then your usual apps... 'blog', ]
  • 44. Haystack: Installation $ pip install elasticsearch $ cat settings.py ... HAYSTACK_CONNECTIONS = { 'default': { 'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine', 'URL': 'http://127.0.0.1:9200/', 'INDEX_NAME': 'haystack', }, } ...
  • 45. Haystack: Creating SearchIndexes $ cat myapp/search_indexes.py import datetime from haystack import indexes from myapp.models import Note class PageIndex(indexes.SearchIndex, indexes.Indexable): text = indexes.CharField(document=True, use_template=True) author = indexes.CharField(model_attr='user') pub_date = indexes.DateTimeField(model_attr='pub_date') def get_model(self): return Note def index_queryset(self, using=None): """Used when the entire index for model is updated.""" return self.get_model().objects. filter(pub_date__lte=datetime.datetime.now())
  • 46. Haystack: SearchQuerySet API from haystack.query import SearchQuerySet from haystack.inputs import Raw all_results = SearchQuerySet().all() hello_results = SearchQuerySet().filter(content='hello') unfriendly_results = SearchQuerySet(). exclude(content=‘hello’). filter(content=‘world’) # To send unescaped data: sqs = SearchQuerySet().filter(title=Raw(trusted_query))
  • 47. Keeping data in sync # Update everything. ./manage.py update_index --settings=settings.prod # Update everything with lots of information about what's going on. ./manage.py update_index --settings=settings.prod --verbosity=2 # Update everything, cleaning up after deleted models. ./manage.py update_index --remove --settings=settings.prod # Update everything changed in the last 2 hours. ./manage.py update_index --age=2 --settings=settings.prod # Update everything between Dec. 1, 2011 & Dec 31, 2011 ./manage.py update_index --start='2011-12-01T00:00:00' --end='2011-12-31T23:59:59' -- settings=settings.prod
  • 48. Signals class RealtimeSignalProcessor(BaseSignalProcessor): """ Allows for observing when saves/deletes fire & automatically updates the search engine appropriately. """ def setup(self): # Naive (listen to all model saves). models.signals.post_save.connect(self.handle_save) models.signals.post_delete.connect(self.handle_delete) # Efficient would be going through all backends & collecting all models # being used, then hooking up signals only for those. def teardown(self): # Naive (listen to all model saves). models.signals.post_save.disconnect(self.handle_save) models.signals.post_delete.disconnect(self.handle_delete) # Efficient would be going through all backends & collecting all models # being used, then disconnecting signals only for those.
  • 49. Haystack: Pros and Cons Pros: • easy to setup • looks like Django ORM but for searches • search engine independent • support 4 engines (Elastic, Solr, Xapian, Whoosh) Cons: • poor SearchQuerySet API • difficult to manage stop words • loose performance, because extra layer • Model - based
  • 50. Results Python 
 clients Python 3 Django
 support elasticsearch-py
 elasticsearch-dsl- py
 elasticsearch-py- async YES haystack +
 elasticstack
 psycopg2 YES djorm-ext- pgfulltext
 django.contrib.po stgres Whoosh YES support using haystack
  • 51. ResultsIndexes Without indexes PUT /index/ No support GIN/GIST to_tsvector() index folder No support
  • 53. ResultsSynonyms Scale YES YES NO SUPPORT I’m not sure NO SUPPORT NO