SlideShare a Scribd company logo
1 of 19
Designing a generic python search
engine API
Richard Boulton
@rboulton
richard@cnav.co.uk
Lots of search engines
● Lucene, Xapian, Sphinx, Solr, ElasticSearch,
Whoosh, Riak Search, Terrier, Lemur/Indri
● Also MySQL, PostgreSQL Full Text
components
● Also client-side engines using Redis, Mongo,
etc.
Generic API?
● Don't know what search features you need in
advance
● So, don't want to be stuck with an early choice.
● Also, don't want to learn new API for trying out
new engine.
I Need Input
Philosophy
● Expose common features in standard way
● Emulate missing features
● Don't get in the way
● Build useful features on top
Backend variation
● Fields / Schemas: not in Xapian, Lucene
● Schema modification: Sphinx, Whoosh can't
Updates
● Sphinx: no updates (in progress)
● Does update happen synchronously?
● Do updates return docids, or do docids need to
be supplied by client?
● Can docids be set by client?
● When do updates become live?
Scaling
● Multiple database searches?
● Searching remote databases?
● Replication?
Analysers
● Vary wildly between backends
● Stemming, splitting, n-grams
● Language detection
● Fuzzy searches
● Soundex (well, Metaphone)
Queries
● Many different features available
● Most engines support arbitrary booleans
● Some have XOR!
● Some only permit sets of filters
● Weighting schemes
● Need to expose native backend query parsers
Facets
● Information about result set
● Can be emulated (slow)
● Some backends approximate
● Some backends give stats, histograms
Other features
● Spelling correction
● Numeric and Date range searches
● Geospatial searches (box, geohash, distance)
Proposed design
● SearchClient class for each backend.
● A definition of standard behaviours that all
backends should provide.
● A definition of optional behaviours when more
than one backend provides them.
Proposed design
● Test suite to ensure that all backends support
common features.
● Programmatic way of checking which features a
backend supports? (Or just raise exception)
Proposed design
● Convenience SearchClient factory function:
c = multisearch.SearchClient('xapian', path=dbpath)
Documents
● Must support dictionary of fields
● Unicode values
● List(unicode) values
● May support arbitrary other field types, or
different data structures, if backend wants to.
Schemas
● Fields have types
● Automatic type “guessing” (client or server side)
● Some standard minimal set of analysers
● Text in a language
● Untokenised values
● Don't define exact output; just intent of standard
analysers.
Search representation
● Abstract query representation
● Tree of python objects.
● Overloaded operators for boolean.
● Chainable methods.
● (have actually written this)
● SearchClient.my_query_type()
Code
● Such as it is, on
http://github.com/rboulton/multisearch
● Suggestions for a better name appreciated
● Query representation is pretty good, rest is
pretty rough.

More Related Content

What's hot

The ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch pluginsThe ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch pluginsItamar
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineDaniel N
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
ElasticSearch - DevNexus Atlanta - 2014
ElasticSearch - DevNexus Atlanta - 2014ElasticSearch - DevNexus Atlanta - 2014
ElasticSearch - DevNexus Atlanta - 2014Roy Russo
 
Webinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphWebinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphLucidworks
 
Relevancy hacks for eCommerce
Relevancy hacks for eCommerceRelevancy hacks for eCommerce
Relevancy hacks for eCommerceVarun Thacker
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma APIKyle Banerjee
 
Future of pandas
Future of pandasFuture of pandas
Future of pandasJeff Reback
 
Elasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyElasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyItamar
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with RBarbara Fusinska
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearchJoey Wen
 
Www Search Engine But Not In Perl
Www Search Engine But Not In PerlWww Search Engine But Not In Perl
Www Search Engine But Not In PerlKonstantin Ivinsky
 
Elgg solr presentation
Elgg solr presentationElgg solr presentation
Elgg solr presentationbeck24
 

What's hot (19)

The ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch pluginsThe ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch plugins
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
ElasticSearch - DevNexus Atlanta - 2014
ElasticSearch - DevNexus Atlanta - 2014ElasticSearch - DevNexus Atlanta - 2014
ElasticSearch - DevNexus Atlanta - 2014
 
Webinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphWebinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and Graph
 
Relevancy hacks for eCommerce
Relevancy hacks for eCommerceRelevancy hacks for eCommerce
Relevancy hacks for eCommerce
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Future of pandas
Future of pandasFuture of pandas
Future of pandas
 
The eZ Platform Query Field
The eZ Platform Query FieldThe eZ Platform Query Field
The eZ Platform Query Field
 
Elasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyElasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easy
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
Www Search Engine But Not In Perl
Www Search Engine But Not In PerlWww Search Engine But Not In Perl
Www Search Engine But Not In Perl
 
Elgg solr presentation
Elgg solr presentationElgg solr presentation
Elgg solr presentation
 

Viewers also liked

Divoli & Medelyan: HCIR-2011 Presentation
Divoli & Medelyan: HCIR-2011 PresentationDivoli & Medelyan: HCIR-2011 Presentation
Divoli & Medelyan: HCIR-2011 PresentationAlyona Medelyan
 
新聞 X 謊言 用文字探勘挖掘財經新聞沒告訴你的真相(丘祐瑋)
新聞 X 謊言 用文字探勘挖掘財經新聞沒告訴你的真相(丘祐瑋)新聞 X 謊言 用文字探勘挖掘財經新聞沒告訴你的真相(丘祐瑋)
新聞 X 謊言 用文字探勘挖掘財經新聞沒告訴你的真相(丘祐瑋)David Chiu
 
SEO Best Practice Techniques
SEO Best Practice TechniquesSEO Best Practice Techniques
SEO Best Practice Techniquespixelbuilders
 
IMAGE COMPRESSION AND DECOMPRESSION SYSTEM
IMAGE COMPRESSION AND DECOMPRESSION SYSTEMIMAGE COMPRESSION AND DECOMPRESSION SYSTEM
IMAGE COMPRESSION AND DECOMPRESSION SYSTEMVishesh Banga
 
image compression using matlab project report
image compression  using matlab project reportimage compression  using matlab project report
image compression using matlab project reportkgaurav113
 
artificial neural network
artificial neural networkartificial neural network
artificial neural networkPallavi Yadav
 
Image processing and compression techniques
Image processing and compression techniquesImage processing and compression techniques
Image processing and compression techniquesAshwin Venkataraman
 
types of computer networks, protocols and standards
types of computer networks, protocols and standardstypes of computer networks, protocols and standards
types of computer networks, protocols and standardsMidhun Menon
 
neural network
neural networkneural network
neural networkSTUDENT
 
Evaluation question 2
Evaluation question 2Evaluation question 2
Evaluation question 2beerogers
 
Tools and Methodology for Research: Future of Science
Tools and Methodology for Research: Future of ScienceTools and Methodology for Research: Future of Science
Tools and Methodology for Research: Future of ScienceYannick Prié (Enseignement)
 
Supertech supernova,Supertech supernova
Supertech supernova,Supertech supernovaSupertech supernova,Supertech supernova
Supertech supernova,Supertech supernovaJagat Bharti
 
Q6. Evaluation.
Q6. Evaluation.Q6. Evaluation.
Q6. Evaluation.karleab
 

Viewers also liked (20)

Divoli & Medelyan: HCIR-2011 Presentation
Divoli & Medelyan: HCIR-2011 PresentationDivoli & Medelyan: HCIR-2011 Presentation
Divoli & Medelyan: HCIR-2011 Presentation
 
新聞 X 謊言 用文字探勘挖掘財經新聞沒告訴你的真相(丘祐瑋)
新聞 X 謊言 用文字探勘挖掘財經新聞沒告訴你的真相(丘祐瑋)新聞 X 謊言 用文字探勘挖掘財經新聞沒告訴你的真相(丘祐瑋)
新聞 X 謊言 用文字探勘挖掘財經新聞沒告訴你的真相(丘祐瑋)
 
SEO Best Practice Techniques
SEO Best Practice TechniquesSEO Best Practice Techniques
SEO Best Practice Techniques
 
IMAGE COMPRESSION AND DECOMPRESSION SYSTEM
IMAGE COMPRESSION AND DECOMPRESSION SYSTEMIMAGE COMPRESSION AND DECOMPRESSION SYSTEM
IMAGE COMPRESSION AND DECOMPRESSION SYSTEM
 
image compression using matlab project report
image compression  using matlab project reportimage compression  using matlab project report
image compression using matlab project report
 
Application Layer
Application LayerApplication Layer
Application Layer
 
artificial neural network
artificial neural networkartificial neural network
artificial neural network
 
Image processing and compression techniques
Image processing and compression techniquesImage processing and compression techniques
Image processing and compression techniques
 
types of computer networks, protocols and standards
types of computer networks, protocols and standardstypes of computer networks, protocols and standards
types of computer networks, protocols and standards
 
image compression ppt
image compression pptimage compression ppt
image compression ppt
 
neural network
neural networkneural network
neural network
 
15
1515
15
 
5784
57845784
5784
 
Igualdad libertad
Igualdad libertadIgualdad libertad
Igualdad libertad
 
Evaluation question 2
Evaluation question 2Evaluation question 2
Evaluation question 2
 
Tools and Methodology for Research: Future of Science
Tools and Methodology for Research: Future of ScienceTools and Methodology for Research: Future of Science
Tools and Methodology for Research: Future of Science
 
Supertech supernova,Supertech supernova
Supertech supernova,Supertech supernovaSupertech supernova,Supertech supernova
Supertech supernova,Supertech supernova
 
Prueba
PruebaPrueba
Prueba
 
chi phi thuc te khi du hoc nhat ban
chi phi thuc te khi du hoc nhat banchi phi thuc te khi du hoc nhat ban
chi phi thuc te khi du hoc nhat ban
 
Q6. Evaluation.
Q6. Evaluation.Q6. Evaluation.
Q6. Evaluation.
 

Similar to Designing a generic Python Search Engine API - BarCampLondon 8

Becoming "Facet"-nated with Search API
Becoming "Facet"-nated with Search APIBecoming "Facet"-nated with Search API
Becoming "Facet"-nated with Search APIcgmonroe
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Gabriele Bartolini
 
Journey and evolution of Presto@Grab
Journey and evolution of Presto@GrabJourney and evolution of Presto@Grab
Journey and evolution of Presto@GrabShubham Tagra
 
ElasticSearch as (only) datastore
ElasticSearch as (only) datastoreElasticSearch as (only) datastore
ElasticSearch as (only) datastoreTomas Sirny
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013Emanuel Calvo
 
An evening with Postgresql
An evening with PostgresqlAn evening with Postgresql
An evening with PostgresqlJoshua Drake
 
The Professional Programmer
The Professional ProgrammerThe Professional Programmer
The Professional ProgrammerDave Cross
 
GraphQL is actually rest
GraphQL is actually restGraphQL is actually rest
GraphQL is actually restJakub Riedl
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxMichael Hackstein
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "GlobalLogic Ukraine
 
Ducksboard - A real-time data oriented webservice architecture
Ducksboard - A real-time data oriented webservice architectureDucksboard - A real-time data oriented webservice architecture
Ducksboard - A real-time data oriented webservice architectureDucksboard
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Android developer fundamentals training overview Part II
Android developer fundamentals training overview Part IIAndroid developer fundamentals training overview Part II
Android developer fundamentals training overview Part IIYoza Aprilio
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 
"Building Modern PHP Applications" - Jackson Murtha, South Dakota Code Camp 2012
"Building Modern PHP Applications" - Jackson Murtha, South Dakota Code Camp 2012"Building Modern PHP Applications" - Jackson Murtha, South Dakota Code Camp 2012
"Building Modern PHP Applications" - Jackson Murtha, South Dakota Code Camp 2012Blend Interactive
 

Similar to Designing a generic Python Search Engine API - BarCampLondon 8 (20)

Handout: 'Open Source Tools & Resources'
Handout: 'Open Source Tools & Resources'Handout: 'Open Source Tools & Resources'
Handout: 'Open Source Tools & Resources'
 
Becoming "Facet"-nated with Search API
Becoming "Facet"-nated with Search APIBecoming "Facet"-nated with Search API
Becoming "Facet"-nated with Search API
 
Introducing Datawave
Introducing DatawaveIntroducing Datawave
Introducing Datawave
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
 
Journey and evolution of Presto@Grab
Journey and evolution of Presto@GrabJourney and evolution of Presto@Grab
Journey and evolution of Presto@Grab
 
ElasticSearch as (only) datastore
ElasticSearch as (only) datastoreElasticSearch as (only) datastore
ElasticSearch as (only) datastore
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013
 
An evening with Postgresql
An evening with PostgresqlAn evening with Postgresql
An evening with Postgresql
 
The Professional Programmer
The Professional ProgrammerThe Professional Programmer
The Professional Programmer
 
GraphQL is actually rest
GraphQL is actually restGraphQL is actually rest
GraphQL is actually rest
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB Foxx
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "
 
Ducksboard - A real-time data oriented webservice architecture
Ducksboard - A real-time data oriented webservice architectureDucksboard - A real-time data oriented webservice architecture
Ducksboard - A real-time data oriented webservice architecture
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Android developer fundamentals training overview Part II
Android developer fundamentals training overview Part IIAndroid developer fundamentals training overview Part II
Android developer fundamentals training overview Part II
 
Discovering python search engines
Discovering python search enginesDiscovering python search engines
Discovering python search engines
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
"Building Modern PHP Applications" - Jackson Murtha, South Dakota Code Camp 2012
"Building Modern PHP Applications" - Jackson Murtha, South Dakota Code Camp 2012"Building Modern PHP Applications" - Jackson Murtha, South Dakota Code Camp 2012
"Building Modern PHP Applications" - Jackson Murtha, South Dakota Code Camp 2012
 

More from Richard Boulton

Improving relevance with log information
Improving relevance with log informationImproving relevance with log information
Improving relevance with log informationRichard Boulton
 
Making a simple question into a complicated query
Making a simple question into a complicated queryMaking a simple question into a complicated query
Making a simple question into a complicated queryRichard Boulton
 
Search as a Service with Xapian - Search Solutions 2009
Search as a Service with Xapian - Search Solutions 2009Search as a Service with Xapian - Search Solutions 2009
Search as a Service with Xapian - Search Solutions 2009Richard Boulton
 
Comparing open source search engines
Comparing open source search enginesComparing open source search engines
Comparing open source search enginesRichard Boulton
 
The Xapian Open Source Search Engine
The Xapian Open Source Search EngineThe Xapian Open Source Search Engine
The Xapian Open Source Search EngineRichard Boulton
 

More from Richard Boulton (8)

Improving relevance with log information
Improving relevance with log informationImproving relevance with log information
Improving relevance with log information
 
Making a simple question into a complicated query
Making a simple question into a complicated queryMaking a simple question into a complicated query
Making a simple question into a complicated query
 
Interfaces to xapian
Interfaces to xapianInterfaces to xapian
Interfaces to xapian
 
Haystack
HaystackHaystack
Haystack
 
Search as a Service with Xapian - Search Solutions 2009
Search as a Service with Xapian - Search Solutions 2009Search as a Service with Xapian - Search Solutions 2009
Search as a Service with Xapian - Search Solutions 2009
 
Comparing open source search engines
Comparing open source search enginesComparing open source search engines
Comparing open source search engines
 
Optimising Xapian
Optimising XapianOptimising Xapian
Optimising Xapian
 
The Xapian Open Source Search Engine
The Xapian Open Source Search EngineThe Xapian Open Source Search Engine
The Xapian Open Source Search Engine
 

Recently uploaded

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

Designing a generic Python Search Engine API - BarCampLondon 8

  • 1. Designing a generic python search engine API Richard Boulton @rboulton richard@cnav.co.uk
  • 2. Lots of search engines ● Lucene, Xapian, Sphinx, Solr, ElasticSearch, Whoosh, Riak Search, Terrier, Lemur/Indri ● Also MySQL, PostgreSQL Full Text components ● Also client-side engines using Redis, Mongo, etc.
  • 3. Generic API? ● Don't know what search features you need in advance ● So, don't want to be stuck with an early choice. ● Also, don't want to learn new API for trying out new engine.
  • 5. Philosophy ● Expose common features in standard way ● Emulate missing features ● Don't get in the way ● Build useful features on top
  • 6. Backend variation ● Fields / Schemas: not in Xapian, Lucene ● Schema modification: Sphinx, Whoosh can't
  • 7. Updates ● Sphinx: no updates (in progress) ● Does update happen synchronously? ● Do updates return docids, or do docids need to be supplied by client? ● Can docids be set by client? ● When do updates become live?
  • 8. Scaling ● Multiple database searches? ● Searching remote databases? ● Replication?
  • 9. Analysers ● Vary wildly between backends ● Stemming, splitting, n-grams ● Language detection ● Fuzzy searches ● Soundex (well, Metaphone)
  • 10. Queries ● Many different features available ● Most engines support arbitrary booleans ● Some have XOR! ● Some only permit sets of filters ● Weighting schemes ● Need to expose native backend query parsers
  • 11. Facets ● Information about result set ● Can be emulated (slow) ● Some backends approximate ● Some backends give stats, histograms
  • 12. Other features ● Spelling correction ● Numeric and Date range searches ● Geospatial searches (box, geohash, distance)
  • 13. Proposed design ● SearchClient class for each backend. ● A definition of standard behaviours that all backends should provide. ● A definition of optional behaviours when more than one backend provides them.
  • 14. Proposed design ● Test suite to ensure that all backends support common features. ● Programmatic way of checking which features a backend supports? (Or just raise exception)
  • 15. Proposed design ● Convenience SearchClient factory function: c = multisearch.SearchClient('xapian', path=dbpath)
  • 16. Documents ● Must support dictionary of fields ● Unicode values ● List(unicode) values ● May support arbitrary other field types, or different data structures, if backend wants to.
  • 17. Schemas ● Fields have types ● Automatic type “guessing” (client or server side) ● Some standard minimal set of analysers ● Text in a language ● Untokenised values ● Don't define exact output; just intent of standard analysers.
  • 18. Search representation ● Abstract query representation ● Tree of python objects. ● Overloaded operators for boolean. ● Chainable methods. ● (have actually written this) ● SearchClient.my_query_type()
  • 19. Code ● Such as it is, on http://github.com/rboulton/multisearch ● Suggestions for a better name appreciated ● Query representation is pretty good, rest is pretty rough.