Searching for The Matrix in haystack
        (with Elasticsearch)
         Synopsi.TV case study



           Tomáš Sirný
           @junckritter

 Pyvo/Rubyslava November 2012
The Environment
●   Recommendation service for movies, TV shows
●   People mark titles they have watched (check-in) and rate
    them
●   Get recommendations
●   Make „Watch Later“ or other-purpose lists
●   …
●   Search (to check-in, add to list, share, etc.)
The Problem
●   Input box for search on top of web page
●   Many movies, TV shows in database
●   Many of them have similar titles or use similar
    words
●   Some are more likely to be searched for
●   Very little input to go on – 3 or 4 letters
●   Autocomplete, not only exact match
The Red Pill
The Blue Pill
The Tool
●   Elasticsearch – designed for searching in
    documents
●   Based on Lucene – de facto standard
●   Young yet feature-rich
●   Rapid development (despite a single core developer)
●   A commercial company was recently founded
●   $10M in Series A funding
The (Wannabe) Solution
●   Differentiate titles
●   Have cover, plot, cast, directors
●   Year
●   Popularity (whatever it means)
●   Prefer ones with more data, more popular
The Text – First Attempt

●   Text Query (now Match Query)
●   phrase_prefix type – all words in input with
    matching of prefixes („m“, „ma“, „mat“, …), same
    order of words
●   operator: and
●   not_analyzed „name“ field (not broken down into
    words)
The Text – First Attempt

●   slop parameter – allows words to change order or be
    skipped
                 „matrix revolutions“

                 „revolutions matrix“

              „matrix first revolutions“
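A minimal sketch of what such a query body might look like, written as a plain Python dict. The field name "name" comes from the slides. The function name, the default slop value, and the choice of the match_phrase_prefix query form are assumptions made for illustration, not the talk's actual code.

```python
import json


def build_title_query(user_input, slop=2):
    """Build a phrase-prefix query body for the "name" field.

    slop is the number of position moves tolerated between the words,
    which is what lets reordered or skipped words still match.
    """
    return {
        "query": {
            "match_phrase_prefix": {
                "name": {
                    "query": user_input,
                    "slop": slop,
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON that would be sent to the search endpoint.
    print(json.dumps(build_title_query("matrix rev"), indent=2))
```

The dict is only built and printed here; sending it to a cluster is left out on purpose.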
The Sorting – First Attempt
●   Default scoring considers only occurrences of the query text in
    documents
●   We also want other properties of the document to
    count
●   Custom Score Query
●   Define script for scoring

        "script": "_score * doc['rating'].value"
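As a rough sketch, the script can be wrapped around a text query like this. The custom_score query type and the script string are the ones named on the slide (later Elasticsearch releases replaced custom_score with function_score); the function name and the inner query used in the example are assumptions.

```python
def build_custom_score_query(text_query, script="_score * doc['rating'].value"):
    """Wrap a text query so that its score is recomputed by a script."""
    return {
        "query": {
            "custom_score": {
                "query": text_query,  # the original text-matching query
                "script": script,     # combines text score with a document field
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical inner query, for illustration only.
    inner = {"match_phrase_prefix": {"name": "matrix"}}
    print(build_custom_score_query(inner))
```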
The Rating
●   Lets us prefer more „popular“ titles
●   External – top lists, links, etc.
●   Internal – usage data from system
●   Problem for newly added titles – they lack data of
    both types
The Tuning of Rating
●   Get rid of external data
●   Only score „completeness“ of each document
●   Release year


               "script": "3 * log(_score) +
       1 * log(doc['year'].date.year - 1880) +
    0.75 * log(doc['watched_count'].value + 1)"
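To see how the three terms trade off, the same arithmetic can be reproduced outside the search engine. This is a sketch that assumes log is the natural logarithm, that the text score is positive, and that the release year is later than 1880; the function and argument names are made up for the example.

```python
import math


def title_score(text_score, year, watched_count):
    """Recompute the tuned score from the slide for a single title.

    text_score    -- relevance score of the text match (must be > 0)
    year          -- release year (must be > 1880)
    watched_count -- number of check-ins; the +1 keeps log() defined at 0
    """
    return (
        3 * math.log(text_score)
        + 1 * math.log(year - 1880)
        + 0.75 * math.log(watched_count + 1)
    )


if __name__ == "__main__":
    # Same text score: the newer, more-watched title ranks higher.
    print(title_score(2.0, 1999, 5000))
    print(title_score(2.0, 1950, 10))
```

Taking logarithms flattens large differences, so a title with a thousand times more check-ins gains only a bounded advantage over a new title with none.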
The Tuning of Query
●    Name field analyzed, edgeNGram filter

index:
  analysis:
    filter:
      my_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 11
        side: front
    analyzer:
      my_analyzer:
        type: custom
        tokenizer: standard
        filter: [lowercase, asciifolding, my_ngram]
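The effect of this analyzer can be approximated in a few lines of Python. This is only an illustration of which terms end up in the index: the regular expression stands in for the standard tokenizer and Unicode decomposition stands in for asciifolding, so edge cases will differ from what Elasticsearch actually produces.

```python
import re
import unicodedata


def ascii_fold(text):
    """Strip diacritics, roughly like the asciifolding filter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(text, min_gram=1, max_gram=11):
    """Lowercase, fold, tokenize, then emit front edge n-grams per token."""
    tokens = re.findall(r"\w+", ascii_fold(text.lower()))
    grams = []
    for token in tokens:
        longest = min(len(token), max_gram)
        for size in range(min_gram, longest + 1):
            grams.append(token[:size])
    return grams


if __name__ == "__main__":
    print(edge_ngrams("The Matrix"))
```

Because every prefix of every word is indexed as its own term, a query for "mat" is an exact term match rather than a prefix scan.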
The AKA's

●   Also known as – names of a title in different
    countries
●   A lot of additional data, sometimes only „noise“
●   The „original“ name is still the most important
The AKA's
●   Array of AKAs – problems with scoring of short
    names
●   Nested AKA documents – the query does not return
    the nested document that matched

●   AKA document as a child of the title – has its own
    information (original, country, slug)
●   Top Children Query – which AKA matched
●   Another query with Ids Filter – get titles
The Sorting – Second Attempt
●   Custom Filter Score Query – applies a set of filters,
    each filter boosts documents which pass its
    condition
●   boost parameter of a filter – sets the
    importance of that filter
●   score_mode – sum or product of the boost values
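The way boosts combine can be sketched in plain Python. This is a simulation of the idea, not the Elasticsearch API: the function name, the document fields, the boost values, and the choice to return a neutral 1.0 when no filter matches are all assumptions.

```python
from functools import reduce
import operator


def combined_boost(doc, boosted_filters, score_mode="sum"):
    """Combine the boosts of every filter whose predicate accepts doc.

    boosted_filters -- list of (predicate, boost) pairs
    score_mode      -- "sum" or "product", as on the slide
    """
    boosts = [boost for predicate, boost in boosted_filters if predicate(doc)]
    if not boosts:
        return 1.0  # assumption: no matching filter leaves the score unchanged
    if score_mode == "sum":
        return float(sum(boosts))
    if score_mode == "product":
        return float(reduce(operator.mul, boosts, 1.0))
    raise ValueError("unknown score_mode: %r" % score_mode)


if __name__ == "__main__":
    # Hypothetical document and filters, for illustration only.
    doc = {"is_original": True, "genres": ["Action"]}
    filters = [
        (lambda d: d["is_original"], 2.0),
        (lambda d: "Short" not in d["genres"], 1.5),
    ]
    print(combined_boost(doc, filters, "sum"))
    print(combined_boost(doc, filters, "product"))
```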
The Sorting – Used Score Filters
●   Release date (in case of TV show last episode)
    in last 6 months
●   Release date in next 3 months
●   „original“ AKA
●   Has all important categories filled in
●   Not Short genre
●   Not TV movie
The Sorting – Short Input
●   Special case: 1 – 3 letters
●   An exact match is very rare
●   Should work after the first letter is typed
●   Only titles from this year
●   3 letters – also titles from the near future and the previous
    year
The Year in Input
●   Matrix 1999
●   Matrix Reloaded (2003)
●   Matrix 2000- (released up to 2000)
●   Matrix 2000+ (released since 2000)
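One possible way to split such input into a title part and a year range is sketched below. The regular expression, the function name, and the (title, from_year, to_year) return shape are assumptions; the slides do not show how the parsing was actually done.

```python
import re

# Title, then a trailing four-digit year, optionally parenthesised,
# optionally followed by "+" or "-".
YEAR_SUFFIX = re.compile(
    r"^(?P<title>.*?)\s*\(?(?P<year>(?:18|19|20)\d{2})\)?(?P<mod>[+-]?)\s*$"
)


def parse_year_input(text):
    """Return (title, from_year, to_year); None means an open bound."""
    text = text.strip()
    match = YEAR_SUFFIX.match(text)
    # No trailing year, or nothing but a year: treat it all as the title.
    if not match or not match.group("title"):
        return text, None, None
    title = match.group("title")
    year = int(match.group("year"))
    modifier = match.group("mod")
    if modifier == "-":
        return title, None, year
    if modifier == "+":
        return title, year, None
    return title, year, year


if __name__ == "__main__":
    for query in ["Matrix 1999", "Matrix Reloaded (2003)", "Matrix 2000-", "Matrix 2000+"]:
        print(query, "->", parse_year_input(query))
```

A title that itself ends in a year-like number would be split the same way, so a real implementation would probably run the plain query as well.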
One More Thing – Advanced Search
●   Titles also have data about their usage
●   „Watched by Friends“ filter
    Shows titles with IDs of your „friends“ in the relevant
    field (TermsFilter([IDS]))
●   „Not Watched“ filter
    Shows titles in which your ID is absent
    (NotFilter(TermFilter(ID)))
●   Combination – titles to watch to catch up with
    friends
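The logic of the two filters and their combination can be mirrored on in-memory data. This sketch only imitates what the filters select; the "watched_by" field name, the function names, and the sample records are hypothetical.

```python
def watched_by_friends(titles, friend_ids):
    """Keep titles whose watched_by list contains at least one friend ID."""
    friend_ids = set(friend_ids)
    return [t for t in titles if friend_ids & set(t["watched_by"])]


def not_watched_by(titles, user_id):
    """Keep titles whose watched_by list does not contain user_id."""
    return [t for t in titles if user_id not in t["watched_by"]]


def catch_up_with_friends(titles, user_id, friend_ids):
    """Titles friends have watched that the user has not."""
    return not_watched_by(watched_by_friends(titles, friend_ids), user_id)


if __name__ == "__main__":
    # Hypothetical sample data.
    titles = [
        {"name": "The Matrix", "watched_by": [1, 2, 3]},
        {"name": "The Matrix Reloaded", "watched_by": [2, 3]},
        {"name": "The Matrix Revolutions", "watched_by": [7]},
    ]
    for title in catch_up_with_friends(titles, user_id=1, friend_ids=[2, 3]):
        print(title["name"])
```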
The End




  Thanks


Tomáš Sirný
@junckritter
