Multilingual Search and Text Analytics with Solr
Steve Kearns
Director of Product Management
Basis Technology
Basis Technology – Open Source Search Conference 2012   1

Agenda
•  Why is Language Important?
•  Approaches for language-aware search
•  Solr Configuration Options

Language is Important

Why is language important?
•  Content is produced and consumed in the native language
•  Document collections often contain more than one language
•  Each language is unique, and presents different challenges to the search engine

Language is Complex
•  Tokenization
    •  Some languages do not use spaces
    •  Compound words combine two or more words
    •  Conjunctions
•  Inflection
    •  In grammar, inflection is the modification of a word to express different grammatical categories such as tense, grammatical mood, grammatical voice, aspect, person, number, gender and case.

Language is Complex
[Figure: FlexiónGato.png, source: http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png]

Language is Complex!
•  The Spanish word “pasaportar” has more than 50 inflected forms:

   pasaportando      pasaportareis      pasaportarán
   pasaportes        pasaportaron       pasaporte
   pasaportada       pasaportase        pasaportan
   pasaportaba       pasaportemos       pasaporta
   pasaportarían     pasaportaría       pasaportaste
   pasaportarais     pasaportara        pasaportad
   pasaportasen      pasaportasteis     pasaportéis
   pasaportaren      pasaportáramos     pasaportadas
   pasaportado       pasaportaban       pasaporté
   pasaportaremos    pasaportásemos     pasaportados
   pasaportábamos    pasaportamos       pasaportaré
   pasaportases      pasaporten         pasaportare
   pasaportaríais    pasaportaréis      pasaportará
   pasaportaran      pasaportabas       pasaportó
   pasaportarías     pasaportaríamos    pasaportabais
   pasaportaras      pasaportáremos     pasaportaseis
   pasaportarás      pasaporto          …

Source: http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar

Language Examples
•  English:
     spoke (Noun – wheel part)          → spoke
     spoke (Verb, past tense)           → speak
•  French:
     été (summer)                       → été (summer)
     été (was)                          → être (to be)
•  German:
     Robbe (seal)                       → Robbe (seal)
     robbe (I crawl)                    → robben (to crawl)
     Samstagmorgen (Saturday Morning)   → Samstag, Morgen (compound)
•  Japanese:
     •  首脳会談後、オバマ大統領は記者団の質問に答える予定
          –  Where are the words??

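To make the "Where are the words??" point concrete, the following Python sketch (not part of the original talk) applies the simplest possible tokenizer, splitting on whitespace, to an English sentence and to the Japanese sentence above. The function name is an illustrative assumption.

```python
# Naive whitespace tokenization: adequate for English, useless for Japanese.
def whitespace_tokens(text: str) -> list[str]:
    """Split on runs of whitespace, the simplest possible tokenizer."""
    return text.split()

english = "I have spoken at several conferences"
japanese = "首脳会談後、オバマ大統領は記者団の質問に答える予定"

print(len(whitespace_tokens(english)))   # 6 tokens
print(len(whitespace_tokens(japanese)))  # 1 token: the whole sentence
```

Because the Japanese sentence contains no spaces, the whole sentence comes back as a single token, which is why a language-aware tokenizer is needed.
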
Language-Aware Search Technology
•  Rosette Linguistic Platform
    •  Language Identification
    •  Tokenization
         »  Morphological
    •  Token processing
         »  Lemmatization
    •  Higher level analytics
         »  Entity Extraction
         »  Relationship Extraction
    •  Entity Translation and Entity Search

Language Identification
•  Find a single dominant language in a document
•  Find multiple languages in a single document

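The two tasks above can be sketched with a deliberately crude stopword counter. This is a toy under stated assumptions, not how a production identifier works: the stopword lists are hand-picked, the function names are hypothetical, and ties silently resolve to the first language listed.

```python
from collections import Counter

# Tiny, hand-picked stopword lists; purely illustrative, not a real model.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "to", "in"},
    "es": {"el", "la", "es", "y", "de", "en"},
    "de": {"der", "die", "und", "ist", "nach", "ich"},
}

def score_languages(text: str) -> Counter:
    """Count how many tokens of `text` appear in each language's stopword list."""
    tokens = text.lower().split()
    return Counter({lang: sum(t in words for t in tokens)
                    for lang, words in STOPWORDS.items()})

def dominant_language(text: str) -> str:
    """Task 1: return the single best-scoring language for the whole text."""
    return score_languages(text).most_common(1)[0][0]

def languages_by_line(document: str) -> list[str]:
    """Task 2: identify a language per line to surface multilingual documents."""
    return [dominant_language(line)
            for line in document.splitlines() if line.strip()]
```

Running `dominant_language` on a whole document answers the first question, while `languages_by_line` exposes a document that mixes languages.
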
Tokenization
•  Morphological Analysis vs. N-gram
•  Search Term: 東京 ルパン上映時間
•  N-gram:
•  Morphological Analysis:

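The slide's result graphics did not survive extraction. As a stand-in for the N-gram side only, here is a minimal Python sketch of overlapping character bigrams over the search term. The function is an illustrative assumption, not the tokenizer used in the talk.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams for each whitespace-separated chunk."""
    grams = []
    for chunk in text.split():
        if len(chunk) <= n:
            grams.append(chunk)          # chunk too short to split further
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams

query = "東京 ルパン上映時間"
print(char_ngrams(query))
```

The bigrams include fragments such as "パン", which is a word in its own right (bread), so an n-gram index can match documents unrelated to the query. Morphological analysis aims to emit only real words instead.
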
Token Processing
•  Stemming vs. Lemmatization
•  English: “I have spoken at several conferences”
•  Stemming:
•  Lemmatization:

Stemming vs. Lemmatization
•  Two words with the same spelling, but different meanings create the same stem.

  Stemming                               Lemmatization
  prensa (media)            → prens      prensa (media)            → prensa (media)
  prensa (he/she presses)   → prens      prensa (he/she presses)   → prensar (to press)
  INCORRECT                              CORRECT

Stemming vs. Lemmatization
•  Two different words create the same stem.

  Stemming                                   Lemmatization
  publicaciones (publications) → public      publicaciones (publications) → publicación
  publico (public)             → public      publico (public)             → public (public)
  INCORRECT                                  CORRECT

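Both failure modes can be reproduced with a toy suffix stripper and a toy dictionary lookup. The suffix list, the lemma table, and the part-of-speech tags below are illustrative assumptions chosen to match the slide examples. They are not a real Spanish stemmer or lemmatizer.

```python
# Toy Spanish suffix list; order matters, longest suffixes first.
SUFFIXES = ("aciones", "ación", "ar", "a", "o")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring part of speech entirely."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer needs the part of speech (from context) to pick the dictionary form.
LEMMAS = {
    ("prensa", "NOUN"): "prensa",               # media
    ("prensa", "VERB"): "prensar",              # to press
    ("publicaciones", "NOUN"): "publicación",   # publications
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the dictionary form; fall back to the word itself if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_stem("prensa"), toy_stem("publicaciones"), toy_stem("publico"))
print(toy_lemmatize("prensa", "NOUN"), toy_lemmatize("prensa", "VERB"))
```

The stemmer collapses both senses of "prensa" to "prens" and merges "publicaciones" with "publico", while the lookup keyed on part of speech keeps them apart.
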
Token Processing
German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
•  Stemming:
•  Lemmatization (and decompounding!):

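Decompounding can be sketched as a dictionary-driven split that prefers the longest leading part. The mini-lexicon and the function below are hypothetical; a real decompounder would use a full dictionary and handle linking elements.

```python
# Hypothetical mini-lexicon; a real decompounder would use a full dictionary.
LEXICON = {"samstag", "morgen", "sam", "tag"}

def decompound(word: str, lexicon: set[str] = LEXICON) -> list[str]:
    """Split `word` into lexicon entries, preferring the longest leading part."""
    lower = word.lower()
    if lower in lexicon:
        return [lower]
    # Try the longest prefix first so "samstag" wins over "sam".
    for cut in range(len(lower) - 1, 0, -1):
        head, tail = lower[:cut], lower[cut:]
        if head in lexicon:
            rest = decompound(tail, lexicon)
            if rest:
                return [head] + rest
    return []  # no full segmentation found

print(decompound("Samstagmorgen"))
```

With the parts indexed separately, a query for "Samstag" or "Morgen" can match a document that only contains "Samstagmorgen".
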
How to Configure Solr
•  Challenges
    •  Multiple languages in the data set
•  Goals:
    1.  Language Identification
    2.  Language-aware Search:
         •  Tokenization
         •  Token Processing

How to Configure Solr
•  What tools does Solr have to work with?
   •  UpdateRequestProcessor
   •  Analyzer/CharFilter/Tokenizer/TokenFilter
   •  Solr Cores
•  Pre-process data before Solr?

Solr UpdateRequestProcessor
•  Runs Before Analyzers
•  Full Access to Document
•  Two options:
    •  Run the analysis directly in Solr
         •  Good for Lightweight Analysis
    •  Call out to external analysis services
         •  Web Services/UIMA. Increases Complexity
•  Limitations:
    •  Think through your indexing strategy

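The role of an update processor can be modeled in a few lines of Python. This is a conceptual sketch only: Solr's real processors are written in Java, and every name here (`Document`, `make_language_processor`, `run_chain`, `fake_detect`) is a hypothetical stand-in.

```python
from typing import Callable

Document = dict[str, str]
Processor = Callable[[Document], Document]

def make_language_processor(detect: Callable[[str], str]) -> Processor:
    """Build a processor that tags each document with a detected language."""
    def process(doc: Document) -> Document:
        tagged = dict(doc)                       # the processor sees every field
        tagged["language"] = detect(doc.get("text", ""))
        return tagged
    return process

def run_chain(doc: Document, chain: list[Processor]) -> Document:
    """Apply each processor in order, before any field-level analysis."""
    for processor in chain:
        doc = processor(doc)
    return doc

# Stand-in detector: a real one could run in-process or call an external service.
def fake_detect(text: str) -> str:
    """Report "ja" if the text contains any hiragana or katakana, else "en"."""
    return "ja" if any("\u3040" <= ch <= "\u30ff" for ch in text) else "en"

chain = [make_language_processor(fake_detect)]
print(run_chain({"id": "1", "text": "東京 ルパン上映時間"}, chain))
```

Because the detector is passed in as a function, the same chain can wrap either a lightweight in-process analysis or a call to an external service, which are the two options on the slide.
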
Solr Analyzer/Tokenizer
•  Good for:
     •  Segmentation of Asian languages
     •  Linguistics - Lemmatization
•  Limitations:
     •  No access to document object
•  Schema.xml
     •  FieldType
          •  Analyzer
               –  CharFilter
               –  Tokenizer
               –  TokenFilter

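The CharFilter, Tokenizer, TokenFilter ordering can be pictured as a simple function pipeline. The Python below is a conceptual model with made-up stage implementations, not Solr's Java classes or its schema syntax.

```python
import re
from typing import Callable, Iterable

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], list[str]]
TokenFilter = Callable[[Iterable[str]], Iterable[str]]

def strip_markup(text: str) -> str:
    """CharFilter stage: rewrite the raw character stream (here, drop tags)."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text: str) -> list[str]:
    """Tokenizer stage: cut the character stream into tokens."""
    return text.split()

def lowercase(tokens: Iterable[str]) -> Iterable[str]:
    """TokenFilter stage: transform, add, or drop tokens."""
    return (t.lower() for t in tokens)

def analyze(text: str, char_filters: list[CharFilter],
            tokenizer: Tokenizer, token_filters: list[TokenFilter]) -> list[str]:
    """Run the three stages in order on the text of a single field."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens: Iterable[str] = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return list(tokens)

print(analyze("<b>Multilingual</b> Search", [strip_markup],
              whitespace_tokenizer, [lowercase]))
```

Note that `analyze` receives only one field's text. It has no view of the rest of the document, which mirrors the limitation listed on the slide.
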
Goal 1: Language ID
•  UpdateRequestProcessor
    •  Runs before field-level analysis takes place
    •  Basic Language Identifier URP to be included in Solr
•  Outside Solr

What do you do with the language information??

Goal 2: Multi-Lingual Support in Solr
•  Three main approaches:
     1.  One Solr field for each language
     2.  One Solr Core per language
     3.  All Languages in a Single Field

Informed by Trey Grainger @ Careerbuilder: http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20-%20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf

Multiple Languages: Method 1
•  One field for each language
   •  Pros:
        •  Simple approach and implementation
        •  Guarantees that queries are processed the same way as the index
   •  Cons:
        •  Increased query-time complexity (mitigate with Dismax)
        •  Decreased query speed as additional fields are queried
        •  May require storing multiple copies of text

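A minimal sketch of Method 1 from the client side, assuming hypothetical field names of the form `text_<language>` whose field types carry language-specific analyzers. The `defType`, `q`, and `qf` request parameters are standard Solr DisMax parameters. Everything else is illustrative.

```python
def to_solr_doc(doc_id: str, text: str, language: str) -> dict[str, str]:
    """Store the text in a field named after its language, e.g. text_es."""
    return {"id": doc_id, "language": language, f"text_{language}": text}

def dismax_params(user_query: str, languages: list[str]) -> dict[str, str]:
    """Build request parameters that search every per-language field at once."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(f"text_{lang}" for lang in languages),
    }

print(to_solr_doc("1", "la prensa", "es"))
print(dismax_params("prensa", ["en", "es", "de"]))
```

Each extra language adds another field to `qf`, which is the query-time cost the slide warns about.
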
Multiple Languages: Method 2
•  One Solr core per language
   Each Core has the same field, with a language-specific Analyzer/Tokenizer
   •  Pros:
        •  No query-time performance overhead
        •  Guarantees that queries are processed the same way as the index
   •  Cons:
        •  Significant complexity in managing multiple cores
        •  Must implement custom sharding
        •  Does not support multilingual documents

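With one core per language, the client has to route each request itself. The sketch below assumes a local Solr instance and made-up core names. Only the `/solr/<core>/update` and `/solr/<core>/select` path layout reflects Solr's usual URL scheme.

```python
BASE_URL = "http://localhost:8983/solr"   # assumed local Solr instance
CORES = {"en": "docs_en", "es": "docs_es", "de": "docs_de"}  # hypothetical names

def core_for(language: str) -> str:
    """Look up the core that owns this language, failing loudly if none does."""
    try:
        return CORES[language]
    except KeyError:
        raise ValueError(f"no core configured for language {language!r}") from None

def update_url(language: str) -> str:
    """Endpoint to which documents in this language are sent for indexing."""
    return f"{BASE_URL}/{core_for(language)}/update"

def select_url(language: str) -> str:
    """Endpoint to which queries in this language are sent."""
    return f"{BASE_URL}/{core_for(language)}/select"

print(update_url("es"))
```

Each document is sent to exactly one core, which is why this layout has no natural home for a document written in several languages.
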
Multiple Languages: Method 3
•  All Languages in one field
    •  Pros:
         •  Single field makes queries and indexing easy
         •  Same schema/core as more languages added
    •  Cons:
         •  Requires complex custom Tokenizer/Analyzer
         •  Must pass in language information for queries and indexing
         •  Does not guarantee queries are processed the same as the index
         •  Potential TF/IDF confusion

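The "potential TF/IDF confusion" can be illustrated with a toy document-frequency count over a single shared field. The analyzer, the supported-language set, and the sample documents are all assumptions made for this sketch.

```python
from collections import Counter

def analyze(text: str, language: str) -> list[str]:
    """Single-field analyzer: the caller must say which language the text is in."""
    if language not in {"en", "de"}:
        raise ValueError(f"unsupported language: {language!r}")
    # This toy version only lowercases and splits; a real analyzer would
    # switch tokenizer and lemmatizer based on `language`.
    return text.lower().split()

docs = [
    ("en", "never say die"),
    ("de", "die Robbe"),
    ("de", "die Katze"),
]

# Document frequency over the one shared field, regardless of language.
doc_freq: Counter = Counter()
for language, text in docs:
    doc_freq.update(set(analyze(text, language)))

print(doc_freq["die"])    # counts the English verb and the German article together
print(doc_freq["robbe"])
```

The English verb "die" and the German article "die" end up as the same term, so the rarity of the English word is skewed by how common the German one is.
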
Language is Important
•  Use language information at index and query time
•  Increase recall, maintain precision
•  Better search results for your users

My Contact Info
•  Steve Kearns
    •  skearns@basistech.com
    •  http://www.basistech.com

Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 

Dernier (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 

Multilingual Search with Solr

  • 1. Multilingual Search and Text Analytics with Solr
      Steve Kearns, Director of Product Management, Basis Technology
      Basis Technology – Open Source Search Conference 2012
  • 2. Agenda
      •  Why is Language Important?
      •  Approaches for language-aware search
      •  Solr Configuration Options
  • 3. Language is Important
  • 4. Why is language important?
      •  Content is produced and consumed in the native language
      •  Document collections often contain more than one language
      •  Each language is unique, and presents different challenges to the search engine
  • 5. Language is Complex
      •  Tokenization
          •  Some languages do not use spaces
          •  Compound words combine two or more words
          •  Conjunctions
      •  Inflection
          •  In grammar, inflection is the modification of a word to express different grammatical categories such as tense, grammatical mood, grammatical voice, aspect, person, number, gender and case.
  • 6. Language is Complex
      Image: http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png
  • 7. Language is Complex!
      •  The Spanish word “pasaportar” has more than 50 inflected forms:
          pasaportando, pasaportareis, pasaportarán, pasaportes, pasaportaron, pasaporte, pasaportada, pasaportase, pasaportan, pasaportaba,
          pasaportemos, pasaporta, pasaportarían, pasaportaría, pasaportaste, pasaportarais, pasaportara, pasaportad, pasaportasen, pasaportasteis,
          pasaportéis, pasaportaren, pasaportáramos, pasaportadas, pasaportado, pasaportaban, pasaporté, pasaportaremos, pasaportásemos, pasaportados,
          pasaportábamos, pasaportamos, pasaportaré, pasaportases, pasaporten, pasaportare, pasaportaríais, pasaportaréis, pasaportará, pasaportaran,
          pasaportabas, pasaportó, pasaportarías, pasaportaríamos, pasaportabais, pasaportaras, pasaportáremos, pasaportaseis, pasaportarás, pasaporto, …
      Source: http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar
  • 8. Language Examples
      •  English:
          spoke (Noun – wheel part) → spoke
          spoke (Verb, past tense) → speak
      •  French:
          été (summer) → été (summer)
          été (was) → être (to be)
      •  German:
          Robbe (seal) → Robbe (seal)
          robbe (I crawl) → robben (to crawl)
          Samstagmorgen (Saturday Morning) → Samstag, Morgen (compound)
      •  Japanese:
          •  首脳会談後、オバマ大統領は記者団の質問に答える予定
              –  Where are the words??
  • 9. Language-Aware Search Technology
      •  Rosette Linguistic Platform
          •  Language Identification
          •  Tokenization
              »  Morphological
          •  Token processing
              »  Lemmatization
          •  Higher level analytics
              »  Entity Extraction
              »  Relationship Extraction
          •  Entity Translation and Entity Search
  • 10. Language Identification
      •  Find a single dominant language in a document
      •  Find multiple languages in a single document
  • 11. Tokenization
      •  Morphological Analysis vs. N-gram
      •  Search Term: 東京 ルパン上映時間
      •  N-gram:
      •  Morphological Analysis:
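The token output for this slide did not survive in the transcript. As a rough illustration of the N-gram side only, the sketch below splits the slide's search term into overlapping character bigrams. The function name `char_ngrams` is hypothetical, and the code is plain Python rather than any Solr tokenizer.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams for each whitespace-separated chunk."""
    grams = []
    for chunk in text.split():
        if len(chunk) <= n:
            # A chunk no longer than n is kept whole.
            grams.append(chunk)
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


query = "東京 ルパン上映時間"
bigrams = char_ngrams(query)
print(bigrams)
```

The bigram list contains パン ("bread"), which is only a fragment of ルパン (Lupin). A bigram index can therefore match queries that have nothing to do with the original text, which is the kind of noise a morphological tokenizer is meant to avoid.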
  • 12. Token Processing
      •  Stemming vs. Lemmatization
      •  English: “I have spoken at several conferences”
      •  Stemming:
      •  Lemmatization:
  • 13. Stemming vs. Lemmatization
      •  Two words with the same spelling, but different meanings, create the same stem.
      •  Stemming (INCORRECT):
          prensa (media) → prens
          prensa (he/she presses) → prens
      •  Lemmatization (CORRECT):
          prensa (media) → prensa (media)
          prensa (he/she presses) → prensar (to press)
  • 14. Stemming vs. Lemmatization
      •  Two different words create the same stem.
      •  Stemming (INCORRECT):
          publicaciones (publications) → public
          publico (public) → public
      •  Lemmatization (CORRECT):
          publicaciones (publications) → publicación
          publico (public) → public (public)
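The two slides above can be reproduced with a toy example. This is a minimal sketch, assuming a hand-written suffix list and a tiny hand-written lexicon; both are hypothetical and are not the Snowball stemmer or the Rosette lemmatizer. The point is only that a stemmer looks at the spelling, while a lemmatizer needs to know which word (and part of speech) it is looking at.

```python
# Toy suffix rules, checked in order (longest first). Hypothetical rule set.
SUFFIXES = ["aciones", "ación", "ar", "a", "o"]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lexicon keyed by (surface form, part of speech).
LEXICON = {
    ("prensa", "NOUN"): "prensa",
    ("prensa", "VERB"): "prensar",
    ("publicaciones", "NOUN"): "publicación",
}


def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    return LEXICON.get((word.lower(), pos), word.lower())


# Stemming collapses distinct words onto one stem.
print(toy_stem("publicaciones"), toy_stem("publico"))

# Lemmatization keeps the noun and the verb reading of "prensa" apart.
print(toy_lemmatize("prensa", "NOUN"), toy_lemmatize("prensa", "VERB"))
```

In this sketch the part of speech is passed in by hand. A real lemmatizer has to work it out from context, which is where most of the cost of the approach lies.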
  • 15. Token Processing
      •  German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
      •  Stemming:
      •  Lemmatization (and decompounding!):
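The token output for this slide is also missing from the transcript. The decompounding step it mentions can be sketched as a dictionary-driven split. The vocabulary below is a hypothetical two-word list and `decompound` is an illustrative helper, not the algorithm used by any particular product.

```python
VOCAB = {"samstag", "morgen"}  # hypothetical vocabulary of known word parts


def decompound(word, vocab=VOCAB, min_len=3):
    """Split a compound into known parts; return [word] if no full split exists."""

    def split(rest):
        if not rest:
            return []
        # Try the longest known prefix first, then recurse on the remainder.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in vocab:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = split(word.lower())
    if parts is None or len(parts) < 2:
        return [word]
    # German nouns are capitalized, so restore the capital on each part.
    return [part.capitalize() for part in parts]


print(decompound("Samstagmorgen"))
print(decompound("Boston"))
```

With the parts indexed alongside the compound, a query for "Samstag" can match a document that only contains "Samstagmorgen".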
  • 16. How to Configure Solr
      •  Challenges
          •  Multiple languages in the data set
      •  Goals:
          1.  Language Identification
          2.  Language-aware Search:
              •  Tokenization
              •  Token Processing
  • 17. How to Configure Solr
      •  What tools does Solr have to work with?
          •  UpdateRequestProcessor
          •  Analyzer/CharFilter/Tokenizer/TokenFilter
          •  Solr Cores
      •  Pre-process data before Solr?
  • 18. Solr UpdateRequestProcessor
      •  Runs Before Analyzers
      •  Full Access to Document
      •  Two options:
          •  Run the analysis directly in Solr
              •  Good for Lightweight Analysis
          •  Call out to external analysis services
              •  Web Services/UIMA. Increases Complexity
      •  Limitations:
          •  Think through your indexing strategy
  • 19. Solr Analyzer/Tokenizer
      •  Good for:
          •  Segmentation of Asian languages
          •  Linguistics - Lemmatization
      •  Limitations:
          •  No access to document object
      •  schema.xml
          •  FieldType
              •  Analyzer
                  –  CharFilter
                  –  Tokenizer
                  –  TokenFilter
  • 20. Goal 1: Language ID
      •  UpdateRequestProcessor
          •  Runs before field-level analysis takes place
          •  Basic Language Identifier URP to be included in Solr
      •  Outside Solr
      What do you do with the language information??
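One answer to the question on this slide is to record the detected language on the document before it reaches field-level analysis. The sketch below shows that idea in plain Python, assuming a crude stopword-counting detector. It is not the Solr UpdateRequestProcessor API (which is Java), and the stopword lists, the `language` field name, and both function names are hypothetical.

```python
import re

# Hypothetical, very small stopword lists; a real identifier uses far richer models.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "have", "at", "i"},
    "de": {"der", "die", "und", "ist", "nach", "ich", "am"},
    "es": {"el", "la", "y", "es", "los", "una"},
}


def detect_language(text):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def add_language_field(doc, text_field="text"):
    """Mimic a pre-analysis step: see the whole document, add a language field."""
    enriched = dict(doc)
    enriched["language"] = detect_language(doc[text_field])
    return enriched


doc = add_language_field(
    {"id": "1", "text": "Am Samstagmorgen fliege ich zurueck nach Boston."}
)
print(doc["language"])
```

Once the language is stored on the document, it can drive the choice of field, core, or analyzer.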
  • 21. Goal 2: Multi-Lingual Support in Solr
      •  Three main approaches:
          1.  One Solr field for each language
          2.  One Solr Core per language
          3.  All Languages in a Single Field
      Informed by Trey Grainger @ Careerbuilder: http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20-%20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf
  • 22. Multiple Languages: Method 1
      •  One field for each language
      •  Pros:
          •  Simple approach and implementation
          •  Guarantees that queries are processed the same way as index
      •  Cons:
          •  Increased query-time complexity (mitigate with Dismax)
          •  Decreased query speed as additional fields are queried
          •  May require storing multiple copies of text
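A minimal sketch of Method 1, assuming fields named `text_en`, `text_de` and `text_es` (hypothetical names that a schema would have to define, each with its own language-specific analyzer). The document is written to the field for its language, and the query lists every language field in the DisMax `qf` parameter so the user does not have to pick one.

```python
from urllib.parse import urlencode

LANGUAGES = ["en", "de", "es"]  # hypothetical set of supported languages


def to_solr_doc(doc_id, text, language):
    """Place the text in the field that carries this language's analyzer."""
    return {"id": doc_id, f"text_{language}": text}


def dismax_params(user_query, languages=LANGUAGES):
    """Build request parameters that search all language fields at once."""
    return {
        "q": user_query,
        "defType": "dismax",
        "qf": " ".join(f"text_{lang}" for lang in languages),
    }


doc = to_solr_doc("1", "Am Samstagmorgen fliege ich zurueck nach Boston.", "de")
params = dismax_params("Samstag")
query_string = urlencode(params)
print(doc)
print(query_string)
```

Every language added to `qf` is another field searched per query, which is the speed cost listed on the slide.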
  • 23. Multiple Languages: Method 2
      •  One Solr core per language
          Each Core has the same field, with a language-specific Analyzer/Tokenizer
      •  Pros:
          •  No query-time performance overhead
          •  Guarantees that queries are processed the same way as index
      •  Cons:
          •  Significant complexity in managing multiple cores
          •  Must implement custom sharding
          •  Does not support multilingual documents
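The routing that Method 2 pushes onto the client can be sketched as follows. The base URL and the core names `docs_en`, `docs_de` and `docs_es` are assumptions made for illustration. Only the general `/solr/<core>/update` and `/solr/<core>/select` path layout is taken from Solr itself.

```python
SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance
CORES = {"en": "docs_en", "de": "docs_de", "es": "docs_es"}  # hypothetical cores


def core_for(language):
    """Map a language code to its core; unknown languages are rejected."""
    try:
        return CORES[language]
    except KeyError:
        raise ValueError(f"no core configured for language {language!r}") from None


def update_url(language):
    """URL that documents in this language are indexed through."""
    return f"{SOLR_BASE}/{core_for(language)}/update"


def select_url(language):
    """URL that queries in this language are sent to."""
    return f"{SOLR_BASE}/{core_for(language)}/select"


print(update_url("de"))
print(select_url("de"))
```

Each document goes to exactly one core, so a document that mixes two languages is analyzed with only one of them. That is the limitation named in the last bullet of the slide.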
  • 24. Multiple Languages: Method 3
      •  All Languages in one field
      •  Pros:
          •  Single field makes queries and indexing easy
          •  Same schema/core as more languages added
      •  Cons:
          •  Requires complex custom Tokenizer/Analyzer
          •  Must pass in language information for queries and indexing
          •  Does not guarantee queries are processed the same as the index
          •  Potential TF/IDF confusion
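The custom analyzer that Method 3 requires is essentially a dispatcher on the language passed in with the text. The sketch below uses two stand-in analyzers (whitespace splitting for English, character bigrams for Japanese). All names are hypothetical and the analyzers are deliberately simplistic.

```python
def analyze_en(text):
    """Stand-in English analyzer: lowercase and split on whitespace."""
    return [token.lower() for token in text.split()]


def analyze_ja(text):
    """Stand-in Japanese analyzer: overlapping character bigrams."""
    if len(text) <= 2:
        return [text]
    return [text[i:i + 2] for i in range(len(text) - 1)]


ANALYZERS = {"en": analyze_en, "ja": analyze_ja}


def analyze(text, language):
    """Single-field analyzer: choose the token pipeline from the language tag."""
    return ANALYZERS.get(language, analyze_en)(text)


indexed = analyze("東京タワー", "ja")
query_same_language = analyze("東京タワー", "ja")
query_wrong_language = analyze("東京タワー", "en")

print(set(indexed) & set(query_same_language))
print(set(indexed) & set(query_wrong_language))
```

When the query carries the same language as the indexed text, the tokens overlap. When the language is missing or wrong, the same string produces tokens that match nothing, which illustrates why this method cannot guarantee that queries are processed the same way as the index.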
  • 25. Language is Important
      •  Use language information at index and query time
      •  Increase recall, maintain precision
      •  Better search results for your users
  • 26. My Contact Info
      •  Steve Kearns
      •  skearns@basistech.com
      •  http://www.basistech.com