Ce diaporama a bien été signalé.
Le téléchargement de votre SlideShare est en cours. ×

Dawid Weiss- Finite state automata in lucene

Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Chargement dans…3
×

Consultez-les par la suite

1 sur 47 Publicité

Dawid Weiss- Finite state automata in lucene

Télécharger pour lire hors ligne

Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, their construction can be very efficient (memory and time-wise) and their field of applications very broad.

Finite state automata and transducers made it into Lucene fairly recently, but already show a very promising impact on search performance. This data structure is rarely exploited because it is commonly (and unfairly) associated with high complexity. During the talk, I will try to show that automata and transducers are in fact very simple, their construction can be very efficient (memory and time-wise) and their field of applications very broad.

Publicité
Publicité

Plus De Contenu Connexe

Diaporamas pour vous (20)

Les utilisateurs ont également aimé (20)

Publicité

Plus par Lucidworks (Archived) (20)

Dawid Weiss- Finite state automata in lucene

  1. 1. Finite State Automata in Dawid WEISS
  2. 2. . Dawid Weiss . 20+ years of coding 10 years assembly only . Academia & Research PhD in Information Retrieval, PUT Open source Carrot2 , HPPC, Lucene, … Industry & Business Carrot Search s.c. . .
  3. 3. Talk outline State machines (automata) FSAs, DFAs, FSTs and other XXXs. Use cases in Lucene and Solr Suggester. FuzzySearch. Index. No API details Still @experimental.
  4. 4. (Non)? Deterministic Finite State (Automata|Machines)
  5. 5. HashSet hash → slot → value 0x29384d34 → lucene 0xde3e3354 → lucid 0x00000666 → lucifer
  6. 6. HashSet hash → slot → value 0x29384d34 → lucene 0xde3e3354 → lucid 0x00000666 → lucifer FSA (deterministic) l u c e n e i d r f e
  7. 7. HashSet hash → slot → value 0x29384d34 → lucene 0xde3e3354 → lucid 0x00000666 → lucifer FSA (deterministic) l u c e n e i d exists(sequence) r oor(pre x) f ceil(pre x) e
  8. 8. k i l l b l deterministic, non-minimal i l
  9. 9. k i l l b l deterministic, non-minimal i l k i l l deterministic, minimal b
  10. 10. k i l l b l deterministic, non-minimal i l k i l l deterministic, minimal b k i l l non-deterministic, i l non-minimal b
  11. 11. (Sorted)Map lucene → 1 lucid → 2 lucifer → 666
  12. 12. (Sorted)Map lucene → 1 lucid → 2 lucifer → 666 FST (transducer) l u c e n e|1 i d|2 r|666 f e
  13. 13. (Sorted)Map lucene → 1 lucid → 2 lucifer → 666 FST (transducer) l|1 u c e n e i|1 d r f|664 e
  14. 14. NFSAs and Regular expressions a a e1e2 e1 e1 Determinization e+ e states explosion, not always possible Backtracking recursion explosion e* e e? e
  15. 15. a?nan
  16. 16. a?nan n=3 → a?a?a?aaa
  17. 17. a?nan n=3 → a?a?a?aaa Source: Russ Cox, Regular Expression Matching Can Be Simple And Fast (re2).
  18. 18. 35000 30000 25000 Time [ms] 20000 15000 10000 5000 0 0 5 10 15 20 25 30 Time of matching an for pattern a?n an , depending on n. Java 1.6, modern hardware.
  19. 19. Linear-time, minimal, deterministic FSA construction Linear algorithm from sorted input by Daciuk, Mihov, et al. Active path states that still can change States dictionary nodes that will never change
  20. 20. 1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP lucene
  21. 21. 1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP l u c e n e lucid
  22. 22. 1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP l u c e n e i d
  23. 23. 1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP l u c e n e i d lucifer
  24. 24. 1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP l u c e n e i d f e r
  25. 25. 1) common AP pre x 2) freeze the rest of AP 3) add suffix → new AP l u c e n e i d r f e
  26. 26. FS(A|T)s in (Lucene|Solr)
  27. 27. Automata in Lucene|Solr org.apache.lucene.util.automaton.* partial port of brics, FuzzyQuery, AutomatonTermsEnum org.apache.lucene.util.automaton.fst.FST FSA and FSTs from sorted data, suggester, indexes
  28. 28. org.apache.lucene.util.automaton.fst.* FSA representation Arc-based, not state-based Moore vs. Mealy. Compact vs. intuitive Input: abc, bd, bde. a b c a b c b d b e d e d
  29. 29. org.apache.lucene.util.automaton.fst.* FSA representation a b c Arc-based, not state-based s3 s2 s1 Moore vs. Mealy. Compact vs. intuitive b d e s4 s5 Next-state chaining requires unusual tricks during construction s1 s2 s5 s4 s3 cFL bL eFL dL a bL s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  30. 30. org.apache.lucene.util.automaton.fst.* FSA representation a b c Arc-based, not state-based s3 s2 s1 Moore vs. Mealy. Compact vs. intuitive b d e s4 s5 Next-state chaining requires unusual tricks during construction s1 s2 s5 s4 s3 Everything in a byte[] cFL bL eFL dL a bL traversals-ready, memory-efficient s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  31. 31. org.apache.lucene.util.automaton.fst.* FSA representation a b c Arc-based, not state-based s3 s2 s1 Moore vs. Mealy. Compact vs. intuitive b d e s4 s5 Next-state chaining requires unusual tricks during construction s1 s2 s5 s4 s3 Everything in a byte[] cFL bL eFL dL a bL traversals-ready, memory-efficient Dual transition storage format lookup: bsearch or linear scan s1 s2 s5 s4 s3 cFL bL eFL dL bLN a
  32. 32. Input size Compressed size (MB) Input MB Terms Lucene morf. gzip Wikipedia t.index 481 38 092 045 258 164 149 Polish in . 162 3 672 200 3.1 1.7 15.4 .
  33. 33. Use Cases: Solr's Autocomplete
  34. 34. Solr's Suggesters Design choices sort order (alpha, score), pre x vs. spelling, boost exact matches? Weights term→weight, lookup(term, onlyMorePopular) org.apache.solr.spelling.suggest.Lookup JaspellLookup, TSTLookup, FSTLookup
  35. 35. flour|3 four|4 fourier|3 furious|2 . . Take 1 . .
  36. 36. flour|3 four|4 →fou* fourier|3 furious|2 o u l i e r | f o u r | 3 u 4 r 2 i o u s | Find pre x. Depth-in traversal for completions. PQ on score|alpha . . Take 1 . .
  37. 37. 2furious 3flour 3fourier 4four . . Take 2 . .
  38. 38. 2furious 3flour →fou* 3fourier 4four u f r i o u 2 s l 3 f u r i e r o 4 u f o From score roots, until N collected. Find pre x. Depth-in traversal for completions, stop if N collected. Find/boost exact match. . . Take 2 . .
  39. 39. 2furious 5urious|furious 5rious|furious 5ious|furious 5ous|furious 5us|furious 5s|furious 3flour … . Take 3 (in xes) . . .
  40. 40. 2 i o o u i r s r s | f 5 u s u 4 o u 6 r u r r | . i o u s f 7 u r o l f o i e 3 l u r e f o | f r | r i l o u e i o | | r i u r u .
  41. 41. Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length.
  42. 42. Constant time lookups! Regardless of the terms dictionary size. Regardless of pre x length. Exact matches only. Static snapshot (not incremental). Discretized weights.
  43. 43. Top50KWiki.utf8, 676 KB, 50 000 terms Jaspell TST FST . RAM [B] . 7 869 415 . 7 914 524 300 .175 queries per second,. . . tpq . PREFIX [100-200] . 458 . 966 . 742 . PREFIX [6-9] . 330 . 228 . 659 . PREFIX [2-4] . 126 . 29 . . 501
  44. 44. Summary
  45. 45. Summary and Conclusions Automata compact, powerful, efficient data structure Lucene/Solr bene ts behind the scenes, but spreading: index, queries, suggesters API in Lucene …is shaped right now, still @experimental
  46. 46. Acknowledgement Michael McCandless Robert Muir committer: .+
  47. 47. dawid.weiss@carrotsearch.com

×