Fast Generation of Result Snippets in Web Search


                    Franco Sánchez Huertas
                            (UCSP)



                      EDA – June, 2010

21/06/2010                 UCSP -FASH        1
Overview

•   What are snippets?
•   Research questions
•   Rationale
•   Baseline
•   Compressed Token System
•   Document caching for snippet generation
•   Sentence reordering
•   Conclusion




What are snippets?




Research question




        Which fast strategies can we use to
         generate snippets for web search
                       results?




Rationale

•      Two main reasons:
      – Snippet extraction is an integral part of the query
         evaluation process, and speeding it up will reduce the
         overall time (and resources) required to process a
         query
      – No prior literature discusses how to generate snippets
         efficiently




Snippet generation pipeline (example query: "sigir")

• Search: identify relevant documents
• Strip sentences from each document into a bag of sentences
• Collect stats on sentences
• Sentence ranker
• Pick 2-3 sentences per document … generate result page
Sentence ranking
• A priori (without queries): score ai
      – sentence position (titles, leading sentences)
      – term/sentence significance

• Query time (with queries)
      – query term count ci
      – unique query term count ui
      – query term proximity li

• Using all the above features, sentence i can be
  ranked using some function f(ci, ui, li, ai)
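
The ranking features above can be sketched in a few lines of Python. This is only an illustration: the slides do not specify f, so the linear weighting in `score` is hypothetical, and interpreting proximity li as the longest run of adjacent query terms is an assumption. All function names are made up for this example.

```python
import re

def sentence_features(sentence, query_terms):
    """Return (c, u, l) for one sentence.

    c = total number of query-term occurrences
    u = number of distinct query terms present
    l = longest run of adjacent query terms (an assumed proximity measure)
    """
    words = re.findall(r"\w+", sentence.lower())
    terms = {t.lower() for t in query_terms}
    c = sum(1 for w in words if w in terms)
    u = len({w for w in words if w in terms})
    longest = run = 0
    for w in words:
        run = run + 1 if w in terms else 0
        longest = max(longest, run)
    return c, u, longest

def score(c, u, l, a, weights=(1.0, 2.0, 1.5, 0.5)):
    # Hypothetical linear form of f(ci, ui, li, ai); weights are arbitrary.
    wc, wu, wl, wa = weights
    return wc * c + wu * u + wl * l + wa * a

def rank_sentences(sentences, query_terms, a_priori):
    """Return sentence indices ordered from best to worst score."""
    scored = []
    for i, s in enumerate(sentences):
        c, u, l = sentence_features(s, query_terms)
        scored.append((score(c, u, l, a_priori[i]), i))
    # Highest score first; ties keep document order.
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [i for _, i in scored]
```

The a priori scores are passed in precomputed because, per the slide, they do not depend on the query and could be calculated once at indexing time.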

Baseline
Indexing time
      - strip out HTML
      - add EOS (end-of-sentence) marker after each sentence
      - compress using gzip
Query time
      - decompress each document in the results list
      - match query terms with a string matcher (full word matching)
      - rank sentences with f(ci, ui, li, ai) and return two/three sentences
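
A minimal sketch of the baseline, assuming Python's standard `gzip` and `html.parser` modules stand in for whatever tooling was actually used. The sentence splitter (split after `.`, `!` or `?`) and the helper names are assumptions for illustration; ranking with f is left out.

```python
import gzip
import re
from html.parser import HTMLParser

EOS = "<eos>"  # end-of-sentence marker, as on the slide

class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags (script/style bodies are not filtered)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def index_document(html):
    """Indexing time: strip HTML, add EOS markers, compress with gzip."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    # Naive sentence boundary detection: split after ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    marked = "".join(s + EOS for s in sentences)
    return gzip.compress(marked.encode("utf-8"))

def sentences_matching(blob, query_terms):
    """Query time: decompress, then keep sentences containing a full query word."""
    text = gzip.decompress(blob).decode("utf-8")
    sentences = [s for s in text.split(EOS) if s]
    terms = {t.lower() for t in query_terms}
    return [s for s in sentences
            if terms & set(re.findall(r"\w+", s.lower()))]
```

Note that the whole document has to be decompressed and scanned as text on every request, which is the cost the Compressed Token System tries to reduce.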
Compressed Token System (CTS)
Indexing time
                •    Strip out HTML
                •    Pass 1: build the lexicon (the → 1, of → 2, in → 3, …)
                •    Pass 2: encode all documents into a single file
                •    Terms in lexicon are replaced with an integer
                •    Those not in lexicon are spelt out as follows
                          ESC-length-word
                          ESC-7-britney
                •    Mark end of each sentence
                •    Compress using integer compression scheme (vbyte)
                •    An offset map is required to tell where a document starts
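
The encoding steps can be sketched as follows. Several details are assumptions, since the slides do not give them: the reserved code 0 for ESC, the use of `len(lexicon) + 1` as the end-of-sentence code, and the particular vbyte variant (7 data bits per byte, low bits first, high bit set on the last byte). Function names are hypothetical.

```python
ESC = 0  # assumed reserved code announcing an out-of-lexicon word

def vbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte, high bit on the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data, pos):
    """Decode one integer starting at pos; return (value, next position)."""
    n = shift = 0
    while True:
        b = data[pos]
        pos += 1
        if b & 0x80:
            return n | ((b & 0x7F) << shift), pos
        n |= b << shift
        shift += 7

def encode_document(sentences, lexicon):
    """sentences: list of word lists; lexicon: word -> integer code (1-based)."""
    eos = len(lexicon) + 1  # assumed end-of-sentence code
    out = bytearray()
    for sentence in sentences:
        for word in sentence:
            code = lexicon.get(word)
            if code is not None:
                out += vbyte_encode(code)
            else:
                raw = word.encode("utf-8")  # ESC-length-word
                out += vbyte_encode(ESC) + vbyte_encode(len(raw)) + raw
        out += vbyte_encode(eos)
    return bytes(out)

def decode_document(data, lexicon):
    """Inverse of encode_document."""
    words = {code: word for word, code in lexicon.items()}
    eos = len(lexicon) + 1
    sentences, current, pos = [], [], 0
    while pos < len(data):
        code, pos = vbyte_decode(data, pos)
        if code == ESC:
            length, pos = vbyte_decode(data, pos)
            current.append(data[pos:pos + length].decode("utf-8"))
            pos += length
        elif code == eos:
            sentences.append(current)
            current = []
        else:
            current.append(words[code])
    return sentences

def build_collection(docs, lexicon):
    """Concatenate encoded documents into one blob plus an offset map."""
    blob, offsets = bytearray(), []
    for doc in docs:
        offsets.append(len(blob))
        blob += encode_document(doc, lexicon)
    return bytes(blob), offsets
```

With this layout, a document is fetched by slicing the single file between two consecutive entries of the offset map.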
Compressed Token System (CTS)
Query time
      - inputs: compressed collection, offset mapping, vocabulary, results list, query
      - convert the query to integer tokens using the vocabulary (e.g. 1 33 57)
      - fetch each document in the results list as an integer document (e.g. 12 1 1 98 33 / 57 98)
      - match with an integer + ESC sequence matcher (use compressed integer matching)
      - rank sentences with f(ci, ui, li, ai)
      - convert the two/three selected sentences back to text
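
A rough sketch of the query-time flow, working on documents that have already been decoded into integer sentences. The vocabulary entries for codes 12, 33, 57 and 98 are invented for illustration (the slides only show the codes), out-of-vocabulary words are represented here as plain strings in place of ESC sequences, and the match count stands in for the full f(ci, ui, li, ai).

```python
def query_to_tokens(query, vocabulary):
    """Map query words to integer codes; unknown words stay as strings."""
    return [vocabulary.get(w, w) for w in query.lower().split()]

def count_matches(int_sentence, query_tokens):
    """Count tokens in the sentence that equal any query token."""
    wanted = set(query_tokens)
    return sum(1 for tok in int_sentence if tok in wanted)

def best_sentences(int_doc, query_tokens, k=2):
    """Pick the k sentences with the most matches; ties keep document order."""
    order = sorted(
        range(len(int_doc)),
        key=lambda i: (-count_matches(int_doc[i], query_tokens), i),
    )
    return [int_doc[i] for i in order[:k]]

def to_text(int_sentence, vocabulary):
    """Convert a selected sentence back to text."""
    words = {code: word for word, code in vocabulary.items()}
    return " ".join(words[t] if isinstance(t, int) else t for t in int_sentence)
```

Only the sentences that end up in the snippet are converted back to text; all matching is done by comparing integers.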
Data set and results
      - We used the TREC WT10g and WT100g collections.
      - WT50g is a 50 GB collection randomly sampled from WT100g


[Figure: Compression effectiveness. Compressed size as a percentage of full collection size (axis 0 to 30) for Baseline and CTS on WT10g, WT50g, and WT100g.]
Efficiency comparison
- Generated snippets for 10,000 queries using CTS and baseline (top 10 docs)
- Measured snippet generation time for each document
- A caching effect was noticed for the first few queries, so we take the average of the
last 7,000 queries

[Figure: time per document (msec) for Baseline and CTS, split into disk access (seek) and in-memory processing; axis marks at 4.5, 7, and 16 msec.]

So can we get away with performing no disk access?
Document caching
  - With CTS, avg doc size: 5.7 KB → 1.2 KB (compression + no HTML)
  - WT100g has around 18.6 million documents

  - The snippet machine has 4 GB of RAM
        - 1 GB is used by the lexicon and the document offset mapping
        - 2-3 GB can be dedicated to caching
  - In theory, with WT100g we should be able to cache over 250k docs in
  memory
        - This is 5-7% of the collection size

  - But, in reality, how much disk access would we actually save?

  - We simulate this by caching the top 20 documents for > 500 k queries
      - Simulation allows us to control memory usage and get exact hit and miss
      counts
Caching simulation
  - We processed > 500 k queries and cached the top 20 documents for
  each query

  - The score of documents is half Okapi BM25 score and half query
  independent score (similar effect as PageRank)

  - We used two cache eviction policies:
      - Static – once the cache is populated and full, no entries are evicted
      - LRU (least recently used) – once the cache is full, documents are
      evicted based on the recency of their access

  - What is Q ?
      - Search engines cache results of most popular queries
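
The two eviction policies can be simulated with a short sketch like the one below. It assumes cache capacity is counted in documents (the slides discuss memory size instead) and that the request stream is simply a list of document ids; class and function names are hypothetical.

```python
from collections import OrderedDict

class StaticCache:
    """Fills up once; after that no entries are evicted or added."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.docs = set()

    def access(self, doc_id):
        if doc_id in self.docs:
            return True           # hit
        if len(self.docs) < self.capacity:
            self.docs.add(doc_id)  # still populating
        return False              # miss

class LRUCache:
    """Evicts the least recently used document when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.docs = OrderedDict()

    def access(self, doc_id):
        if doc_id in self.docs:
            self.docs.move_to_end(doc_id)   # mark as most recently used
            return True
        self.docs[doc_id] = None
        if len(self.docs) > self.capacity:
            self.docs.popitem(last=False)   # drop the oldest entry
        return False

def hit_count(cache, requests):
    """Replay a stream of document requests and count cache hits."""
    return sum(1 for doc_id in requests if cache.access(doc_id))
```

Replaying the same request stream through both classes gives exact hit and miss counts for each policy at a chosen capacity.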




Caching simulation (results)
[Figure: % of doc requests that hit cache, plotted against cache size (% of collection, WT100g).]

         • A cache of 1% of the collection yields 80% hits, and caching 3%
         accounts for more than 97% of hits
Caching simulation

[Figure: time per document (msec) for Baseline, CTS, and CTS + caching, split into seek and process; axis marks at 3.4, 7, and 16 msec.]


How can we further enhance doc cache performance?
      – Smaller cache entries mean more documents can fit in cache, so do we need to
        keep entire documents in cache? Perhaps not


Sentence reordering

             Captain Feathersword is the friendliest
             Pirate on the open seas. He loves a good
             party, and making people giggle. It's
             lucky that he has a feather for a sword,
             which he can use to tickle everyone.
             Captain Feathersword's Pirate Ship is
             called "The Good Ship Feathersword",
             and he loves to cook, dance and sing
             with his crew on his ship.
             You'll hear Captain Feathersword on his
             ship or on dry land, saying a big "Ahoy
             there me hearties!"




                                                                Blue terms = significant
                                         Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
Sentence reordering
             4
              Captain Feathersword is the friendliest
             Pirate on the open seas.
             5
              He loves a good party, and making
             people giggle.
             2
              It's lucky that he has a feather for a
             sword, which he can use to tickle
             everyone.
             1
              Captain Feathersword's Pirate Ship is
             called "The Good Ship Feathersword",
             and he loves to cook, dance and sing
             with his crew on his ship.
             3
              You'll hear Captain Feathersword on his
             ship or on dry land, saying a big "Ahoy
             there me hearties!"

                                                                 Blue terms = significant
Sentence reordering
             1
              Captain Feathersword's Pirate Ship is
             called "The Good Ship Feathersword",
             and he loves to cook, dance and sing
             with his crew on his ship.
             2
              It's lucky that he has a feather for a
             sword, which he can use to tickle
             everyone.
             3
               You'll hear Captain Feathersword on his
             ship or on dry land, saying a big "Ahoy
             there me hearties!"
             4
              Captain Feathersword is the friendliest
             Pirate on the open seas.
             5
              He loves a good party, and making
             people giggle.
                                                                  Blue terms = significant
Sentence reordering
             Keep in cache (sentences 1-3):
             1
              Captain Feathersword's Pirate Ship is
             called "The Good Ship Feathersword",
             and he loves to cook, dance and sing
             with his crew on his ship.
             2
              It's lucky that he has a feather for a
             sword, which he can use to tickle
             everyone.
             3
              You'll hear Captain Feathersword on his
             ship or on dry land, saying a big "Ahoy
             there me hearties!"

             4
              Captain Feathersword is the friendliest
             Pirate on the open seas.
             5
              He loves a good party, and making
             people giggle.

                                                                 Blue terms = significant
Sentence reordering methods
• Significant sentence (ST): sentences that contain more significant terms
  are ranked higher
• Natural order: the original order of sentences in a document
• Query log (Qlt): sentences that contain previously queried terms are
  ranked higher
• Query log (Qlu): same as Qlt, but a query term that repeats in a
  sentence is counted only once.

• But where do we draw the cut-off line?
      – A trade-off between efficiency gains (more documents in cache) and
        effectiveness loss
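
The reordering methods and the cut-off can be sketched as follows. This is an illustration under stated assumptions, not the scoring used in the original experiments: the slides do not give the exact formulas, so unweighted term counts are assumed here, and the sentences, the `QUERIED` term set and all function names are hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def count_all(sentence, vocab):
    """Qlt-style score (assumed): every occurrence of a vocabulary term counts."""
    return sum(1 for term in tokenize(sentence) if term in vocab)


def count_unique(sentence, vocab):
    """Qlu-style score (assumed): a repeated vocabulary term counts only once."""
    return len(set(tokenize(sentence)) & set(vocab))


def reorder(sentences, score):
    """Sort by descending score; the sort is stable, so ties keep natural order."""
    return sorted(sentences, key=lambda s: -score(s))


def cache_entry(sentences, score, keep):
    """Reorder the sentences and keep only the first `keep` of them in the cache."""
    return reorder(sentences, score)[:keep]


# Toy document and query-log vocabulary (both hypothetical).
SENTENCES = [
    "The ship sails at dawn.",
    "Pirate ship, pirate crew, pirate song.",
    "A feather sword on the ship can tickle a pirate.",
]
QUERIED = {"pirate", "ship", "sword"}

qlt_order = reorder(SENTENCES, lambda s: count_all(s, QUERIED))
qlu_order = reorder(SENTENCES, lambda s: count_unique(s, QUERIED))
trimmed = cache_entry(SENTENCES, lambda s: count_unique(s, QUERIED), keep=2)
```

In this toy case the two query-log variants disagree on the top sentence: the repeated "pirate" lifts the second sentence under Qlt, while Qlu prefers the third sentence because it covers more distinct queried terms. The `keep` parameter is the cut-off: a smaller value lets more documents fit in the cache, at the risk of dropping a sentence a later query would have needed.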




Conclusion
   •   We proposed a practical document storage system for snippet
       extraction (CTS)

   •   Compared to the baseline, CTS halves the in-memory processing
       time needed to generate a snippet

   •   Using a document cache, we have shown that 80% of seeks can
       be averted by caching only 1% of the collection

   •   Document caching can be further enhanced by retaining only the
       important parts of a document through sentence reordering




Questions




Snipets by FrancoSH

  • 1. Fast Generation of Result xxxx Search Snippets in Web Franco Sánchez Huertas (UCSP) EDA – June, 2010 21/06/2010 UCSP -FASH 1
  • 2. Overview • What are snippets? • Research questions • Rationale • Baseline • Compressed Token System • Document caching for snippet generation • Sentence reordering • Conclusion 21/06/2010 UCSP -FASH 2
  • 5. Research question Which fast strategies can we use to generate snippets for web search results? 21/06/2010 UCSP -FASH 5
  • 6. Rationale • Two main reasons: – Snippet extraction is an integral part of the query evaluation process and speeding it will reduce the overall time (and resources) required to process a query 21/06/2010 UCSP -FASH 6
  • 7. Rationale • Two main reasons: – Snippet extraction is an integral part of the query evaluation process and speeding it will reduce the overall time (and resources) required to process a query – No prior literature exists which discusses how to efficiently generate snippets 21/06/2010 UCSP -FASH 7
  • 8. sigir Search 21/06/2010 UCSP -FASH 8
  • 9. sigir Search Identify relevant documents 21/06/2010 UCSP -FASH 9
  • 10. sigir Search Identify relevant documents Strip sentences Bag of sentences 21/06/2010 UCSP -FASH 10
  • 11. sigir Search Identify relevant documents Strip sentences Collect stats on sentences Sentence ranker Bag of sentences 21/06/2010 UCSP -FASH 11
  • 12. sigir Search Identify relevant documents Strip sentences Pick 2-3 sentences per document … generate result page Collect stats on sentences Sentence ranker Bag of sentences 21/06/2010 UCSP -FASH 12
  • 13. Sentence ranking • a-priori (without queries) ai – sentence position (titles, leading sentences) – term/sentence significance • Query time (with queries) – query terms count ci – unique query term count ui – query term proximity li • Using all the above features, sentence i can be ranked using some function f(ci, ui, li, ai) 21/06/2010 UCSP -FASH 13
  • 14. Baseline Indexing time <html> <body> text text texttext 0100101001 0100101001 0100101001 texttext 0010101101 text <br /> text text <eos> compress using 0010101101 0010101101 -strip out HTML text text text <eos> gzip -add EOS marker </body> </html> 21/06/2010 UCSP -FASH 14
  • 15. Baseline Indexing time <html> <body> text text texttext 0100101001 0100101001 0100101001 texttext 0010101101 text <br /> text text <eos> compress using 0010101101 0010101101 -strip out HTML text text text <eos> gzip -add EOS marker </body> </html> Query time results list query 21/06/2010 UCSP -FASH 15
  • 16. Baseline Indexing time <html> <body> text text texttext 0100101001 0100101001 0100101001 texttext 0010101101 text <br /> text text <eos> compress using 0010101101 0010101101 -strip out HTML text text text <eos> gzip -add EOS marker </body> </html> decompress Query time text text <eos> results list text text <eos> query 21/06/2010 UCSP -FASH 16
  • 17. Baseline Indexing time <html> <body> text text texttext 0100101001 0100101001 0100101001 texttext 0010101101 text <br /> text text <eos> compress using 0010101101 0010101101 -strip out HTML text text text <eos> gzip -add EOS marker </body> </html> decompress Query time text text <eos> results list text text <eos> two/three f(ci, ui, li, ai) string matcher sentences (full word matching) query 21/06/2010 UCSP -FASH 17
  • 18. Compressed Token System (CTS) Indexing time <html> <body> text text texttext texttext text <br /> text text <eos> strip out HTML text text text <eos> </body> </html> pass 1 the 1 the 1 of 2 of 2 in 3 in 3 … 21/06/2010 UCSP -FASH 18
  • 19. Compressed Token System (CTS) Indexing time <html> <body> text text texttext texttext text <br /> text text <eos> strip out HTML text text text <eos> </body> </html> pass 1 pass 2 single file the 1 of 2 in … 3 + 21/06/2010 UCSP -FASH 19
  • 20. Compressed Token System (CTS) Indexing time <html> <body> text text texttext texttext text <br /> text text <eos> strip out HTML text text text <eos> </body> </html> pass 1 pass 2 single file the 1 of 2 in … 3 + offset map is required to tell where a document starts 21/06/2010 UCSP -FASH 20
  • 21. Compressed Token System (CTS) Indexing time <html> <body> text text texttext texttext text <br /> text text <eos> strip out HTML text text text <eos> </body> </html> pass 1 pass 2 single file the 1 of 2 in … 3 + offset map is • Terms in lexicon are replaced with an integer required to tell • Those not in lexicon are spelt out as follows where ESC-length-word a document starts ESC-7-britney • Mark end of each sentence • Compress using integer compression scheme (vbyte) 21/06/2010 UCSP -FASH 21
  • 22. Compressed Token System (CTS) Indexing time <html> <body> text text texttext texttext text <br /> text text <eos> strip out HTML text text text <eos> </body> </html> pass 1 pass 2 single file the 1 of 2 in … 3 + offset map is • Terms in lexicon are replaced with an integer required to tell • Those not in lexicon are spelt out as follows where ESC-length-word a document starts ESC-7-britney • Mark end of each sentence • Compress using integer compression scheme (vbyte) 21/06/2010 UCSP -FASH 22
  • 23. Compressed Token System (CTS) compressed vocabulary collection Query time the 1 offset of 2 in 3 mapping … 21/06/2010 UCSP -FASH 23
  • 24. Compressed Token System (CTS) compressed vocabulary collection Query time the 1 offset of 2 query in 3 mapping … 21/06/2010 UCSP -FASH 24
  • 25. Compressed Token System (CTS) compressed vocabulary collection Query time the 1 offset of 2 query in 3 mapping … convert to integer tokens 1 33 57 21/06/2010 UCSP -FASH 25
  • 26. Compressed Token System (CTS) compressed vocabulary collection Query time the 1 offset results list of 2 query in 3 mapping … convert to integer tokens 1 33 57 21/06/2010 UCSP -FASH 26
  • 27. Compressed Token System (CTS) compressed vocabulary collection Query time the 1 offset results list of 2 query in 3 mapping … convert to integer tokens text text texttext texttext 1 33 57 text12 1 1 98 33 text 57 98 integer documents 21/06/2010 UCSP -FASH 27
  • 28. Compressed Token System (CTS) compressed vocabulary collection Query time the 1 offset results list of 2 query in 3 mapping … convert to integer tokens text text texttext texttext 1 33 57 text12 1 1 98 33 text 57 98 convert back to text integer documents integer + ESC sequence f(ci, ui, li, ai) two/three matcher sentences 21/06/2010 UCSP -FASH 28
  • 29. Compressed Token System (CTS) compressed vocabulary collection Query time the 1 offset results list of 2 query in 3 mapping … convert to integer tokens text text texttext texttext 1 33 57 text12 1 1 98 33 text 57 98 convert back to text integer documents integer + ESC sequence f(ci, ui, li, ai) two/three matcher sentences - use compressed integer matching 21/06/2010 UCSP -FASH 29
  • 30. Data set and results - We used TREC WT10g and WT100g and collections. - WT50g is a 50 GB collection randomly sampled from WT100g 30 Percentage of full collection size Baseline 25 CTS 20 15 10 5 0 WT10g WT50g WT100g Compression effectiveness 21/06/2010 UCSP -FASH 30
  • 31. Efficiency comparison - Generated snippets for 10,000 queries using CTS and baseline (top 10 docs) - Measured snippet generation time for each document - First few queries caching effect was noticed, so we take the average of the last 7000 queries Baseline CTS 7 16 Time (msec) 21/06/2010 UCSP -FASH 31
  • 32. Efficiency comparison - Generated snippets for 10,000 queries using CTS and baseline (top 10 docs) - Measured snippet generation time for each document - First few queries caching effect was noticed, so we take the average of the last 7000 queries Disk access In-memory processing Baseline Seek Processing CTS 4.5 7 16 Time (msec) 21/06/2010 UCSP -FASH 32
  • 33. Efficiency comparison - Generated snippets for 10,000 queries using CTS and baseline (top 10 docs) - Measured snippet generation time for each document - First few queries caching effect was noticed, so we take the average of the last 7000 queries Disk access In-memory processing Baseline Seek Processing CTS 4.5 7 16 Time (msec) So can we get away with performing no disk access? 21/06/2010 UCSP -FASH 33
  • 34. Document caching - With CTS avg doc size: 5.7 KB → 1.2 KB (compression + no HTML) - WT100g has around 18.6 million documents 21/06/2010 UCSP -FASH 34
  • 35. Document caching - With CTS avg doc size: 5.7 KB → 1.2 KB (compression + no HTML) - WT100g has around 18.6 million documents - Snippet machine 4GB of RAM, - 1 GB is used by lexicon and document offset mapping - 2-3 GB can be dedicated for caching - In theory, using WT100g should be able to cache over 250k docs in memory - This is 5-7% of the collection size 21/06/2010 UCSP -FASH 35
  • 36. Document caching - With CTS avg doc size: 5.7 KB → 1.2 KB (compression + no HTML) - WT100g has around 18.6 million documents - Snippet machine 4GB of RAM, - 1 GB is used by lexicon and document offset mapping - 2-3 GB can be dedicated for caching - In theory, using WT100g should be able to cache over 250k docs in memory - This is 5-7% of the collection size - But, in reality, how much disk access would we actually save? - We simulate this by caching the top 20 documents for > 500 k queries - Simulation allows us to control memory usage and exact hit and miss counts 21/06/2010 UCSP -FASH 36
  • 37. Caching simulation
      - We processed more than 500k queries and cached the top 20 documents for each query
      - Each document's score is half Okapi BM25 score and half a query-independent score (similar in effect to PageRank)
      - We used two cache eviction policies:
          - Static: once the cache is populated and full, no entries are evicted
          - LRU (least recently used): once the cache is full, documents are evicted based on the recency of their access
      - What is Q?
      - Search engines cache the results of the most popular queries
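The two eviction policies on this slide can be compared with a small simulation. The sketch below is illustrative only: the function names, the first-come population of the static cache, and the request stream are assumptions, not the setup used in the actual experiments.

```python
from collections import OrderedDict


def simulate_static(requests, capacity):
    """Static policy: admit documents until the cache is full, then never evict.

    Assumption: the cache is populated in first-come order.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache, hits = set(), 0
    for doc_id in requests:
        if doc_id in cache:
            hits += 1
        elif len(cache) < capacity:
            cache.add(doc_id)
    return hits


def simulate_lru(requests, capacity):
    """LRU policy: on a miss with a full cache, evict the least recently used doc."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache, hits = OrderedDict(), 0
    for doc_id in requests:
        if doc_id in cache:
            hits += 1
            cache.move_to_end(doc_id)      # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used entry
            cache[doc_id] = True
    return hits


# Hypothetical stream of document requests (document IDs), cache of 2 docs.
requests = [1, 2, 3, 1, 2, 4, 4, 4, 1]
print("static hits:", simulate_static(requests, 2))  # 3
print("LRU hits:   ", simulate_lru(requests, 2))     # 2
```

Counting hits this way gives exact hit and miss counts for a chosen cache capacity, which is the quantity the simulation on the slide is after.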
  • 38. Caching simulation (results)
      - [Plot: % of document requests that hit the cache vs. cache size (% of collection, WT100G)]
      • A cache holding 1% of the collection yields 80% hits, and caching 3% accounts for more than 97% of hits
  • 39. Caching simulation
      - [Bar chart: time (msec) for Baseline, CTS and CTS + caching, split into seek and processing; axis marks at 3.4, 7 and 16]
      - How can we further enhance document cache performance?
          – Smaller cache entries mean more documents fit in the cache, so do we need to keep entire documents in the cache? Perhaps not
  • 40. Sentence reordering
      Captain Feathersword is the friendliest Pirate on the open seas. He loves a good party, and making people giggle. It's lucky that he has a feather for a sword, which he can use to tickle everyone. Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship. You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
      Blue terms = significant
      Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
  • 41. Sentence reordering (sentences in natural order, labelled with their rank)
      (rank 4) Captain Feathersword is the friendliest Pirate on the open seas.
      (rank 5) He loves a good party, and making people giggle.
      (rank 2) It's lucky that he has a feather for a sword, which he can use to tickle everyone.
      (rank 1) Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship.
      (rank 3) You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
      Blue terms = significant
      Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
  • 43. Sentence reordering (sentences in ranked order)
      1 Captain Feathersword's Pirate Ship is called "The Good Ship Feathersword", and he loves to cook, dance and sing with his crew on his ship.
      2 It's lucky that he has a feather for a sword, which he can use to tickle everyone.
      [Keep in cache]
      3 You'll hear Captain Feathersword on his ship or on dry land, saying a big "Ahoy there me hearties!"
      4 Captain Feathersword is the friendliest Pirate on the open seas.
      5 He loves a good party, and making people giggle.
      Blue terms = significant
      Source: http://www.nickjr.co.uk/shows/wiggles/feathersword.aspx
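The reorder-then-truncate idea shown on this slide can be sketched in a few lines. Everything here is an assumption made for illustration: the scoring rule (count of significant-term occurrences), the tie-break by original position, the function name, and the set of significant terms, since the blue highlighting from the slide is not recoverable from the transcript.

```python
import re


def reorder_by_significance(sentences, significant_terms, keep):
    """Rank sentences by significant-term occurrences and keep the top `keep`.

    Ties are broken by original position. The scoring rule is an assumption,
    not necessarily the one used in the talk.
    """
    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(1 for word in words if word in significant_terms)

    order = sorted(range(len(sentences)),
                   key=lambda i: (-score(sentences[i]), i))
    return [sentences[i] for i in order[:keep]]


sentences = [
    "Captain Feathersword is the friendliest Pirate on the open seas.",
    "He loves a good party, and making people giggle.",
    "Captain Feathersword's Pirate Ship is called The Good Ship Feathersword.",
]
# Hypothetical significant terms chosen for this example.
significant = {"feathersword", "pirate", "ship"}

for sentence in reorder_by_significance(sentences, significant, keep=2):
    print(sentence)
```

Only the sentences returned by the function would be kept in the cache entry; the rest of the document would still have to be read from disk if a query needs them.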
  • 46. Sentence reordering methods
      • Significant sentence (ST): sentences that contain significant terms are preferred
      • Natural order: the original order of sentences in a document
      • Query log (Qlt): sentences that contain previously queried terms
      • Query log (Qlu): same as Qlt, but a query term repeated within a sentence is counted only once
      • But where do we draw the cut-off line?
          – A trade-off between efficiency gains (more documents in cache) and effectiveness loss
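The difference between the two query-log variants can be illustrated with a minimal sketch. It follows one reading of the slide (Qlt counts every occurrence of a previously queried term, Qlu counts each such term once per sentence); the function names and the sample data are hypothetical.

```python
def qlt_score(sentence_terms, query_log_terms):
    """Qlt (as read from the slide): count every occurrence of a logged query term."""
    return sum(1 for term in sentence_terms if term in query_log_terms)


def qlu_score(sentence_terms, query_log_terms):
    """Qlu (as read from the slide): count each logged query term at most once."""
    return len(set(sentence_terms) & set(query_log_terms))


# Hypothetical tokenised sentence and query-log vocabulary.
sentence = ["ship", "pirate", "ship", "crew"]
query_log = {"ship", "pirate"}

print(qlt_score(sentence, query_log))  # 3: "ship" is counted twice
print(qlu_score(sentence, query_log))  # 2: "ship" is counted once
```

Under either score, sentences would be sorted in descending order and the cut-off would decide how many of them stay in the cache entry.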
  • 48. Conclusion
      • We proposed a practical document storage scheme for snippet extraction, the Compressed Token System (CTS)
      • Compared with the baseline, CTS halves the in-memory processing time needed to generate a snippet
      • Using a document cache, we showed that 80% of seeks can also be avoided by caching only 1% of the collection
      • Document caching can be further enhanced by retaining only the important parts of each document through sentence reordering
  • 49. Questions