Actions Speak Louder than Words:
Analyzing large-scale query logs to improve the research experience

Ted.Diamond
Susan.Price
Raman.Chandrasekar
Code4Lib 2013
  
Overview

•  Summon®
•  The Relevance Metrics Framework (RMF)
    •  Helping us go from user actions to metrics to a better user experience
    •  Goals
    •  Data flow in RMF
    •  Query Sessions
    •  From Logs to Statistics
    •  Metrics Computed
    •  Challenges in RMF
•  Insights from RMF
•  Summary
  
The Summon® Discovery Service

•  Hosted/Software as a Service
•  Match & Merge combines rich metadata and full text from multiple sources
•  Single unified index
•  1 billion+ items
•  > 500 customers
•  Relevancy in > 17 languages
  
The Summon Relevance Metrics Framework (RMF)
  
RMF Goals

•  Observe and log user actions
    •  Queries, types of queries
    •  Features used (e.g. filters, advanced search)
    •  Click patterns
•  Compute quality of search results
    •  Metrics from user behavior, such as clicks
•  Analyze data to improve search results and enhance the research experience
  
Data Acquired in RMF

•  Queries
    •  Query terms
    •  Filters
    •  Advanced syntax
•  Clicks

We collect all queries and clicks, not just a sample → large logs.
Sampling will not cover the long tail of queries.

Sample queries
    •  anniversary 9/11
    •  kidney stone
    •  moore vs mack
    •  moore vs mack trucks
    •  Armee Deutschland Rolle in Einheit
    •  body art in the workplace
    •  the moplah rebellion of 1921 john j. banning
    •  平凡的世界
    •  9780140027631
    •  “boundary dispute” Title:(india)
    •  SubjectTerms:"孙少平”
    •  TitleCombined: (Analytical Biochemistry) s.fvf(ContentType, Book Review, t)
    •  Hawthorne, Mark. "The Tale of Tattoos: The history and culture of body art in India and abroad.”
  
        	
  
Key RMF concept: Session

•  “Finding” often spans multiple queries.
•  Users add/remove filters; change search terms
•  We define Search sessions
•  Sequence of events with same session Id, with:
    •  No breaks > 90 minutes
    •  Total elapsed time <= 8 hours
    •  Possibly spanning day boundary
•  Data grouped by session, sorted by time

Different level of abstraction, more robust.
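
The session rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RMF implementation: the event shape `(session_id, timestamp)` and the function name `split_into_sessions` are hypothetical, and the choice to start a new session when either limit is exceeded is one plausible reading of the slide.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=90)    # "No breaks > 90 minutes"
MAX_DURATION = timedelta(hours=8)  # "Total elapsed time <= 8 hours"


def split_into_sessions(events):
    """Split (session_id, timestamp) events into search sessions.

    Events sharing a session Id are sorted by time. A new session starts
    when the gap to the previous event exceeds MAX_GAP, or when the event
    would push the total elapsed time past MAX_DURATION. Timestamps are
    compared directly, so a session may span a day boundary.
    """
    by_id = {}
    for session_id, ts in events:
        by_id.setdefault(session_id, []).append(ts)

    sessions = []
    for session_id, stamps in by_id.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > MAX_GAP or ts - current[0] > MAX_DURATION:
                sessions.append((session_id, current))
                current = [ts]
            else:
                current.append(ts)
        sessions.append((session_id, current))
    return sessions
```

With events at 23:30, 23:50 and 00:40 the next day, all three land in one session, since no gap exceeds 90 minutes and midnight is not treated as a boundary.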
  
RMF Data Flow

[Diagram: several Search Servers → Fetch Logs]
  
From Logs to Metrics & Statistics

•  Remove ‘noise’ (e.g. test queries)
•  Identify session boundaries
•  Associate clicks with queries
•  Compute search goodness metrics for queries and sessions
•  Compute statistics on aggregated data: Abandonment, MRR, DCG
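
As a rough sketch of the "remove noise" and "associate clicks with queries" steps, the function below attaches each click to the most recent preceding query in a session. The event tuples, the name `associate_clicks`, and the rule that a query consisting of the word "test" is noise are all assumptions made for illustration; the slides do not say how RMF does this.

```python
def associate_clicks(events, is_noise=lambda q: q.strip().lower() == "test"):
    """Attach each click to the most recent preceding query in one session.

    `events` is a time-sorted list of ("query", text) or ("click", rank)
    tuples. Queries flagged by `is_noise` are dropped together with any
    clicks that follow them; clicks seen before any query are ignored.
    """
    queries = []
    skipping = False  # True while the latest query was classified as noise
    for kind, value in events:
        if kind == "query":
            skipping = is_noise(value)
            if not skipping:
                queries.append({"query": value, "clicked_ranks": []})
        elif kind == "click" and queries and not skipping:
            queries[-1]["clicked_ranks"].append(value)
    return queries
```

The per-query `clicked_ranks` lists produced here are the kind of input that click-based metrics such as abandonment, MRR and DCG need.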
  
Data Flow: Metrics, Statistics Generation

[Diagram: two parallel pipelines]
Session Data → Session Metrics Calculator → Session Metrics → Session Statistics Calculator → Session Stats
Query Data → Query Metrics Calculator → Query Metrics → Query Statistics Calculator → Query Stats
  
Metrics Computed: Abandonment

Search Abandonment
•  Intuition: Good results lead to clicks

   So compute:
    •  % queries with no clicks on results
    •  % sessions with no clicks on results

   Usually lower abandonment is better.
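
Both abandonment percentages reduce to the same calculation over click counts. The sketch below assumes clicks have already been counted per query and grouped by session; the function name and the sample data are hypothetical.

```python
def abandonment_rate(click_counts):
    """Percentage of items (queries or sessions) with no clicks on results."""
    if not click_counts:
        raise ValueError("need at least one query or session")
    abandoned = sum(1 for n in click_counts if n == 0)
    return 100.0 * abandoned / len(click_counts)


# Hypothetical data: number of clicks per query, grouped by session.
sessions = [[0, 2], [0, 0], [1]]

# Query level: 3 of the 5 queries received no click.
query_abandonment = abandonment_rate([n for s in sessions for n in s])

# Session level: 1 of the 3 sessions received no click at all.
session_abandonment = abandonment_rate([sum(s) for s in sessions])
```

The two levels can differ a lot: here most queries are abandoned, but most sessions still end with at least one click.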
  
Metrics Computed: MRR

Mean Reciprocal Rank (MRR)
Intuition: Relevant results should rank high

Compute:
  1/(Rank of top-ranked clicked result)

  Higher MRR is better!

•  Click on result #3 → MRR = 0.33 (= 1/3)
•  MRR = 0.15 → First good result around rank 6 (~ 1/(0.15)) ☹
•  Best MRR = 1.0 !
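
A minimal sketch of the MRR calculation, assuming each query is represented by the list of result ranks that were clicked (rank 1 is the top result). Scoring a query with no clicks as 0 is an assumption; the slides do not say how RMF treats that case.

```python
def reciprocal_rank(clicked_ranks):
    """1 / (rank of the top-ranked clicked result); 0.0 if nothing was clicked."""
    return 1.0 / min(clicked_ranks) if clicked_ranks else 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of per-query clicked-rank lists."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(ranks) for ranks in queries) / len(queries)
```

Only the best-ranked click counts: clicks on results 5, 3 and 7 give the same reciprocal rank as a single click on result 3.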
  
Metrics Computed: DCG

Discounted Cumulative Gain (DCG)
   Intuition: Best to have relevant results in the ‘right’ order
   So:
    •  More points for top-ranking results clicked.
       Discounted as you go down the result set.
    •  Cumulated across all clicks for a query
   •  Typical formula for DCG at rank p,
      if $r_i$ is the relevance of the result at rank i:

      $$\mathrm{DCG}_p = r_1 + \sum_{i=2}^{p} \frac{r_i}{\log_2 i}$$

   •  We assume clicks imply relevance
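
The formula on this slide translates directly into code. The sketch below assumes binary relevance derived from clicks (1 if the result at a rank was clicked, 0 otherwise), which is one simple way to apply "clicks imply relevance"; both function names are hypothetical.

```python
import math


def dcg_at(relevances, p):
    """DCG at rank p: r_1 plus the sum of r_i / log2(i) for i = 2..p."""
    rs = relevances[:p]
    if not rs:
        return 0.0
    return rs[0] + sum(r / math.log2(i) for i, r in enumerate(rs[1:], start=2))


def clicks_to_relevance(clicked_ranks, p):
    """Binary relevance list for ranks 1..p: 1 where the rank was clicked."""
    clicked = set(clicked_ranks)
    return [1 if rank in clicked else 0 for rank in range(1, p + 1)]
```

For clicks on ranks 1, 2 and 4, the relevance list is `[1, 1, 0, 1]` and the DCG at rank 4 is 1 + 1/log2(2) + 0 + 1/log2(4) = 2.5.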
  
Challenges with Log Data

•  Dealing with query or click spam/noise
    •  Remove expired sessions
    •  Mark spam as “suspect”, exclude it
    •  Note: If relatively little spam, minimal effect on metrics
•  Assigning queries to sessions
    •  Ideally, a session = one user + one information need
•  Measuring relevance
    •  Clicks: imperfect proxy for relevance/user satisfaction
•  Distinguishing real changes in relevance from other causes (e.g. academic calendar)

Be Pragmatic!
  
How is RMF helpful?
Some examples…
  
Impact: Valuable Data Source

Aggregated, cleaned data useful for Autocomplete and Query suggestions
  
Impact: Great for Analyses

How many results per page (RPP) is optimal?
More not always better:
    •  Too many → user has to wait
    •  Too few → user has to keep going to the Next link
    •  RPP was 25, is 10 OK?
    •  Used RMF to model click rate changes; verified in production

    Yes, users can still change RPP ☺
  
Session Abandonment by #Terms in 1st Query

[Chart: session abandonment (smaller is better) against number of search terms in query, x-axis from 0 to 12]
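
A breakdown like the one charted here can be produced by grouping sessions on the length of their first query. This is a hedged sketch: the input shape `(first_query_text, total_clicks)` and the function name are hypothetical, and counting whitespace-separated tokens is a simplification that would undercount terms in queries written without spaces, such as Chinese.

```python
from collections import defaultdict


def abandonment_by_first_query_terms(sessions):
    """Map #terms in a session's first query to % of such sessions with no clicks.

    `sessions` is a list of (first_query_text, total_clicks) pairs.
    """
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for first_query, clicks in sessions:
        n_terms = len(first_query.split())  # whitespace tokens only
        totals[n_terms] += 1
        if clicks == 0:
            abandoned[n_terms] += 1
    return {n: 100.0 * abandoned[n] / totals[n] for n in sorted(totals)}
```

Each key in the result is a term count and each value is the session abandonment percentage for sessions whose first query had that many terms.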
  
Impact: Improving User Experience

•  Abandonment seen as very different on short vs. long queries
    •  Similar behavior seen in web search
    •  Questions:
        •  Why?
        •  How can we improve the (re)search experience?
    •  Web search: Need to infer intent, or segment search

Ongoing work…
  
Session Abandonment vs. time

[Chart: session abandonment over time, with annotations for Thanksgiving, Christmas, and Data Loss]
  
Impact: Data Use Plan

•  Huge variance in data across time
    •  e.g. behavior during, at the end of, and after semester
        •  [any guesses why?]
•  Cannot use small segments of data for decision-making
    •  Need to use stratified samples, across time
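
One way to stratify across time is to sample the same number of sessions from every calendar month, so that no single part of the academic year dominates. The sketch below is illustrative only: using (year, month) as the stratum, the input shape `(session_date, payload)`, and the function name are assumptions not taken from the slides.

```python
import random
from collections import defaultdict
from datetime import date


def stratified_sample(sessions, per_stratum, seed=0):
    """Draw up to `per_stratum` sessions from every (year, month) stratum.

    `sessions` is a list of (session_date, payload) pairs. A fixed seed
    keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for day, payload in sessions:
        strata[(day.year, day.month)].append((day, payload))

    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

A month with fewer sessions than `per_stratum` contributes everything it has, so sparse periods are still represented.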
  
Takeaways

•  Relevance Metrics Framework
    •  Using what people do, not say they do: Actions vs. Words
    •  Sessions as a concept
    •  Going from logs to metrics to statistics
    •  Challenges in using this data
•  Some insights we gained:
    •  Valuable in many ways

    → Continual Improvements in Summon
  
Thanks!

Thanks to Ted, and to the Summon team!

Questions??

Contact us:
Susan.Price@SerialsSolutions.com
Raman.Chandrasekar@SerialsSolutions.com
@synthesiser
  

Contenu connexe

Similaire à Actions speak louder than words: Analyzing large-scale query logs to improve the research experience

A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataDatabricks
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
Internet-scale Real-time Code Clone Search via Multi-level Indexing
Internet-scale Real-time Code Clone Search via Multi-level IndexingInternet-scale Real-time Code Clone Search via Multi-level Indexing
Internet-scale Real-time Code Clone Search via Multi-level Indexingimanmahsa
 
NYJavaSIG - Big Data Microservices w/ Speedment
NYJavaSIG - Big Data Microservices w/ SpeedmentNYJavaSIG - Big Data Microservices w/ Speedment
NYJavaSIG - Big Data Microservices w/ SpeedmentSpeedment, Inc.
 
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor AppsLibrato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor AppsHeroku
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceTed Dunning
 
"Search, APIs,Capability Management and the Sensis Journey"
"Search, APIs,Capability Management and the Sensis Journey""Search, APIs,Capability Management and the Sensis Journey"
"Search, APIs,Capability Management and the Sensis Journey"Lucidworks (Archived)
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopDataWorks Summit
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiVijay Susheedran C G
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introductionNeeraj Tewari
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrDataWorks Summit
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)Amazon Web Services
 
2010 10-building-global-listening-platform-with-solr
2010 10-building-global-listening-platform-with-solr2010 10-building-global-listening-platform-with-solr
2010 10-building-global-listening-platform-with-solrLucidworks (Archived)
 
Machine Learning Application Development
Machine Learning Application DevelopmentMachine Learning Application Development
Machine Learning Application DevelopmentLARCA UPC
 
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Institute of Contemporary Sciences
 
Oracle InMemory hardcore edition
Oracle InMemory hardcore editionOracle InMemory hardcore edition
Oracle InMemory hardcore editionAlexander Tokarev
 
Labmatrix Slides 2011 05
Labmatrix Slides 2011 05Labmatrix Slides 2011 05
Labmatrix Slides 2011 05bhughes26
 

Similaire à Actions speak louder than words: Analyzing large-scale query logs to improve the research experience (20)

A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
Internet-scale Real-time Code Clone Search via Multi-level Indexing
Internet-scale Real-time Code Clone Search via Multi-level IndexingInternet-scale Real-time Code Clone Search via Multi-level Indexing
Internet-scale Real-time Code Clone Search via Multi-level Indexing
 
NYJavaSIG - Big Data Microservices w/ Speedment
NYJavaSIG - Big Data Microservices w/ SpeedmentNYJavaSIG - Big Data Microservices w/ Speedment
NYJavaSIG - Big Data Microservices w/ Speedment
 
FAST Search for SharePoint
FAST Search for SharePointFAST Search for SharePoint
FAST Search for SharePoint
 
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor AppsLibrato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
 
"Search, APIs,Capability Management and the Sensis Journey"
"Search, APIs,Capability Management and the Sensis Journey""Search, APIs,Capability Management and the Sensis Journey"
"Search, APIs,Capability Management and the Sensis Journey"
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over Hadoop
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in Chennai
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introduction
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
2010 10-building-global-listening-platform-with-solr
2010 10-building-global-listening-platform-with-solr2010 10-building-global-listening-platform-with-solr
2010 10-building-global-listening-platform-with-solr
 
Machine Learning Application Development
Machine Learning Application DevelopmentMachine Learning Application Development
Machine Learning Application Development
 
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
 
Data mining applications
Data mining applicationsData mining applications
Data mining applications
 
Oracle InMemory hardcore edition
Oracle InMemory hardcore editionOracle InMemory hardcore edition
Oracle InMemory hardcore edition
 
Labmatrix Slides 2011 05
Labmatrix Slides 2011 05Labmatrix Slides 2011 05
Labmatrix Slides 2011 05
 

Dernier

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 

Dernier (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 

Actions speak louder than words: Analyzing large-scale query logs to improve the research experience

  • 1. Actions Speak Louder than Words: Analyzing large-scale query logs to improve the research experience. Ted.Diamond, Susan.Price, Raman.Chandrasekar. Code4Lib 2013
  • 2. Overview • Summon® • The Relevance Metrics Framework (RMF): helping us go from user actions to metrics to a better user experience • Goals • Data flow in RMF • Query Sessions • From Logs to Statistics • Metrics Computed • Challenges in RMF • Insights from RMF • Summary
  • 3. The Summon® Discovery Service • Hosted/Software as a Service • Match & Merge combines rich metadata and full text from multiple sources • Single unified index • 1 billion+ items • > 500 customers • Relevancy in > 17 languages
  • 4. The Summon Relevance Metrics Framework (RMF)
  • 5. RMF Goals • Observe and log user actions • Queries, types of queries • Features used (e.g. filters, advanced search) • Click patterns • Compute quality of search results • Metrics from user behavior, such as clicks • Analyze data to improve search results and enhance the research experience
  • 6. Data Acquired in RMF • Queries • Query terms • Filters • Advanced syntax • Clicks. We collect all queries and clicks, not just a sample → large logs. Sampling will not cover the long tail of queries. Sample queries: • anniversary 9/11 • kidney stone • moore vs mack • moore vs mack trucks • Armee Deutschland Rolle in Einheit • body art in the workplace • the moplah rebellion of 1921 john j. banning • 平凡的世界 • 9780140027631 • "boundary dispute" Title:(india) • SubjectTerms:"孙少平" • TitleCombined: (Analytical Biochemistry) s.fvf(ContentType, Book Review, t) • Hawthorne, Mark. "The Tale of Tattoos: The history and culture of body art in India and abroad."
  • 7. Key RMF concept: Session • "Finding" often spans multiple queries. • Users add/remove filters; change search terms • We define search sessions • Sequence of events with the same session ID, with: • No breaks > 90 minutes • Total elapsed time <= 8 hours • Possibly spanning a day boundary • Data grouped by session, sorted by time. Different level of abstraction, more robust.
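The session rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input shape (pairs of session ID and timestamp) and the names `split_sessions`, `MAX_GAP`, and `MAX_DURATION` are hypothetical and do not come from the RMF itself.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=90)     # no breaks longer than 90 minutes
MAX_DURATION = timedelta(hours=8)   # total elapsed time of at most 8 hours


def split_sessions(events):
    """Split (session_id, timestamp) pairs into search sessions.

    The input shape is an assumption made for this sketch. Timestamps are
    full datetimes, so a session may span a day boundary.
    """
    sessions = []
    current = []
    # Sorting the pairs groups events by session ID and orders them by time.
    for sid, ts in sorted(events):
        if current:
            last_sid, last_ts = current[-1]
            first_ts = current[0][1]
            same_session = (
                sid == last_sid
                and ts - last_ts <= MAX_GAP
                and ts - first_ts <= MAX_DURATION
            )
            if not same_session:
                sessions.append(current)
                current = []
        current.append((sid, ts))
    if current:
        sessions.append(current)
    return sessions
```

With these rules, one logged session ID can yield several search sessions, for example when a user returns after a two-hour pause.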
  • 8. RMF Data Flow [Diagram: multiple Search Servers → Fetch Logs]
  • 9. From Logs to Metrics & Statistics • Remove 'noise' (e.g. test queries) • Identify session boundaries • Associate clicks with queries • Compute search goodness metrics for queries and sessions • Compute statistics on aggregated data: Abandonment, MRR, DCG
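As a rough sketch of the "remove noise" and "associate clicks with queries" steps, the function below attaches each click to the most recent preceding query within one session. The event dictionaries, their keys, and the `is_noise` predicate are assumptions made for illustration; the slides do not describe the actual log format or noise rules.

```python
def attach_clicks(events, is_noise=lambda text: False):
    """Attach clicks to the most recent preceding query in one session.

    `events` is a time-sorted list of dicts (a hypothetical log shape):
      {"type": "query", "text": "..."} or {"type": "click", "rank": 3}
    Queries flagged by `is_noise` are dropped along with their clicks.
    """
    queries = []
    current = None
    for ev in events:
        if ev["type"] == "query":
            if is_noise(ev["text"]):
                current = None  # clicks that follow a noise query are ignored
            else:
                current = {"text": ev["text"], "clicked_ranks": []}
                queries.append(current)
        elif ev["type"] == "click" and current is not None:
            current["clicked_ranks"].append(ev["rank"])
    return queries
```

The per-query lists of clicked ranks are the kind of input that click-based metrics such as abandonment, MRR, and DCG need.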
  • 10. Data Flow: Metrics, Statistics Generation [Diagram, two parallel pipelines: Session Data → Session Metrics Calculator → Session Metrics → Session Statistics Calculator → Session Stats; Query Data → Query Metrics Calculator → Query Metrics → Query Statistics Calculator → Query Stats]
  • 11. Metrics Computed: Abandonment. Search Abandonment • Intuition: Good results lead to clicks. So compute: • % queries with no clicks on results • % sessions with no clicks on results. Usually lower abandonment is better.
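Both abandonment percentages can be sketched as below. The inputs are assumed to be per-query click counts, grouped per session for the session-level variant; the function names are hypothetical.

```python
def abandonment_rate(click_counts):
    """Percentage of queries with no clicks on results."""
    counts = list(click_counts)
    if not counts:
        raise ValueError("need at least one query")
    abandoned = sum(1 for c in counts if c == 0)
    return 100.0 * abandoned / len(counts)


def session_abandonment_rate(sessions):
    """Percentage of sessions in which no query received a click.

    `sessions` is a list of sessions, each a list of per-query click counts.
    """
    return abandonment_rate(sum(counts) for counts in sessions)
```

For example, `abandonment_rate([0, 2, 0, 1])` counts two abandoned queries out of four.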
  • 12. Metrics Computed: MRR. Mean Reciprocal Rank (MRR). Intuition: Relevant results should rank high. Compute: 1/(rank of top-ranked clicked result). Higher MRR is better! Examples: • Click on result #3 → MRR = 0.33 (= 1/3) • MRR = 0.15 → first good result around rank 6 or 7 (1/0.15 ≈ 6.7) :( • Best MRR = 1.0!
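A minimal sketch of the MRR computation follows. Scoring a query with no clicks as 0 is an assumption here (a common convention); the slides do not say how such queries are handled, and the function names are hypothetical.

```python
def reciprocal_rank(clicked_ranks):
    """1 / (rank of the top-ranked clicked result); 0.0 if nothing was clicked.

    Treating click-less queries as 0.0 is an assumption of this sketch.
    """
    return 1.0 / min(clicked_ranks) if clicked_ranks else 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over queries.

    `queries` is a list where each item is the list of 1-based ranks
    clicked for that query.
    """
    queries = list(queries)
    if not queries:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(ranks) for ranks in queries) / len(queries)
```

A single click on result #3 gives a reciprocal rank of 1/3, matching the slide's example.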
  • 13. Metrics Computed: DCG. Discounted Cumulative Gain (DCG). Intuition: Best to have relevant results in the 'right' order. So: • More points for top-ranking results clicked. Discounted as you go down the result set. • Cumulated across all clicks for a query • Typical formula for DCG at rank p, if r_i is the relevance of the result at rank i: DCG_p = r_1 + Σ_{i=2..p} (r_i / log2(i)) • We assume clicks imply relevance
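The DCG formula on this slide can be sketched with binary, click-based relevance: a clicked rank has relevance 1 and any other rank has relevance 0. The binary grading and the name `dcg_at` are assumptions for illustration; the slides state only that clicks are taken to imply relevance.

```python
import math


def dcg_at(p, clicked_ranks):
    """DCG at rank p: r_1 + sum over i = 2..p of r_i / log2(i).

    Relevance r_i is 1.0 if rank i was clicked and 0.0 otherwise
    (binary grading is an assumption of this sketch).
    """
    clicked = set(clicked_ranks)
    total = 0.0
    for i in range(1, p + 1):
        rel = 1.0 if i in clicked else 0.0
        # Rank 1 is not discounted; lower ranks are divided by log2(rank).
        total += rel if i == 1 else rel / math.log2(i)
    return total
```

Clicks at ranks 1 and 4 give 1 + 1/log2(4) = 1.5, while a click below the cutoff p contributes nothing.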
  • 14. Challenges with Log Data • Dealing with query or click spam/noise • Remove expired sessions • Mark spam as "suspect", exclude it • Note: If relatively little spam, minimal effect on metrics • Assigning queries to sessions • Ideally, a session = one user + one information need • Measuring relevance • Clicks: imperfect proxy for relevance/user satisfaction • Distinguishing real changes in relevance from other causes (e.g. academic calendar). Be Pragmatic!
  • 15. How is RMF helpful? Some examples…
  • 16. Impact: Valuable Data Source. Aggregated, cleaned data useful for Autocomplete and Query suggestions
  • 17. Impact: Great for Analyses. How many results per page (RPP) is optimal? More not always better: • Too many → user has to wait • Too few → user has to keep going to the Next link • RPP was 25, is 10 OK? • Used RMF to model click rate changes; verified in production. Yes, users can still change RPP :)
  • 18. Session Abandonment by #Terms in 1st Query [Chart: session abandonment (smaller is better) vs. number of search terms in query, x-axis 0 to 12]
  • 19. Impact: Improving User Experience • Abandonment seen as very different on short vs. long queries • Similar behavior seen in web search • Questions: • Why? • How can we improve the (re)search experience? • Web search: Need to infer intent, or segment search. Ongoing work…
  • 20. Session Abandonment vs. time [Chart: session abandonment over time, with annotations at Thanksgiving, Christmas, and a Data Loss period]
  • 21. Impact: Data Use Plan • Huge variance in data across time • e.g. behavior during, at the end of, and after semester • [any guesses why?] • Cannot use small segments of data for decision-making • Need to use stratified samples, across time
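One way to draw a sample stratified across time is sketched below: records are grouped by a time period and the same number is drawn from each period. The record shape, the `stratum_of` key function, and equal allocation per stratum are all assumptions for illustration, not the method the RMF team describes.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each time stratum.

    `stratum_of` maps a record to its stratum key (for example a week
    number); both it and the equal allocation are assumptions here.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Sampling this way keeps any single period, such as the end of a semester, from dominating the data used for a decision.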
  • 22. Takeaways • Relevance Metrics Framework • Using what people do, not say they do: Actions vs. Words • Sessions as a concept • Going from logs to metrics to statistics • Challenges in using this data • Some insights we gained: • Valuable in many ways → Continual Improvements in Summon
  • 23. Thanks! Thanks to Ted, and to the Summon team! Questions?? Contact us: Susan.Price@SerialsSolutions.com Raman.Chandrasekar@SerialsSolutions.com @synthesiser