Faceted Search and Solr

Faceted Search
New York CTO Club
December 9, 2009
Daniel Tunkelang, Google
Otis Gospodnetić, Sematext

“Regular” Search
Interface:
• User expresses information need as short query.
• Search engine returns ranked, pageable result set.
User happy when...
• Top-ranked result satisfies information need.
• At least some result on first page is relevant.
User unhappy when...
• No result on first page satisfies information need.
• Results misleadingly appear relevant (bait and switch).

Agenda
Daniel:
• What is faceted search?
• Why use faceted search?
• Thoughts about design and user experience.
Otis:
• What are Lucene and Solr?
• Why use an open-source search library?
• Thoughts about implementation.

Relevance Is Subjective
Relevance is defined as a measure of information conveyed by a document relative to a query. It is shown that the relationship between the document and the query, though necessary, is not sufficient to determine relevance.
William Goffman, On relevance as a measure, 1964.

Regular Search Experience

What is Faceted Search?
• Best understood through examples.
  - See the following slides.
  - Or shop on almost any ecommerce site.
• Facets = multiple ways to organize information.
  - Often based on available structured information.
  - But not always, e.g., facets obtained via text mining.
• Typical interaction:
  - User starts with a full-text search.
  - Facets guide query refinement process.
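
The typical interaction on the "What is Faceted Search?" slide can be sketched in a few lines. This is a toy illustration only: the catalog, the field names (`brand`, `color`) and the helper functions are all hypothetical and not part of the talk. It runs a naive full-text match, counts facet values over the result set, then refines by one facet value.

```python
from collections import Counter

# Hypothetical toy catalog; every record and field name is invented for illustration.
DOCS = [
    {"title": "red running shoes", "brand": "Acme", "color": "red"},
    {"title": "blue running shoes", "brand": "Acme", "color": "blue"},
    {"title": "red rain boots", "brand": "Bolt", "color": "red"},
    {"title": "blue trail shoes", "brand": "Bolt", "color": "blue"},
]


def search(query, filters=None):
    """Return docs whose title contains every query word and that match all filters."""
    words = query.lower().split()
    filters = filters or {}
    return [
        d for d in DOCS
        if all(w in d["title"].split() for w in words)
        and all(d.get(field) == value for field, value in filters.items())
    ]


def facet_counts(results, fields):
    """Count how many results carry each value of each facet field."""
    return {f: Counter(d[f] for d in results if f in d) for f in fields}


# Step 1: the user starts with a full-text search.
results = search("shoes")
# Step 2: facet counts over the result set guide the refinement.
counts = facet_counts(results, ["brand", "color"])
# Step 3: the user picks a facet value and the result set narrows.
refined = search("shoes", {"color": "blue"})
```

A real engine computes these counts inside the index rather than by scanning results, but the shape of the interaction is the same.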

Assumptions Are Dangerous
tf-idf
PageRank
• self-awareness
• self-expression
• model knows best
• answer is a document
• one-shot query

Faceted Search for News

Faceted Search for People

Faceted Search for Breakfast

But Facets are Not a Silver Bullet...
• Screen real estate is finite.
  - Choose facets wisely.
  - Choose facet values wisely for monster facets.
• Multiple selection within a facet is powerful, but...
  - Has to be intuitive, especially AND vs. OR.
  - Even trickier for hierarchical facets.
• Search relevance still matters!
  - Most faceted search applications rank results.
  - Irrelevant results ==> irrelevant facet refinements.
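
To see why AND vs. OR within a facet needs care, here is a minimal sketch under invented assumptions: the items, the facet names (`color`, `features`) and the `modes` switch are hypothetical and only illustrate the two semantics. Selections are always ANDed across facets; within one facet they are either ORed (any chosen value matches) or ANDed (all chosen values must be present).

```python
# Hypothetical items; each facet holds a set of values so multi-valued facets work.
ITEMS = [
    {"name": "tent", "color": {"red"}, "features": {"waterproof", "lightweight"}},
    {"name": "jacket", "color": {"blue"}, "features": {"waterproof"}},
    {"name": "pack", "color": {"green"}, "features": {"lightweight"}},
]


def matches(item, selections, modes):
    """AND across facets; within a facet use OR (default) or AND as given by modes."""
    for facet, chosen in selections.items():
        if not chosen:
            continue  # nothing selected in this facet, so it does not restrict
        values = item.get(facet, set())
        if modes.get(facet, "or") == "and":
            if not chosen <= values:      # every chosen value must be present
                return False
        elif not (chosen & values):       # at least one chosen value must be present
            return False
    return True


def select(selections, modes):
    return [item["name"] for item in ITEMS if matches(item, selections, modes)]


selections = {"color": {"red", "blue"}, "features": {"waterproof", "lightweight"}}
or_within = select(selections, {})                       # OR inside every facet
and_features = select(selections, {"features": "and"})   # AND inside "features" only
```

The same clicks give two different result sets depending on the mode, which is why the interface has to make the chosen semantics obvious.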

Exploring Information Science

Be Careful with Faceted Search!
Cameras have artists?!

Deliver Precision and Recall

Clarify, Then Refine
Easier said than done!
Ranking of facet values is an open research topic.

Take-Aways
• Faceted search addresses the subjectivity of relevance and information overload.
• But deploying faceted search effectively requires that you think about user experience.
• Recommended reading:
  - My thin book entitled Faceted Search
  - Marti Hearst's book on Search User Interfaces
  - Peter Morville's upcoming book on Search Patterns

Faceted Search with Lucene & Solr
Otis Gospodnetić, Sematext

What is / isn't Lucene
• Free, ASL, Java IR library, Jar
• Doug Cutting, ASF, 2001
• Application agnostic: Indexing & Searching
• High performance, scalable
• No dependencies
• Heavily ported
• No: crawler, rich doc parser, turn-key solution
• No: out of the box faceted search-capability... but...

What is/isn't Solr
• Indexing/Search server with HTTP API built on top of Lucene
• Fast & scalable (distributed search, index replication)
• XML, JSON, Ruby, Perl, PHP, javabin
• No: crawler (but Nutch ==> Solr works)
• Yes: rich text parser
• Yes: Faceted Search out of the box!

Facet Field Requirements
• Must be indexed
• Often not tokenized
• Often not altered (lowercase, punctuation)
• Storing not required
• Multivalued fields OK

Solr and Faceted Search
• 3 types of facets: Field Values (text), Dates, Queries.
• “Text”: return counts for all/top terms in a field for a result set - e.g. categories a la Amazon
• Dates: return counts for docs in specified date ranges
• Queries: return counts for docs that also match a given query - handy for number ranges (think prices!)

Turn It On
• 0 facets:
  http://host:80/solr/select?q=foo
• 1 facet:
  http://host:80/solr/select?q=foo&facet=true&facet.field=category
• N facets:
  http://host:80/solr/select?q=foo&facet=true&facet.field=category&facet.field=inStock
• Enable with facet=true or facet=on
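
The three "Turn It On" URLs differ only in their query parameters, so a client can assemble them from a list of facet fields. A minimal sketch, assuming the slide's placeholder endpoint `http://host:80/solr/select`; the helper name `facet_url` is hypothetical. Parameters are passed as a list of pairs because `facet.field` repeats once per facet.

```python
from urllib.parse import urlencode


def facet_url(base, query, facet_fields=()):
    """Build a Solr select URL, adding facet parameters only when fields are given."""
    params = [("q", query)]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field is repeated once per requested facet.
        params.extend(("facet.field", field) for field in facet_fields)
    return base + "?" + urlencode(params)


BASE = "http://host:80/solr/select"  # placeholder host taken from the slide

no_facets = facet_url(BASE, "foo")
one_facet = facet_url(BASE, "foo", ["category"])
n_facets = facet_url(BASE, "foo", ["category", "inStock"])
```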

Text Facet Response
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="category">
        <int name="electronics">3</int>
        <int name="copier">0</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
• facet.mincount=1 to avoid 0-count facet values
• facet.limit=N to limit to top N facet values
• facet.missing=true to catch uncategorized
• lots of other options!

Date Facet Response
  <result name="response" numFound="42" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_dates">
      <lst name="timestamp">
        <int name="2007-08-11T00:00:00.000Z">1</int>
        <int name="2007-08-12T00:00:00.000Z">5</int>
        <int name="2007-08-13T00:00:00.000Z">3</int>
        <int name="2007-08-14T00:00:00.000Z">7</int>
        <int name="2007-08-15T00:00:00.000Z">2</int>
        <int name="2007-08-16T00:00:00.000Z">16</int>
        <str name="gap">+1DAY</str>
        <date name="end">2007-08-17T00:00:00Z</date>
      </lst>
    </lst>
  </lst>
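
On the client side, the text facet response is plain XML and can be read with a standard parser. The sketch below is an assumption-laden illustration, not the talk's code: it wraps the slide's sample in a `<response>` root element so that it parses as a single document, and the helper name `facet_field_counts` is hypothetical. The last line drops zero counts client-side, which only imitates what `facet.mincount=1` asks the server to do.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the text facet response slide, wrapped in a <response> root.
SAMPLE = """<response>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="category">
        <int name="electronics">3</int>
        <int name="copier">0</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>"""


def facet_field_counts(xml_text):
    """Return {field: {value: count}} from the facet_fields section of a response."""
    root = ET.fromstring(xml_text)
    fields = root.find("./lst[@name='facet_counts']/lst[@name='facet_fields']")
    if fields is None:
        return {}
    return {
        field.get("name"): {v.get("name"): int(v.text) for v in field.findall("int")}
        for field in fields.findall("lst")
    }


counts = facet_field_counts(SAMPLE)
# Client-side stand-in for facet.mincount=1: hide values with a zero count.
nonzero = {f: {k: n for k, n in vals.items() if n > 0} for f, vals in counts.items()}
```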

Date Facets
• http://.../solr/select/?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=NOW/DAY-5DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY
• (%2B1 ==> +1)
• Solr Date Math Parser syntax: /HOUR, +2YEARS, -1DAY, /DAY+6MONTHS+3DAYS, +6MONTHS+3DAYS/DAY

Query Facets
• http://.../solr/select?q=shoes&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
• Avoids the bucket-at-index-time work-around
• Keep queries disjoint
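
The escaping in these URLs (`%2B` for a literal plus, `+` for a space) is easy to get wrong by hand, so it can help to let a library produce the query strings. A minimal sketch: the two helper names are hypothetical, and the `safe` character set is chosen here only so the output stays as readable as the slide's URLs. Note that both price ranges on the slide are inclusive, so a price of exactly 500 would be counted in both buckets.

```python
from urllib.parse import urlencode

# Characters left unescaped purely for readability; an assumption of this sketch.
SAFE = "/:*[]"


def date_facet_query(field, start, end, gap):
    """Query string for a date facet over `field`, using Solr date math values."""
    return urlencode([
        ("q", "*:*"), ("rows", "0"), ("facet", "true"),
        ("facet.date", field),
        ("facet.date.start", start),
        ("facet.date.end", end),
        ("facet.date.gap", gap),
    ], safe=SAFE)


def query_facet_query(q, facet_field, facet_queries):
    """Query string with one field facet plus one facet.query per range."""
    params = [("q", q), ("rows", "0"), ("facet", "true"), ("facet.field", facet_field)]
    params += [("facet.query", fq) for fq in facet_queries]
    return urlencode(params, safe=SAFE)


# A literal "+" becomes %2B; a space becomes "+".
date_qs = date_facet_query("timestamp", "NOW/DAY-5DAYS", "NOW/DAY+1DAY", "+1DAY")
price_qs = query_facet_query("shoes", "inStock", ["price:[* TO 500]", "price:[500 TO *]"])
```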

Query Facet Response
  <result numFound="3" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries">
      <int name="price:[* TO 500]">3</int>
      <int name="price:[500 TO *]">1</int>
    </lst>
    <lst name="facet_fields">
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>

State of Lucene & Solr
• Super healthy community, exploding development
• Lucene 3.0 – 2009-11-25:
  - Performance, faster range queries, clean API, better Unicode support, more non-English support
• Solr 1.4 – 2009-11-10:
  - Performance, new replication, DB indexing, rich-doc indexing, results clustering, faster response protocol, deduplication...

UI Integration
• Use Filter Queries via fq
• http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]
• http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]&fq=inStock:true
• Important: single request does it all

Lucene, Solr, Enterprise
• Free: Community
  - Lucene ~600 emails/month (dev: 2000/month)
  - Solr ~1300 emails/month (dev: 800/month)
• Commercial: Support Subscriptions
  - Sematext
  - Lucid Imagination
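
Returning to the UI Integration slide: each click on a facet value adds (or removes) one `fq` parameter, and the next request carries the query, the facets and all active filters together. A minimal sketch of that loop; the endpoint `http://localhost:8983/solr/select` and both helper names are assumptions made for illustration, since the slide elides the host.

```python
from urllib.parse import urlencode


def toggle_filter(filters, field, value):
    """Return a new filter list with field:value added, or removed if already active."""
    fq = f"{field}:{value}"
    if fq in filters:
        return [f for f in filters if f != fq]
    return filters + [fq]


def search_url(base, query, facet_fields, filters):
    """One request carries the query, the facet fields and every active filter."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    params += [("fq", f) for f in filters]
    return base + "?" + urlencode(params, safe=":[]")


BASE = "http://localhost:8983/solr/select"  # hypothetical endpoint

filters = toggle_filter([], "price", "[0 TO 300]")      # user clicks a price range
filters = toggle_filter(filters, "inStock", "true")     # then clicks "in stock"
url = search_url(BASE, "shoes", ["category"], filters)
after_undo = toggle_filter(filters, "inStock", "true")  # clicking again removes it
```

Keeping the filters as a plain list makes undoing a refinement as simple as rebuilding the URL from the shorter list.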