Solr at AOL, Presented by Sean Timm at SolrExchange DC
2. Solr and Lucene @ AOL
SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING
3. 1999
• “Believe” (Cher) and “Livin’ la Vida Loca” (Ricky Martin)
• The Matrix and The Phantom Menace
• Windows 98 Second Edition
• AltaVista, Northern Light, Yahoo, ODP, Inktomi
– Google
• PPC text search ads invented in 1998
– Banner ads
4. A Brief History of Search @ AOL
• Acquired PLS in 1998
• AOL Search used ODP
• Site Search
• Local Search
• Built into AOL Server
• CPL
– VSM then BM25
– Phrase, numeric, date, text, and proximity boosting
– Conflation classes (like synonyms)
5. Relevance
• Precision/recall
• “free alcohol” vs. “alcohol free”
• Lawyer versus Attorney
• Iron and ironic same stem (Porter)
• Beyonce vs. Beyoncé
• Eagles
–Bird, sports teams, band, AMC Eagle
• F 15, F-15, F15
• FREAK
[Figure: Relevant vs. Retrieved]
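As a rough illustration of the precision/recall bullet on this slide, the two measures can be computed from the set of relevant documents and the set of retrieved documents. This is a minimal sketch, not anything from the talk: the function name `precision_recall` and the document ids are made up for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    # Precision: what fraction of the retrieved documents is relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant documents was retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: four relevant documents, three retrieved, two in common.
relevant_ids = {"d1", "d2", "d3", "d4"}
retrieved_ids = {"d3", "d4", "d5"}
precision, recall = precision_recall(relevant_ids, retrieved_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Most of the fixes listed on this slide trade one measure for the other: stemming and synonyms tend to raise recall, while phrase matching tends to raise precision.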
6. The Dawn of Solr
• Prohibitively expensive to continue CPL development
• Complicated deployment
• 2005: Investigating migration to Lucene
• 2006: CNET open sourced Solr
7. Contributions
• Local Lucene/Solr (superseded by SpatialSearch)
• Query Timeout
• Data Import Handler (DIH)
• Numerous smaller patches
• Committers: Noble Paul, Shalin Mangar, Patrick O’Leary
8. Contributing to Solr/Lucene
• Learn
–Join the mailing lists
•solr-user@lucene.apache.org
•dev@lucene.apache.org
–Read search and Solr related blogs
–The #solr IRC channel on freenode
9. Contributing to Solr/Lucene
• Help others
–Answer questions.
–Improve documentation in the code, the wiki, or the website.
–Make improvements to the Solr Admin UI.
10. Contributing to Solr/Lucene
• Confirm a bug
• Submit a patch for a reported bug or feature request
• Improve a patch
• Try out a patch and see if it works
11. Contributing to Solr/Lucene
• Submit your own tickets
– Bug
– Feature request
• Start with solr-user@lucene
• Discuss on dev@lucene
• Create Jira ticket, ideally with patches and unit tests
• Yonik’s Law of Patches:
– A half-baked patch in Jira, with no documentation, no tests, and no backwards compatibility is better than no patch at all.
12. Applications
• MapQuest (SpatialSearch)
• Mail
• AIM
• AOL Search
• Site Search
• News Search
• RUM
• Sarah Palin e-mails (admin)
• Demand
• Wikipedia article pattern detection
19. Related Searches
• Simple query
– User
• New York Library
– Solr query
• Lower case
• Prefer exact match “new york library”
• Use phrase slop to allow terms in same order and near each other, e.g., new york city public library
• primeQuery:"new york library" OR "new york library"~3
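The steps on this slide (lower-case the user query, prefer the exact phrase, then allow a sloppy phrase match) could be sketched as a small query builder. This is an illustration only: the helper name `related_search_query` is hypothetical, and grouping both clauses in parentheses under the `primeQuery` field is an assumption, since the slide does not say which field the second clause searches.

```python
def related_search_query(user_query, field="primeQuery", slop=3):
    """Build a Lucene-syntax query string from raw user input (sketch)."""
    # Lower-case and collapse runs of whitespace.
    phrase = " ".join(user_query.lower().split())
    # Escape backslashes and double quotes so the phrase stays well-formed.
    phrase = phrase.replace("\\", "\\\\").replace('"', '\\"')
    # Exact phrase OR the same phrase with slop; both scoped to `field`
    # (the grouping is an assumption, not taken from the slide).
    return f'{field}:("{phrase}" OR "{phrase}"~{slop})'


print(related_search_query("New York Library"))
```

A document matching the exact phrase satisfies both clauses, so it would normally score above one that only matches through slop, which is one way to read "prefer exact match".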
20. Wikipedia Traffic Correlation Schema
<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="title_norm" type="string" indexed="true" stored="true" required="true" />
<field name="total_pvs" type="long" indexed="true" stored="true" required="true" />
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
<dynamicField name="trend_*" type="int" indexed="true" stored="true"/>
<!-- page views. field name contains date string, e.g., "pvs_20110622" -->
<dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
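To show how documents might be shaped for this schema, here is a minimal sketch that builds one document with per-day `pvs_*` and `trend_*` dynamic fields. Several details are assumptions, not from the slides: the helper name `traffic_doc`, the lower-case-with-underscores normalization used for `title_norm`, `total_pvs` being the sum of the daily page views, and the encoding of trend direction as -1, 0, or 1.

```python
from datetime import date


def traffic_doc(title, daily):
    """Build a document dict; `daily` maps a date to (page_views, trend)."""
    doc = {
        "title": title,
        # Assumed normalization; the talk does not define title_norm.
        "title_norm": title.lower().replace(" ", "_"),
        # Assumed to be the sum of the daily page views.
        "total_pvs": sum(pvs for pvs, _trend in daily.values()),
    }
    for day, (pvs, trend) in daily.items():
        suffix = day.strftime("%Y%m%d")  # e.g. "20110622"
        doc[f"pvs_{suffix}"] = pvs       # matches dynamicField pvs_*
        doc[f"trend_{suffix}"] = trend   # matches dynamicField trend_*
    return doc


# Hypothetical sample data for two days.
doc = traffic_doc(
    "Apache Solr",
    {date(2011, 6, 22): (120, 1), date(2011, 6, 23): (80, -1)},
)
print(sorted(doc))
```

Because the date lives in the field name, new days need no schema change; any field whose name starts with `pvs_` or `trend_` picks up the dynamic field definition.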
22. Sarah Palin E-mail Stats
• 13,177 documents
• 4 hours from receiving data to production install
• ~150 K requests per day at launch
• Now about 6-7 K requests per day
• Running on 3 VMs in two different data centers behind a NetScaler
25. Huffington Post Comments
• Solr 4
• Uses SolrCloud
• Single shard
• ReplicationFactor 3
• Real-time
• 90 days of comments
• Tested up to 100 writes / second
26. More HuffPost comments
• Used by editors and moderators
–Topic investigation
–Troll detection
• Config
–Special features: search for emoticons, prefer exact match, date boosting
• Hack-a-thon comment clustering, timeline, and summarization
28. Relevance in Solr
• “free alcohol” vs. “alcohol free”
–Phrase queries and phrase slop
• Lawyer versus Attorney
–SynonymFilterFactory
• Iron and ironic
–KStem, or lemmatization via the SynonymFilterFactory, instead of Snowball/Porter
29. Relevance in Solr
• Beyonce vs. Beyoncé
–Various Folding Filters
• Eagles
–Boost on other fields, such as popularity, publish date
–Use related searches, facets, or clustering
• F 15, F-15, F15
–WordDelimiterFilter
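The goal behind two of these fixes, accent folding and delimiter handling, can be illustrated in plain Python. This sketch only shows the intended effect on the slide's examples; it is not how Solr's folding filters or WordDelimiterFilter are implemented, and the function names `fold_accents` and `collapse_delimiters` are made up for the example.

```python
import re
import unicodedata


def fold_accents(text):
    """Strip combining marks so 'Beyoncé' and 'Beyonce' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collapse_delimiters(text):
    """Drop spaces/hyphens between a letter and a digit: 'F 15' -> 'F15'."""
    return re.sub(r"(?<=[A-Za-z])[\s-]+(?=\d)", "", text)


print(fold_accents("Beyoncé"))
print([collapse_delimiters(v) for v in ("F 15", "F-15", "F15")])
```

Applying the same normalization at index time and at query time is what makes the variants match each other.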
30. Bringing a New Search Project Online
• Understand the domain
• Ingest (sample) data
• Clean data
• Repeat
• Relevance testing
• Scale out
• Launch/Success