SlideShare a Scribd company logo
1 of 30
Solr and Lucene @ AOL
SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING
1999
• Believe, Cher and Livin’ la Vida Loca, Ricky Martin
• The Matrix and The Phantom Menace
• Windows 98 Second Edition
• AltaVista, Northern Light, Yahoo, ODP, Inktomi
– Google
• PPC Text search ads invented 1998
– Banner ads
A Brief History of Search @ AOL
• Acquired PLS in 1998
• AOL Search used ODP
• Site Search
• Local Search
• Built into AOL Server
• CPL
– VSM then BM25
– Phrase, numeric, date, text, and
proximity boosting
– Conflation classes (like synonyms)
Relevance
• Precision/recall
• “free alcohol” vs. “alcohol free”
• Lawyer versus Attorney
• Iron and ironic  same stem (Porter)
• Beyonce vs. Beyoncé
• Eagles
–Bird, sports teams, band, AMC Eagle
• F 15, F-15, F15
• FREAK
Relevant Retrieved
The Dawn of Solr
• Prohibitively expensive to continue CPL development
• Complicated deployment
• 2005: Investigating migration to Lucene
• 2006: CNET open sourced Solr
Contributions
• Local Lucene/Solr (superseded by SpatialSearch)
• Query Timeout
• Data Import Handler (DIH)
• Numerous smaller patches
• Committers: Noble Paul, Shalin Mangar, Patrick
O’Leary
Contributing to Solr/Lucene
• Learn
–Join the mailing lists
•solr-user@lucene.apache.org
•dev@lucene.apache.org
–Read search and Solr related blogs
–The #solr IRC channel on freenode
Contributing to Solr/Lucene
• Help others
–Answer questions.
–Improve documentation in the code, the wiki, or
the website.
–Make improvements to the Solr Admin UI.
Contributing to Solr/Lucene
• Confirm a bug
• Submit a patch for a reported bug or feature
request
• Improve a patch
• Try out a patch and see if it works
Contributing to Solr/Lucene
• Submit your own tickets
– Bug
– Feature request
• Start with solr-user@lucene
• Discuss on dev@lucene
• Create Jira ticket, ideally with patches and unit tests
• Yonik’s Law of Patches:
– A half-baked patch in Jira, with no documentation, no tests, and no
backwards compatibility is better than no patch at all.
Applications
• MapQuest (SpatialSearch)
• Mail
• AIM
• AOL Search
• Site Search
• News Search
• RUM
• Sarah Palin e-mails (admin)
• Demand
• Wikipedia article pattern detection
MapQuest Discover
Travel Blogs
MQ Local Search
Related Searches
Bipartite graph snippet
Related Searches Graph
Page 18
“The Eagles”
The band
NFL
Boston College
Hotel California
Tribute
Related Searches
• Simple query
– User
• New York Library
– Solr query
• Lower case
• Prefer exact match “new york library”
• Use phrase slop to allow terms in same order and near each
other, e.g., new york city public library
• primeQuery:“new york library” OR “new york library”~3
Wikipedia Traffic Correlation Schema
<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="title_norm" type="string" indexed="true" stored="true" required="true" />
<field name="total_pvs" type="long" indexed="true" stored="true" required="true" />
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
<dynamicField name="trend_*" type="int" indexed="true" stored="true"/>
<!-- page views. field name contains date string, e.g., "pvs_20110622" -->
<dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
Temporal Traffic Correlation of Wikipedia Page
Views
Sarah Palin E-mail Stats
• 13,177 documents
• 4 hours from receiving data to production install
• ~150 K requests per day at launch
• Now about 6-7 K requests per day
• Running on 3 VMs in two different data centers
behind a NetScaler
Faceting and Clustering
Huffington Post Comments
• Solr 4
• Uses Solr Cloud
• Single shard
• ReplicationFactor 3
• Real-time
• 90 days of comments
• Tested up to 100 writes / second
More HuffPost comments
• Used by editors and moderators
–Topic investigation
–Troll detection
• Config
–Special features: search for emoticons, prefer
exact match, date boosting
• Hack-a-thon comment clustering, timeline, and
summarization
Solr Comments Architecture
Message
Queue
MongoDB
Mongo
Ingestor
Solr
Ingestor
Solr Cloud
Uses SolrJ CloudSolrServer
Tools
Server
JuLiA
Relevance in Solr
• “free alcohol” vs. “alcohol free”
–Phrase queries and phrase slop
• Lawyer versus Attorney
–SynonymFilterFactory
• Iron and ironic
–Kstem, or Lemmatization via the
SynonymFilterFactory instead of
Snowball/Porter
Relevance in Solr
• Beyonce vs. Beyoncé
–Various Folding Filters
• Eagles
–Boost on other fields, such as
popularity, publish date
–Use related searches, facets, or clustering
• F 15, F-15, F15
–WordDelimiterFilter
Bringing a New Search Project Online
• Understand the domain
• Ingest (sample) data
• Clean data
• Repeat
• Relevance testing
• Scale out
• Launch/Success

More Related Content

Viewers also liked

Coterie 9 11
Coterie 9 11Coterie 9 11
Coterie 9 11LaRue
 
Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13
Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13
Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13Marty Kaszubowski
 
O asis1 2[1]
O asis1 2[1]O asis1 2[1]
O asis1 2[1]tanica
 
20101023 ie9 cache
20101023 ie9 cache20101023 ie9 cache
20101023 ie9 cache彰 村地
 
Maroon5
Maroon5Maroon5
Maroon5tanica
 
Tennis
TennisTennis
Tennisaritz
 
All the lovers
All the loversAll the lovers
All the loverstanica
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineLucidworks (Archived)
 
Нестандартные методы интернет рекламы
Нестандартные методы интернет рекламыНестандартные методы интернет рекламы
Нестандартные методы интернет рекламыVladimir
 
IAMAS 2010 First presentation
IAMAS 2010 First presentationIAMAS 2010 First presentation
IAMAS 2010 First presentationocrock
 
Highly Relevant Search Result Ranking for Law Enforcement
Highly Relevant Search Result Ranking for Law EnforcementHighly Relevant Search Result Ranking for Law Enforcement
Highly Relevant Search Result Ranking for Law EnforcementLucidworks (Archived)
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Marty Kaszubowski
 
Descritores de linguagem
Descritores de linguagemDescritores de linguagem
Descritores de linguagemgindri
 
Network Forensics Puzzle Contest に挑戦 #1
Network Forensics Puzzle Contest に挑戦 #1Network Forensics Puzzle Contest に挑戦 #1
Network Forensics Puzzle Contest に挑戦 #1彰 村地
 
Hellosong
HellosongHellosong
Hellosongtanica
 
Searching The United States Code with Solr/Lucene
Searching The United States Code with Solr/LuceneSearching The United States Code with Solr/Lucene
Searching The United States Code with Solr/LuceneLucidworks (Archived)
 

Viewers also liked (18)

Coterie 9 11
Coterie 9 11Coterie 9 11
Coterie 9 11
 
Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13
Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13
Presentation to the Old Dominion University (ODU) MBA Association, 3/20/13
 
O asis1 2[1]
O asis1 2[1]O asis1 2[1]
O asis1 2[1]
 
What’s new in apache lucene 3.0
What’s new in apache lucene 3.0What’s new in apache lucene 3.0
What’s new in apache lucene 3.0
 
20101023 ie9 cache
20101023 ie9 cache20101023 ie9 cache
20101023 ie9 cache
 
Maroon5
Maroon5Maroon5
Maroon5
 
Tennis
TennisTennis
Tennis
 
All the lovers
All the loversAll the lovers
All the lovers
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Mains aux fleurs
Mains aux fleursMains aux fleurs
Mains aux fleurs
 
Нестандартные методы интернет рекламы
Нестандартные методы интернет рекламыНестандартные методы интернет рекламы
Нестандартные методы интернет рекламы
 
IAMAS 2010 First presentation
IAMAS 2010 First presentationIAMAS 2010 First presentation
IAMAS 2010 First presentation
 
Highly Relevant Search Result Ranking for Law Enforcement
Highly Relevant Search Result Ranking for Law EnforcementHighly Relevant Search Result Ranking for Law Enforcement
Highly Relevant Search Result Ranking for Law Enforcement
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
 
Descritores de linguagem
Descritores de linguagemDescritores de linguagem
Descritores de linguagem
 
Network Forensics Puzzle Contest に挑戦 #1
Network Forensics Puzzle Contest に挑戦 #1Network Forensics Puzzle Contest に挑戦 #1
Network Forensics Puzzle Contest に挑戦 #1
 
Hellosong
HellosongHellosong
Hellosong
 
Searching The United States Code with Solr/Lucene
Searching The United States Code with Solr/LuceneSearching The United States Code with Solr/Lucene
Searching The United States Code with Solr/Lucene
 

Similar to Solr At AOL, Presented by Sean Timm at SolrExchage DC

Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.orgrvguha
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Lucidworks
 
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012Amazon Web Services
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
How the Web of Data Will be Won
How the Web of Data Will be WonHow the Web of Data Will be Won
How the Web of Data Will be WonJeni Tennison
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesMichael Nelson
 
Demand, Media, and Search Analytics at AOL
Demand, Media, and Search Analytics at AOLDemand, Media, and Search Analytics at AOL
Demand, Media, and Search Analytics at AOLSean Timm
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesHeiko Paulheim
 
Webs of People, Webs of Data
Webs of People, Webs of DataWebs of People, Webs of Data
Webs of People, Webs of DataSimon Price
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Ontico
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web PagesMichael Nelson
 
Creating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrCreating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrBrooke Ganz
 
Lessons Learnt From Working With Rails
Lessons Learnt From Working With RailsLessons Learnt From Working With Rails
Lessons Learnt From Working With Railsmartinbtt
 
Halko_santafe_2015
Halko_santafe_2015Halko_santafe_2015
Halko_santafe_2015Nathan Halko
 
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?Peter Mika
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSujit Pal
 

Similar to Solr At AOL, Presented by Sean Timm at SolrExchage DC (20)

Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.org
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
 
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
How the Web of Data Will be Won
How the Web of Data Will be WonHow the Web of Data Will be Won
How the Web of Data Will be Won
 
Haifa
HaifaHaifa
Haifa
 
NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Imp...
NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Imp...NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Imp...
NISO Virtual Conference: The Semantic Web Coming of Age: Technologies and Imp...
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web Pages
 
Demand, Media, and Search Analytics at AOL
Demand, Media, and Search Analytics at AOLDemand, Media, and Search Analytics at AOL
Demand, Media, and Search Analytics at AOL
 
Gathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia EntitiesGathering Alternative Surface Forms for DBpedia Entities
Gathering Alternative Surface Forms for DBpedia Entities
 
Webs of People, Webs of Data
Webs of People, Webs of DataWebs of People, Webs of Data
Webs of People, Webs of Data
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages
 
Creating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache SolrCreating an Open Source Genealogical Search Engine with Apache Solr
Creating an Open Source Genealogical Search Engine with Apache Solr
 
Lessons Learnt From Working With Rails
Lessons Learnt From Working With RailsLessons Learnt From Working With Rails
Lessons Learnt From Working With Rails
 
Halko_santafe_2015
Halko_santafe_2015Halko_santafe_2015
Halko_santafe_2015
 
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming text
 

More from Lucidworks (Archived)

Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Lucidworks (Archived)
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and SolrLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchLucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCLucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCLucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKLucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarLucidworks (Archived)
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucidworks (Archived)
 
Seeley yonik solr performance key innovations
Seeley yonik   solr performance key innovationsSeeley yonik   solr performance key innovations
Seeley yonik solr performance key innovationsLucidworks (Archived)
 

More from Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
 
Seeley yonik solr performance key innovations
Seeley yonik   solr performance key innovationsSeeley yonik   solr performance key innovations
Seeley yonik solr performance key innovations
 

Recently uploaded

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 

Recently uploaded (20)

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 

Solr At AOL, Presented by Sean Timm at SolrExchage DC

  • 1.
  • 2. Solr and Lucene @ AOL SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING
  • 3. 1999 • Believe, Cher and Livin’ la Vida Loca, Ricky Martin • The Matrix and The Phantom Menace • Windows 98 Second Edition • AltaVista, Northern Light, Yahoo, ODP, Inktomi – Google • PPC Text search ads invented 1998 – Banner ads
  • 4. A Brief History of Search @ AOL • Acquired PLS in 1998 • AOL Search used ODP • Site Search • Local Search • Built into AOL Server • CPL – VSM then BM25 – Phrase, numeric, date, text, and proximity boosting – Conflation classes (like synonyms)
  • 5. Relevance • Precision/recall • “free alcohol” vs. “alcohol free” • Lawyer versus Attorney • Iron and ironic  same stem (Porter) • Beyonce vs. Beyoncé • Eagles –Bird, sports teams, band, AMC Eagle • F 15, F-15, F15 • FREAK Relevant Retrieved
  • 6. The Dawn of Solr • Prohibitively expensive to continue CPL development • Complicated deployment • 2005: Investigating migration to Lucene • 2006: CNET open sourced Solr
  • 7. Contributions • Local Lucene/Solr (superseded by SpatialSearch) • Query Timeout • Data Import Handler (DIH) • Numerous smaller patches • Committers: Noble Paul, Shalin Mangar, Patrick O’Leary
  • 8. Contributing to Solr/Lucene • Learn –Join the mailing lists •solr-user@lucene.apache.org •dev@lucene.apache.org –Read search and Solr related blogs –The #solr IRC channel on freenode
  • 9. Contributing to Solr/Lucene • Help others –Answer questions. –Improve documentation in the code, the wiki, or the website. –Make improvements to the Solr Admin UI.
  • 10. Contributing to Solr/Lucene • Confirm a bug • Submit a patch for a reported bug or feature request • Improve a patch • Try out a patch and see if it works
  • 11. Contributing to Solr/Lucene • Submit your own tickets – Bug – Feature request • Start with solr-user@lucene • Discuss on dev@lucene • Create Jira ticket, ideally with patches and unit tests • Yonik’s Law of Patches: – A half-baked patch in Jira, with no documentation, no tests, and no backwards compatibility is better than no patch at all.
  • 12. Applications • MapQuest (SpatialSearch) • Mail • AIM • AOL Search • Site Search • News Search • RUM • Sarah Palin e-mails (admin) • Demand • Wikipedia article pattern detection
  • 18. Related Searches Graph Page 18 “The Eagles” The band NFL Boston College Hotel California Tribute
  • 19. Related Searches • Simple query – User • New York Library – Solr query • Lower case • Prefer exact match “new york library” • Use phrase slop to allow terms in same order and near each other, e.g., new york city public library • primeQuery:“new york library” OR “new york library”~3
  • 20. Wikipedia Traffic Correlation Schema <field name="title" type="string" indexed="true" stored="true" required="true" /> <field name="title_norm" type="string" indexed="true" stored="true" required="true" /> <field name="total_pvs" type="long" indexed="true" stored="true" required="true" /> <!-- Dynamic field definitions. If a field name is not found, dynamicFields will be used if the name matches any of the patterns. RESTRICTION: the glob-like pattern in the name attribute must have a "*" only at the start or the end. EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i) Longer patterns will be matched first. if equal size patterns both match, the first appearing in the schema will be used. --> <!-- trend direction. field name contains date string, e.g., "trend_20110622" --> <dynamicField name="trend_*" type="int" indexed="true" stored="true"/> <!-- page views. field name contains date string, e.g., "pvs_20110622" --> <dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
  • 21. Temporal Traffic Correlation of Wikipedia Page Views
  • 22. Sarah Palin E-mail Stats • 13,177 documents • 4 hours from receiving data to production install • ~150 K requests per day at launch • Now about 6-7 K requests per day • Running on 3 VMs in two different data centers behind a NetScaler
  • 23.
  • 25. Huffington Post Comments • Solr 4 • Uses Solr Cloud • Single shard • ReplicationFactor 3 • Real-time • 90 days of comments • Tested up to 100 writes / second
  • 26. More HuffPost comments • Used by editors and moderators –Topic investigation –Troll detection • Config –Special features: search for emoticons, prefer exact match, date boosting • Hack-a-thon comment clustering, timeline, and summarization
  • 27. Solr Comments Architecture Message Queue MongoDB Mongo Ingestor Solr Ingestor Solr Cloud Uses SolrJ CloudSolrServer Tools Server JuLiA
  • 28. Relevance in Solr • “free alcohol” vs. “alcohol free” –Phrase queries and phrase slop • Lawyer versus Attorney –SynonymFilterFactory • Iron and ironic –Kstem, or Lemmatization via the SynonymFilterFactory instead of Snowball/Porter
  • 29. Relevance in Solr • Beyonce vs. Beyoncé –Various Folding Filters • Eagles –Boost on other fields, such as popularity, publish date –Use related searches, facets, or clustering • F 15, F-15, F15 –WordDelimiterFilter
  • 30. Bringing a New Search Project Online • Understand the domain • Ingest (sample) data • Clean data • Repeat • Relevance testing • Scale out • Launch/Success