Searching text is what Solr is known for, but did you know that many companies receive an equal or greater business impact from implementing a recommendation engine alongside their text search capabilities? With a few tweaks, Solr (or Lucene) can also serve as a full-featured recommendation engine. Machine learning libraries like Apache Mahout provide excellent behavior-based, offline recommendation algorithms, but what if you want more control? This talk will demonstrate how to effectively utilize Solr to perform collaborative filtering (users who liked this also liked…), categorical classification and subsequent hierarchy-based recommendations, as well as related-concept extraction and concept-based recommendations. Sound difficult? It's not. Come learn step by step how to create a powerful real-time recommendation engine using Apache Solr, and see real-world examples of these strategies in action.
3. My Background
Trey Grainger
• Manager, Search Technology Development @ CareerBuilder.com
Relevant Background
• Search & Recommendations
• High-volume, N-tier Architectures
• NLP, Relevancy Tuning, User Group Testing, & Machine Learning
Fun Side Projects
• Founder and Chief Engineer @ .com
• Currently co-authoring the Solr in Action book… keep your eyes out for the early access release from Manning Publications
4. About Search @CareerBuilder
• Over 1 million new jobs each month
• Over 45 million actively searchable resumes
• ~250 globally distributed search servers (in the U.S., Europe, & Asia)
• Thousands of unique, dynamically generated indexes
• Hundreds of millions of search documents
• Over 1 million searches an hour
6. Redefining “Search Engine”
• “Lucene is a high-performance, full-featured text search engine library…”
Yes, but really…
• Lucene is a high-performance, fully-featured token matching and scoring library… which can perform full-text searching.
7. Redefining “Search Engine”
or, in machine learning speak:
• A Lucene index is a multi-dimensional sparse matrix… with very fast and powerful lookup capabilities.
• Think of each field as a matrix mapping each term to the documents containing it.
8. The Lucene Inverted Index (traditional text example)

What you SEND to Lucene/Solr:

Document | Content
doc1     | once upon a time, in a land far, far away
doc2     | the cow jumped over the moon.
doc3     | the quick brown fox jumped over the lazy dog.
doc4     | the cat in the hat
doc5     | The brown cow said “moo” once.
…        | …

How the content is INDEXED into Lucene/Solr (conceptually):

Term  | Documents
a     | doc1 [2x]
brown | doc3 [1x], doc5 [1x]
cat   | doc4 [1x]
cow   | doc2 [1x], doc5 [1x]
once  | doc1 [1x], doc5 [1x]
over  | doc2 [1x], doc3 [1x]
the   | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…     | …
9. Match Text Queries to Text Fields
/solr/select/?q=jobcontent:(software engineer)

Job Content Field:

Term       | Documents
…          | …
engineer   | doc1, doc3, doc4, doc5
mechanical | doc2, doc4, doc6
software   | doc1, doc3, doc4, doc7, doc8
…          | …

(Venn diagram: “engineer” alone matches doc5; “software” alone matches doc7, doc8; the “software engineer” overlap matches doc1, doc3, doc4)
10. Beyond Text Searching
• Lucene/Solr is, at its core, a token matching and scoring engine
• When Lucene/Solr searches text, it matches tokens in the query against tokens in the index
• Anything that can be searched upon can form the basis of matching and scoring:
  – text, attributes, locations, results of functions, user behavior, classifications, etc.
11. Business Case for Recommendations
• For companies like CareerBuilder, recommendations can provide as much or even greater business value (e.g. views, sales, job applications) than user-driven search capabilities.
• Recommendations create stickiness that pulls users back to your company’s website, app, etc.
• What are recommendations? … searches for content relevant to a user, run on the user’s behalf
12. Approaches to Recommendations
• Content-based
  – Attribute-based
    • e.g. income level, hobbies, location, experience
  – Hierarchical
    • e.g. “medical//nursing//oncology”, “animal//dog//terrier”
  – Textual Similarity
    • e.g. Solr’s MoreLikeThis Request Handler & Search Handler
  – Concept-based
    • e.g. Solr => “software engineer”, “java”, “search”, “open source”
• Behavior-based
  – Collaborative Filtering: “Users who liked that also liked this…”
• Hybrid Approaches
14. Attribute-based Recommendations
• Example: Match User Attributes to Item Attribute Fields
Janes_Profile:{
  Industry:"healthcare",
  Locations:"Boston, MA",
  JobTitle:"Nurse Educator",
  Salary:{ min:40000, max:60000 },
}

/solr/select/?q=(jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10)
  AND ((city:"Boston" AND state:"MA")^15 OR state:"MA")
  AND _val_:"map(salary,40000,60000,10,0)"

// By mapping the importance of each attribute to weights based upon your
// business domain, you can easily find results which match your customer's
// profile without the user having to initiate a search.
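As a minimal sketch of how such a query might be assembled from a stored profile (the Solr URL, field names, and the build_attribute_query helper are illustrative assumptions, not from the talk):

import requests

def build_attribute_query(profile):
    # Weight job title highest (exact phrase over loose terms), then location
    # (city+state over state alone), then salary fit, per the weights above.
    return " AND ".join([
        '(jobtitle:"{t}"^25 OR jobtitle:({t})^10)'.format(t=profile["JobTitle"]),
        '((city:"{c}" AND state:"{s}")^15 OR state:"{s}")'.format(
            c=profile["City"], s=profile["State"]),
        '_val_:"map(salary,{lo},{hi},10,0)"'.format(
            lo=profile["SalaryMin"], hi=profile["SalaryMax"]),
    ])

profile = {"JobTitle": "nurse educator", "City": "Boston", "State": "MA",
           "SalaryMin": 40000, "SalaryMax": 60000}
results = requests.get("http://localhost:8983/solr/select",
                       params={"q": build_attribute_query(profile), "wt": "json"}).json()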
15. Hierarchical Recommendations
• Example: Match User Attributes to Item Attribute Fields
Janes_Profile:{
  MostLikelyCategory:"healthcare//nursing//oncology",
  2ndMostLikelyCategory:"healthcare//nursing//transplant",
  3rdMostLikelyCategory:"educator//postsecondary//nursing", …
}

/solr/select/?q=category:(
  ("healthcare.nursing.oncology"^40
    OR "healthcare.nursing"^20
    OR "healthcare"^10)
  OR
  ("healthcare.nursing.transplant"^20
    OR "healthcare.nursing"^10
    OR "healthcare"^5)
  OR
  ("educator.postsecondary.nursing"^10
    OR "educator.postsecondary"^5
    OR "educator"))
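These decaying boosts can be generated rather than hand-written. A rough sketch, assuming a "."-separated category field and a simple halve-the-boost-per-level scheme (the helper name and scheme are illustrative):

def hierarchy_clause(category_path, top_boost):
    # "healthcare//nursing//oncology" with boost 40 yields:
    # ("healthcare.nursing.oncology"^40 OR "healthcare.nursing"^20 OR "healthcare"^10)
    parts = category_path.split("//")
    clauses = []
    boost = top_boost
    while parts:
        clauses.append('"{0}"^{1}'.format(".".join(parts), boost))
        parts.pop()                 # walk one level up the hierarchy
        boost = max(boost // 2, 1)  # roughly halve the boost per level
    return "(" + " OR ".join(clauses) + ")"

q = "category:(" + " OR ".join([
    hierarchy_clause("healthcare//nursing//oncology", 40),
    hierarchy_clause("healthcare//nursing//transplant", 20),
    hierarchy_clause("educator//postsecondary//nursing", 10),
]) + ")"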
16. Textual Similarity-based Recommendations
• Solr’s MoreLikeThis Request Handler / Search Handler are a good example of this.
• Essentially, “important keywords” are extracted from one or more documents and turned into a search.
• The result is a secondary set of search results that are textually similar to the original document(s).
• See http://wiki.apache.org/solr/MoreLikeThis for example usage
• Currently no distributed search support (but a patch is available)
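For instance, a MoreLikeThis request against a hypothetical job index might look like this (the handler path and field names are assumptions; the mlt.* parameters are standard):

import requests

params = {
    "q": "id:12345",                  # the source document to recommend from
    "mlt.fl": "jobtitle,jobcontent",  # fields to mine for "important keywords"
    "mlt.mintf": 2,                   # min term frequency in the source doc
    "mlt.mindf": 5,                   # min document frequency across the index
    "rows": 10,
    "wt": "json",
}
similar = requests.get("http://localhost:8983/solr/mlt", params=params).json()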
17. Concept-based Recommendations
Approaches:
1) Create a taxonomy/dictionary to define your concepts, and then either:
   a) manually tag documents as they come in
      // Very hard to scale… see Amazon Mechanical Turk if you must do this
   or
   b) create a classification system which automatically tags content as it comes in (supervised machine learning)
      // See Apache Mahout
2) Use an unsupervised machine learning algorithm to cluster documents and dynamically discover concepts (no dictionary required).
   // This is already built into Solr using Carrot2!
20. Clustering Search in Solr
• /solr/clustering/?q=content:nursing
&rows=100
&carrot.title=titlefield
&carrot.snippet=titlefield
&LingoClusteringAlgorithm.desiredClusterCountBase=25
&group=false //clustering & grouping don’t currently play nicely
• Allows you to dynamically identify “concepts” and their prevalence within a user’s top search results
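A sketch of pulling those concepts out programmatically, assuming the clustering component's JSON response exposes a top-level clusters list with labels and docs (the URL and field names are illustrative):

import requests

params = {
    "q": "content:nursing", "rows": 100,
    "carrot.title": "titlefield", "carrot.snippet": "titlefield",
    "LingoClusteringAlgorithm.desiredClusterCountBase": 25,
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/clustering", params=params).json()

# Use each cluster's size as a rough prevalence weight for its concept label.
concepts = [(c["labels"][0], len(c["docs"])) for c in response["clusters"]]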
23. Example Concept-based Recommendation
Stage 1: Identify Concepts

Original Query: q=(solr or lucene)
// can be a user's search, their job title, a list of skills,
// or any other keyword-rich data source

Clusters Identified:
Developer (22), Java Developer (13), Software (10), Senior Java Developer (9),
Architect (6), Software Engineer (6), Web Developer (5), Search (3),
Software Developer (3), Systems (3), Administrator (2), Hadoop Engineer (2),
Java J2EE (2), Search Development (2), Software Architect (2), Solutions Architect (2)

Facets Identified (occupation):
Computer Software Engineers
Web Developers
...
24. Example Concept-based Recommendation
Stage 2: Run Recommendations Search

q=content:("Developer"^22 OR "Java Developer"^13 OR "Software"^10
  OR "Senior Java Developer"^9 OR "Architect"^6 OR "Software Engineer"^6
  OR "Web Developer"^5 OR "Search"^3 OR "Software Developer"^3
  OR "Systems"^3 OR "Administrator"^2 OR "Hadoop Engineer"^2
  OR "Java J2EE"^2 OR "Search Development"^2 OR "Software Architect"^2
  OR "Solutions Architect"^2)
AND occupation:("Computer Software Engineers" OR "Web Developers")

// You can also add the user's location or the original keywords to the
// recommendations search if it helps results quality for your use case.
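Continuing the clustering sketch above, stage 2 might be glued together like this (the field names follow the example; the helper is illustrative):

import requests

def concept_query(concepts, occupations):
    # concepts: [(label, weight), ...] from stage 1; occupations: facet values.
    content = " OR ".join('"{0}"^{1}'.format(label, w) for label, w in concepts)
    occ = " OR ".join('"{0}"'.format(o) for o in occupations)
    return "content:({0}) AND occupation:({1})".format(content, occ)

concepts = [("Developer", 22), ("Java Developer", 13), ("Software", 10)]  # etc.
q = concept_query(concepts, ["Computer Software Engineers", "Web Developers"])
recs = requests.get("http://localhost:8983/solr/select",
                    params={"q": q, "wt": "json"}).json()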
27. Geography and Recommendations
• Filtering or boosting results based upon geographical area or distance can help greatly for certain use cases:
  – Jobs/Resumes, Tickets/Concerts, Restaurants
• For other use cases, location sensitivity is nearly worthless:
  – Books, Songs, Movies

/solr/select/?q=(Standard Recommendation Query) AND
  _val_:"recip(geodist(location,40.7142,-74.0064),1,1,0)"

// There are dozens of well-documented ways to search/filter/sort/boost
// on geography in Solr… this is just one example.
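One illustrative alternative (an assumption, not from the talk) is to hard-filter by distance instead of boosting, using Solr's geofilt parser, where d is the radius in kilometers:

/solr/select/?q=(Standard Recommendation Query)&fq={!geofilt sfield=location pt=40.7142,-74.0064 d=80}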
29. The Lucene Inverted Index (user behavior example)

What you SEND to Lucene/Solr:

Document | “Users who bought this product” Field
doc1     | user1, user4, user5
doc2     | user2, user3
doc3     | user4
doc4     | user4, user5
doc5     | user4, user1
…        | …

How the content is INDEXED into Lucene/Solr (conceptually):

Term  | Documents
user1 | doc1, doc5
user2 | doc2
user3 | doc2
user4 | doc1, doc3, doc4, doc5
user5 | doc1, doc4
…     | …
30. Collaborative Filtering
• Step 1: Find similar users who like the same documents

q=documentid:("doc1" OR "doc4")

Document | “Users who bought this product” Field
doc1     | user1, user4, user5
doc2     | user2, user3
doc3     | user4
doc4     | user4, user5
doc5     | user4, user1
…        | …

(Venn diagram: user4 and user5 like both doc1 and doc4; user1 likes only doc1)

Top Scoring Results (Most Similar Users):
1) user5 (2 shared likes)
2) user4 (2 shared likes)
3) user1 (1 shared like)
31. Collaborative Filtering
• Step 2: Search for docs “liked” by those similar users

Most Similar Users:
1) user5 (2 shared likes)
2) user4 (2 shared likes)
3) user1 (1 shared like)

/solr/select/?q=userlikes:("user5"^2 OR "user4"^2 OR "user1"^1)

Term  | Documents
user1 | doc1, doc5
user2 | doc2
user3 | doc2
user4 | doc1, doc3, doc4, doc5
user5 | doc1, doc4
…     | …

Top Recommended Documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)
// doc2 does not match
// the above example ignores idf calculations
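A rough two-step sketch of this collaborative filtering flow over HTTP (the two core URLs, field names, and response handling are illustrative assumptions):

import requests

# Hypothetical layout: a "users" core where each user profile lists liked docs
# in a documentid field, and an "items" core with a userlikes field.
USERS = "http://localhost:8983/solr/users/select"
ITEMS = "http://localhost:8983/solr/items/select"

# Step 1: users sharing the most likes with this user score highest.
resp = requests.get(USERS, params={"q": 'documentid:("doc1" OR "doc4")',
                                   "rows": 3, "fl": "id,score", "wt": "json"}).json()
similar = [(d["id"], d["score"]) for d in resp["response"]["docs"]]

# Step 2: boost each similar user by their similarity score when searching items.
q = "userlikes:(" + " OR ".join('"{0}"^{1}'.format(u, s) for u, s in similar) + ")"
recs = requests.get(ITEMS, params={"q": q, "rows": 10, "wt": "json"}).json()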
32. Lots of Variations
• User(s) → Item(s)
• User → Item(s) → Users
• Item → Users → Item(s)
• etc.

(illustrative user × item matrix: each X marks an item a given user liked)

Note: Just because this example uses “users” doesn’t mean you have to.
You can map any entity to any other related entity and achieve a similar result, as in the sketch below.
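For instance, the Item → Users → Item(s) variation only changes the starting query in the two-step sketch above (still illustrative):

# Start step 1 from one item's audience rather than one user's likes,
# then run step 2 unchanged to find items with overlapping audiences.
q1 = 'documentid:"doc1"'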
33. Comparison with Mahout
• Recommendations are much easier for us to perform in Solr:
  – Data is already present and up-to-date
  – Doesn’t require writing significant code to make changes (just changing queries)
  – Recommendations are real-time, as opposed to asynchronously processed offline
  – Allows easy utilization of any content and available functions to boost results
• Our initial tests show our collaborative filtering approach in Solr significantly outperforms our Mahout tests in terms of results quality
  – Note: We believe that some portion of the quality issues we have with the Mahout implementation have to do with staleness of data, due to the frequency with which our data is updated.
• Our general takeaway:
  – We believe that Mahout might be able to return better matches than Solr with a lot of custom work, but it does not perform better for us out of the box.
• Because we already scale…
  – Since we already have all of our data indexed in Solr (tens to hundreds of millions of documents), there’s no need for us to rebuild a sparse matrix in Hadoop (your needs may be different).
35. Hybrid Approaches
• Not much to say here; I think you get the point.

/solr/select/?q=(category:("healthcare.nursing.oncology"^10
  OR "healthcare.nursing"^5 OR "healthcare")
  OR title:"Nurse Educator"^15)
  AND _val_:"map(salary,40000,60000,10,0)"^5
  AND _val_:"recip(geodist(location,40.7142,-74.0064),1,1,0)"

• Combining multiple approaches generally yields better overall results if done intelligently. Experimentation is key here.
38. Custom Scoring with Payloads
• In addition to boosting search terms and fields, content within the same field can also be boosted differently using Payloads (requires a custom scoring implementation):

Content Field:
design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten [3] / years [3] /
experience [3] / careerbuilder [2] / design [2], …

Payload Bucket Mappings:
jobtitle: bucket=[1] boost=10; company: bucket=[2] boost=4;
jobdescription: bucket=[ ] boost=1; experience: bucket=[3] boost=1.5

We can pass in a parameter to Solr at query time specifying the boost to apply to each bucket, e.g. …&bucketWeights=1:10;2:4;3:1.5;default:1;

• This allows us to map many relevancy buckets to search terms at index time and adjust the weighting at query time, without having to search across hundreds of fields.
• By making all scoring parameters overridable at query time, we are able to do A/B testing to consistently improve our relevancy model.
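The indexing side of this is standard Solr; a fieldType along these lines (similar to the payloads example in Solr's sample schema.xml) attaches a payload after each | delimiter, while the bucket-to-boost scoring described above still requires the custom implementation the slide mentions:

<fieldtype name="payloads" class="solr.TextField" indexed="true" stored="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index tokens like "engineer|1.0", storing 1.0 as the token's payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldtype>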
39. Measuring Results Quality
• A/B testing is key to understanding our search results quality.
• Users are randomly divided between equal groups
• Each group experiences a different algorithm for the duration of the test
• We can measure the “performance” of each algorithm based upon changes in user behavior:
  – For us, more job applications = more relevant results
  – For other companies, that might translate into products purchased, additional friends requested, or non-search pages viewed
• We use this to test both keyword search results and recommendations quality
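A minimal sketch of such a deterministic user split (illustrative, not CareerBuilder's implementation):

import hashlib

def ab_group(user_id, n_groups=2):
    # Hash so the same user always sees the same algorithm for the whole test,
    # independent of the order in which traffic arrives.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_groups

algorithm = ("baseline", "new_recommender")[ab_group("user42")]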
41. Understanding Our Users
• Machine learning algorithms can help us understand what matters most to different groups of users.

Example: Willingness to relocate for a job (miles, per percentile of users)
(chart: relocation distance from 0 to 2,500 miles plotted across percentiles from 1% to 95%, with one line each for “Title Examiners, Abstractors, and Searchers”, “Software Developers, Systems Software”, and “Food Preparation Workers”)
42. Key Takeaways
• Recommendations can be as valuable as keyword search, or even more so.
• If your data fits in Solr, then you have everything you need to build an industry-leading recommendation system.
• Even a single keyword can be enough to begin making meaningful recommendations. Build up intelligently from there.
43. Contact Info
Trey Grainger
trey.grainger@careerbuilder.com
http://www.careerbuilder.com
@treygrainger
And yes, we are hiring – come chat with me if you are interested.