Smarter Search With Spark-Solr: Search gets smarter when you know more about your documents and their relationships to each other (think: PageRank) and to users (i.e. popularity), in addition to what you already know about their content (text search). It also gets smarter when you know more about your users (personalization): both their affinity for certain kinds of content and their similarity to each other (collaborative-filtering recommenders).
Building all of these pieces typically requires a big mix of batch workloads for log processing, plus training machine-learned models to use during realtime querying. The details are highly domain-specific, but many of the techniques are fairly universal. We will discuss how Spark can interface with a SolrCloud cluster to efficiently perform many of the pieces of this puzzle in one relatively self-contained package (no HDFS/S3; all data stored in Solr!), and introduce “spark-solr” – an open-source JVM library that facilitates this.
Practical Data Science Workbench with Spark and Solr
1. A Practical Data Science Workbench: spark-solr
Jake Mannix
@pbrane
Lead Data Engineer, Lucidworks
2. $ whoami
Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D
Previously:
• Allen Institute for AI: Semantic Search on academic research publications
• Twitter: account search, user interest modeling, content recommendations
• LinkedIn: profile search, generic entity-to-entity recommender systems
Prehistory:
• other software companies, algebraic topology, particle cosmology
4. • What is the “Minimum Viable Big Data Science Toolkit”?
• DB? Distributed FS? NoSQL store?
• ML libraries / frameworks (scripting? notebook? REPL?)
• text analysis or graph libraries?
• dataviz package?
• hosting layer (for models and/or POC apps)?
Cold Start
5. • Spark and Solr for Data Engineering
• Why Solr?
• Why Spark?
• Example rapid turnaround workflow: Searchhub
• data exploration
• clustering: unsupervised ML
• classification: supervised ML
• recommenders: collaborative filtering + content-based + “mixed-mode”
Overview
6. Practical Data Science with Spark and Solr
Why does Solr need Spark?
Why does Spark need Solr?
7. Why does Spark need Solr?
A typical Hadoop / Spark data-engineering task starts with some data on HDFS:
$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 63043884 Feb 4 18:22 part-00001.lzo
-rw-r--r-- 1 jake staff 79770856 Feb 4 18:22 part-00002.lzo
-rw-r--r-- 1 jake staff 72108179 Feb 4 18:22 part-00003.lzo
-rw-r--r-- 1 jake staff 12150481 Feb 4 18:22 part-00004.lzo
Now what? What’s in these files?
8. Solr gives you:
• random access data store
• full-text search
• fast aggregate statistics
• just starting out: no HDFS / S3 necessary!
• world-class multilingual text analytics:
• no more: tokens = str.toLowerCase().split("\\s+")
• relevancy / ranking
• realtime REST service layer / web console
9. Solr Key Features:
• Full text search (Info. Retr.)
• Facets / guided nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and joins
• Streaming parallel SQL
• Stats, expressions, transformations and more
• Language detection
• Extensible
• Massive scale / fault tolerance
10. Why Spark for Solr?
• spark-shell: a Big Data REPL with all your fave JVM libs!
• Build the index in parallel very, very quickly
• Aggregations
• Boosts, stats, iterative global computations
• Offline compute to update index with additional info (e.g. PageRank, popularity)
• Whole corpus analytics and ML: clustering, classification, CF, rankers
• General-purpose distributed computation
• Joins with other storage (Cassandra, HDFS, DB, HBase)
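For instance, indexing from the REPL (a minimal sketch: the spark-solr version, zkhost, and collection name are illustrative, and mailDF stands in for a DataFrame of parsed mail messages):

// spark-shell --packages com.lucidworks.spark:spark-solr:3.9.0   (version illustrative)
// mailDF: a DataFrame of parsed mail messages, one record per message
mailDF.write
  .format("solr")
  .option("zkhost", "localhost:9983")        // ZooKeeper connect string for the SolrCloud cluster
  .option("collection", "lucene-solr-user")  // target collection (illustrative name)
  .save()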
11. Why do data engineering with Solr and Spark?
Solr:
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in
Spark:
• General-purpose batch/streaming compute engine
• Whole collection analysis!
• Fast, large-scale iterative algorithms
• Advanced machine learning: MLlib, Mahout, DeepLearning4j
• Lots of integrations with other big data systems
and together: http://github.com/lucidworks/spark-solr
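In a project build, that’s one dependency (the version below is illustrative; check the repo above for current releases):

// build.sbt
libraryDependencies += "com.lucidworks.spark" % "spark-solr" % "3.9.0"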
12. • Free data! ASF mailing-list archives + GitHub + JIRA
• https://github.com/lucidworks/searchhub
• Index it into Solr
• Explore a bit deeper: unsupervised Spark ML
• Exploit labels: predictive analytics
• Build a recommender, mix & match with search
Example workflow: Searchhub
13. • Initial exploration of ASF mailing-list archives
• index into Solr: just need to turn your records into json
• facet:
• fields with low cardinality or with sensible ranges
• document size histogram
• projects, authors, dates
• find: broken fields, automated content, expected data missing, errors
• now: load into a Spark RDD via SolrRDD (sketch below):
Searchhub: Initial Exploration
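Something like this sketch of the DataFrame flavor of the reader (SolrRDD underlies it; constructor signatures have varied across spark-solr versions, and the zkhost, collection, and field names here are illustrative):

val opts = Map(
  "zkhost"     -> "localhost:9983",
  "collection" -> "lucene-solr-user",
  "query"      -> "*:*",                            // any Solr query can pre-filter here
  "fields"     -> "id,subject_s,author_s,body_txt"  // illustrative field names
)
val mailDF = spark.read.format("solr").options(opts).load()
mailDF.printSchema()      // see what actually came back
val mailRDD = mailDF.rdd  // drop down to an RDD when needed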
14. • try other text analyzers (no more str.split("\\W+")!)
Smarter Text Analysis in Spark
ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
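A sketch based on that post (the exact schema keys and filter names may differ by version; body_txt is an illustrative field): you hand LuceneTextAnalyzer a JSON analysis schema naming a tokenizer plus zero or more token filters, then wrap it in a UDF:

import com.lucidworks.spark.analysis.LuceneTextAnalyzer
import org.apache.spark.sql.functions.udf

// a German-stemming analyzer in a few lines of Scala + JSON
val analysisSchema = """{
  "analyzers": [{
    "name": "german",
    "tokenizer": { "type": "standard" },
    "filters": [{ "type": "lowercase" }, { "type": "germanlightstem" }]
  }],
  "fields": [{ "regex": ".+", "analyzer": "german" }]
}"""
val analyzer = new LuceneTextAnalyzer(analysisSchema)

val tokenize = udf { text: String => analyzer.analyze("body_txt", text) }
val tokensDF = mailDF.withColumn("tokens", tokenize(mailDF("body_txt")))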
15. • Unsupervised machine learning with MLlib or Mahout:
• clustering documents with KMeans
• extract topics with Latent Dirichlet Allocation
• learn word vectors with Word2Vec
• Write the results back to Solr (sketch below):
Searchhub: Exploratory Data Science
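A sketch of one clustering pass (spark.ml flavor; k, vocab size, field names, and the collection are illustrative), ending with the write-back the last bullet promises:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// tokensDF: id + tokens, from the text-analysis step above
val tfDF = new CountVectorizer().setInputCol("tokens").setOutputCol("tf")
  .setVocabSize(50000).fit(tokensDF).transform(tokensDF)
val tfidfDF = new IDF().setInputCol("tf").setOutputCol("features")
  .fit(tfDF).transform(tfDF)

val clustered = new KMeans().setK(20).setFeaturesCol("features")
  .setPredictionCol("cluster_i").fit(tfidfDF).transform(tfidfDF)

// write cluster assignments back as a field on each doc
// (update-vs-replace semantics depend on spark-solr version/options)
clustered.select("id", "cluster_i").write.format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "searchhub")
  .save()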
16. • can also do something more like real Data Science:
Searchhub Classification: “Many Newsgroups”
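A sketch of the supervised version (field names are illustrative; the label is which project’s list a message came from):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{CountVectorizer, StringIndexer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val label = new StringIndexer().setInputCol("project_s").setOutputCol("label")
val tf = new CountVectorizer().setInputCol("tokens").setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(label, tf, lr))

// grid search + cross-validation: the step that outgrows a laptop
val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()
val cvModel = new CrossValidator().setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(grid).setNumFolds(3)
  .fit(tokensDF)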
18. • Recommender Systems
• content-based:
• mail-thread as “item”, head msgs grouped by replier as “user” profile
• search query of users against items to recommend
• collaborative-filtering:
• users replying to a head msg “rate” it positively
• train a Spark ML ALS RecSys model (sketched below)
• both can generate item-item similarity models
Spark+Solr RecSys
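The CF half as a spark.ml sketch (column names and hyperparameters are illustrative; ALS wants integer user/item ids):

import org.apache.spark.ml.recommendation.ALS

// repliesDF: (replier_id: Int, thread_id: Int, num_replies: Float) -- implicit "ratings"
val als = new ALS()
  .setUserCol("replier_id").setItemCol("thread_id").setRatingCol("num_replies")
  .setImplicitPrefs(true)  // reply counts are implicit feedback, not explicit ratings
  .setRank(50)
val alsModel = als.fit(repliesDF)
val topK = alsModel.recommendForAllUsers(10)  // top-10 threads per replier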
19. • With the top-K closest items by both CF and content:
• store them back into a Solr collection!
• fetch your (or a generic user’s) recent items
• query with them:
• q=(cf:123^1.1 cf:39^2.3 cf:93^0.7)^alpha (ct:912^2.9 ct:123^1.8 ct:99^2.2)^(1-alpha)
Experimenting with mixed-mode Recommenders
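Assembling that query is just string-building (a sketch; cf/ct are illustrative fields holding each item’s CF and content-based neighbor ids):

// neighbors: (itemId, weight) pairs from the CF and content-based similarity models
def mixedQuery(cf: Seq[(String, Double)], ct: Seq[(String, Double)], alpha: Double): String = {
  val cfClause = cf.map { case (id, w) => s"cf:$id^$w" }.mkString(" ")
  val ctClause = ct.map { case (id, w) => s"ct:$id^$w" }.mkString(" ")
  s"($cfClause)^$alpha ($ctClause)^${1.0 - alpha}"
}
// alpha = 1.0 => pure collaborative filtering; alpha = 0.0 => pure content-based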
I live at the intersection of information retrieval and machine learning, and the scalable, distributed systems engineering that enables them
Wake-up call: you won’t always be at the company you’re at now, with its Ops and Data Infra teams (or maybe even now you’re consulting, or at a tiny startup). Imagine you jumped into a new Data Lake…
In transitioning away from places like Twitter and LinkedIn, I find myself wondering what the minimum viable toolkit really is. It’s going to depend: are you just doing analysis, or building prototypes, or full-fledged POCs / demos?
Who in the audience has used Solr (and: in prod?), how about Spark? (and writes Scala?)
Solr is a fantastically scalable, production-grade search engine
Spark is a high-performance, flexible distributed computation engine
Download one? The files here aren’t big, but sometimes each chunk *is* big. These are moderately human-readable (after decompressing), so you could “hdfs cat” them, but often the data is binary: Parquet/Avro/Thrift/protobufs. Maybe you can “hdfs cat” those too… but you want to explore, tinker, and see what ugly rough edges there are.
But what if instead of storing them in HDFS, you indexed them into Solr?
AND: no HDFS means no NameNode SPOF, no need for an HA NameNode, and none of the Ops that go with it.
data types: Dates, numeric types, etc.
in use in production (hundreds to thousands of nodes) at most of the Fortune 500
Next up: OK, so Solr can help Spark. But if Solr is so great, why Spark at all?
y’all know all this, but the TL;DR is that Spark handles the bulk computation and gives you a global view of your data set.
preview: what about that string splitting? We’ve got Lucene now
LuceneTextAnalyzer: specify a name, a tokenizer, and 0 or more token filters
basically, in 3 lines of Scala+JSON, you get a *fast* German-stemming n-gram tokenizer as a DataFrame UDF
once you have tokenization for initial featurization…
similarly for the LDA topics.
could persist the Word2Vec model to disk and load during query time for query expansion
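e.g., a spark.ml sketch (paths and column names illustrative):

import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}

val w2vModel = new Word2Vec().setInputCol("tokens").setOutputCol("w2v")
  .setVectorSize(100).fit(tokensDF)
w2vModel.save("/models/searchhub-w2v")                    // persist offline...

val loaded = Word2VecModel.load("/models/searchhub-w2v")  // ...reload at query time
loaded.findSynonyms("indexing", 5).show()                 // candidate expansion terms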
Next up: just because Spark + Solr gets you “quick and dirty” data science doesn’t mean you can’t also do it more seriously in this setup, once you’re all situated:
this model may be applied to data you have that doesn’t have labels
it may also be serialized, and loaded up into a service layer (plug for Fusion)
(this cross-validation is where you suddenly realize you’ve been playing with Spark+Solr on your laptop, and it’s not going to finish any time soon…)
Remember: index these modified DataFrames back to Solr and take a peek at them!
(Next up, to wash your brains from a mind-numbing scala slide, we have…)
a blast from the past: an artist’s rendition of matrix decomposition, which I call “SVD on canvas with acrylics”
this query is looking for items whose CF neighbors contain one of these 3 items, or whose content-based neighbors contain those other two, with linearly weighted scoring
now you have a simple slider interpolating between CF and Content recs
(A/B test, or learn the weight by engagement, etc…)