13. TF*IDF
• Term Frequency: “How well a term describes a document”
• Measure: how often a term occurs per document
• Inverse Document Frequency: “How important is a term overall”
• Measure: how rare the term is across all documents
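As a rough illustration of how the two measures combine, the sketch below multiplies a damped term count by a log-scaled rarity factor. The specific weighting (square-root term frequency, `1 + log(numDocs / (docFreq + 1))` for rarity), the function name `tf_idf`, and the toy corpus are assumptions chosen for illustration; real engines differ in the details.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term against one document.

    corpus is a non-empty list of token lists (assumed, illustrative input).
    """
    # Term frequency: how often the term occurs in this document (damped).
    tf = math.sqrt(Counter(doc_tokens)[term])
    # Document frequency: in how many documents the term occurs at all.
    doc_freq = sum(1 for d in corpus if term in d)
    # Inverse document frequency: rarer terms get a larger weight.
    idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))
    return tf * idf

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["ipad", "case", "leather"],
    ["ipad", "charger"],
    ["phone", "case", "case"],
    ["laptop", "sleeve"],
]

rare = tf_idf("leather", corpus[0], corpus)   # occurs in 1 document
common = tf_idf("ipad", corpus[0], corpus)    # occurs in 2 documents
```

In this toy corpus the rarer term ("leather") scores higher than the more common one ("ipad") for the same document, and a term absent from the document scores zero.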
14. BM25 (aka Okapi)
$$\mathrm{Score}(q, d) = \sum_{t \text{ in } q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \text{ in } d) \cdot (k + 1)}{\mathrm{tf}(t \text{ in } d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$
Where:
t = term; d = document; q = query; i = index
$$\mathrm{tf}(t \text{ in } d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$$
$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$
$$|d| = \sum_{t \text{ in } d} 1$$
$$\mathrm{avgdl} = \frac{\sum_{d \text{ in } i} |d|}{\sum_{d \text{ in } i} 1}$$
k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.
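The scoring formula on this slide can be sketched in a few lines of Python. This is a minimal illustration that follows the slide's definitions of tf, idf, |d| and avgdl; the function name `bm25_score`, the in-memory list-of-token-lists corpus, and the sample data are assumptions, and it is not how Lucene implements the calculation internally.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k=1.2, b=0.75):
    """BM25 score of one document for a query.

    corpus is a non-empty list of token lists standing in for the index.
    """
    num_docs = len(corpus)
    # avgdl: average document length across the index.
    avgdl = sum(len(d) for d in corpus) / num_docs
    counts = Counter(doc_tokens)
    # Length normalization; b controls how strongly length matters.
    norm = 1.0 - b + b * len(doc_tokens) / avgdl
    score = 0.0
    for t in query_tokens:
        if counts[t] == 0:
            continue  # a term missing from the document contributes nothing
        tf = math.sqrt(counts[t])
        doc_freq = sum(1 for d in corpus if t in d)
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))
        # k controls how quickly repeated occurrences stop adding score.
        score += idf * (tf * (k + 1)) / (tf + k * norm)
    return score

# Hypothetical toy index, already tokenized.
corpus = [
    ["ipad", "case", "leather"],
    ["ipad", "charger"],
    ["phone", "case", "case"],
    ["laptop", "sleeve"],
]
query = ["ipad", "case"]
scores = [bm25_score(query, d, corpus) for d in corpus]
```

With this data the first document, which matches both query terms, scores highest, and the last document, which matches neither, scores zero. Setting `b=0` switches length normalization off, so two documents with the same term count score the same regardless of length.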
15. Measure, Measure, Measure
• Capture and log pretty much everything
• Searches, clicks, time on page, seen/not, etc.
• Precision: Of those shown, what’s relevant?
• Recall: Of all that’s relevant, what was found?
• NDCG: Account for position
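The three metrics named above can be sketched as follows. This is an illustrative version under stated assumptions: document IDs are unique strings, relevance judgments are given as a set (for precision and recall) or as graded gains in the order shown (for NDCG), and NDCG is normalized against the best reordering of the shown results only. The function names and sample data are hypothetical.

```python
import math

def precision(shown, relevant):
    """Of those shown, what fraction is relevant?"""
    if not shown:
        return 0.0
    return sum(1 for d in shown if d in relevant) / len(shown)

def recall(shown, relevant):
    """Of all that is relevant, what fraction was found?"""
    if not relevant:
        return 0.0
    return len(set(shown) & set(relevant)) / len(relevant)

def dcg(gains):
    """Discounted cumulative gain: lower positions count for less."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

shown = ["d1", "d2", "d3", "d4"]        # hypothetical result list
relevant = {"d1", "d3", "d7"}           # hypothetical judgments

p = precision(shown, relevant)          # 2 of 4 shown are relevant
r = recall(shown, relevant)             # 2 of 3 relevant were found
best = ndcg([3, 2, 1])                  # already in ideal order
worse = ndcg([1, 2, 3])                 # best result shown last
```

Precision and recall ignore ordering entirely, which is why a position-aware measure such as NDCG is listed alongside them.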
21. Magic Guessing
Core Information Theory (aka Lucene/Solr)
Search Aids (Facets, Did You Mean, Highlighting)
Machine Learning/NLP (Clicks, Crowd Sourcing, Recs, Personalization, User feedback)
Rules, Domain Specific Knowledge
fuhgeddaboudit
22. Next Generation Relevance
Content
• Core Solr capabilities: text matching, faceting, spell checking, highlighting
• Business Rules for content: landing pages, boost/block, promotions, etc.
Collaboration
• Leverage collective intelligence to predict what users will do based on historical, aggregated data
• Recommenders, Popularity, Search Paths
Context
• Who are you? Where are you? What have you done previously?
• User/Market Segmentation, Roles, Security, Personalization
23. But What About the Real World? Indexing Edition
Machine Learning/NLP: NER, Topic Detection, Clustering, Word2Vec, etc.
Domain Rules: Synonyms, Regexes, Lexical Resources
Extraction
Load Into Spark
Build W2V, PageRank, Topic, Clustering Models
Offline
Content
Models
24. Real World? Query Edition
Query Intent: Strategic, Tactical, Semantic
😊 iPad case
Head/Tail/Clickstream/Recommenders
User Factors: Segmentation, Location, History, Profile, Security
Parse
Domain Specific Rules
Transform
Results
…
Cascading Rerankers
Learn To Rank (multi-model), Bias corrections
25. Real World? Users Edition
Load Into Spark
Signals
Query Analysis
Recommenders/Personalization
😊 iPad case
Query Edition
Raw
Models
Clickstream
Models
26. The Perfect(?!?) Query*
(Exact/Original Match)^X
(Sloppy Phrase)~M^Y
(AND Q)^Z
(OR Q)^XX
(Expansions/Click/Head/Tail Boosts)^YY
(Personalization Biases)^ZZ
({!ltr model=…})
Filters+Options: security, rules, hard preferences, categories
Precision, Recall
edismax handles most
Learn to Rank
X > Y > Z > XX
All weights can be learned
YMMV! Caveat Emptor!
* Note: there are a lot of variations on this.
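To make the layered template concrete, here is a small sketch that assembles the first four clauses as a Lucene-style query string with boosts ordered X > Y > Z > XX. The function name `build_query`, the field name `text`, the default boost and slop values, and the reading of "exact match" as a zero-slop phrase are all assumptions for illustration. The expansion, personalization and learn-to-rank layers are left out because they depend on data and models the slide does not specify, and no escaping of special query characters is attempted.

```python
def build_query(terms, field="text", slop=2, x=10.0, y=5.0, z=2.0, xx=1.0):
    """Assemble the exact / sloppy phrase / AND / OR clauses with boosts.

    All parameter names and defaults are hypothetical, not Solr settings.
    """
    if not (x > y > z > xx):
        raise ValueError("expected boosts ordered X > Y > Z > XX")
    phrase = " ".join(terms)
    all_terms = " AND ".join(terms)
    any_terms = " OR ".join(terms)
    clauses = [
        f'{field}:"{phrase}"^{x}',          # exact phrase, highest boost
        f'{field}:"{phrase}"~{slop}^{y}',   # sloppy phrase within `slop` positions
        f"{field}:({all_terms})^{z}",       # every term must match
        f"{field}:({any_terms})^{xx}",      # any term may match, lowest boost
    ]
    return " ".join(clauses)

q = build_query(["ipad", "case"])
```

The precise clauses come first with the largest boosts, and the looser clauses after them widen recall at lower weight, which is the ordering the slide describes.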
27. Experimentation, Not Editorialization
• Don’t take my word for it, experiment!
• A/B Tests, Multi-arm Bandits
• Good primer:
• http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics
• Rules are fine, as long as they are contained, have a lifespan and are measured for effectiveness
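One common way to read the result of an A/B test on click-through rate is a two-proportion z-test. The sketch below is a minimal illustration of that idea, not a method the slides prescribe; the function name, the click and view counts, and the choice of a two-sided test are assumptions.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click rates B - A."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so nothing to distinguish
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: variant B gets 150 clicks vs 100 for A, on 1000 views each.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference between variants is unlikely to be noise; the same check can be rerun periodically to decide whether a rule is still earning its place.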
29. Lucidworks Fusion Product Suite
The Lucidworks platform provides all of the components needed to create and run smart enterprise and consumer applications
Create rich UI with modular components for web and mobile
Surface the insights that matter most with the power of machine learning and artificial intelligence
Highly scalable search engine and NoSQL datastore that gives you instant access to all your data
Combine the power of the Fusion stack with the simplicity you’d expect in a SaaS-based application
30. Lucidworks Fusion Architecture
Web App Mobile
BI/Analytics
Logs File
Web Database
Box/Dropbox
Elasticsearch
SDK
Sharepoint
Slack
Jive
Connectors
Admin UI
Search
Analytics
Visualization
REST/SQL
Hadoop
Google Drive
Security Built In
Proven Speed CDCR
Extensible Scalable Responsive
NLP: NER, Phrases, POS
Query Intent & Doc Classification
Recommenders
Anomaly Detection
Signals and Query Analytics
Clustering
A/B Testing
ETL and Query Pipelines
Alerting and Messaging
SQL and Catalog
Scheduling
Connectors/Federation
Import/Export
Custom Jobs
Rules
Topic Detection
Custom
Search
Devs
Data Scientists
Business Users
Cross Cutting Features
HDFS (Optional)
31. Fusion: Meeting the Search Challenge
Relevance and Discovery: Personalization, Recommendations, Query Intent, Experimentation (A/B)
Business Support: Query/Doc Simulations, Rules, Analytics, User Interface
Intelligence: Signal Proc., Machine Learning, NLP, Math/Stats
Open & Scalable: Proven Search, Extensible, Simplified DevOps, Real Time