3. Who am I?
• Big Search and Distributed database specialist
• Built a Search as a Service platform
• Lead Search Architect @ Bloomberg Vault
• Credit Derivatives Analytics Engineer @ Bloomberg
• Master's @ Courant Institute of Mathematical Sciences, New York University
• Passionate about Search, Scuba Diving, Motorcycles and German Shepherds
5. Agenda
• Search at Bloomberg
• Goals and Objectives
• A little background
• Factors affecting indexing
• Our tests and benchmarks
• Design for a better NRT indexer
• Future work
• Q/A
11. Indexing Workflow
Consider the sentence: "We were talking about IBM during the fishing trip"

Tokenization
A tokenizer splits the stream of characters into a series of tokens.
[We] [were] [talking] [about] [IBM] [during] [the] [fishing] [trip]

Downcasing
Creates tokens by lowercasing all letters and dropping non-letters.
[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]

Stop Word Removal
Removes common stop words ("and", "or", etc.) which introduce noise into the search process.
[talking] [about] [ibm] [fishing] [trip]

Stemming / Lemmatization
Stemming algorithms reduce words such as "fishing", "fished", "fish" and "fisher" to the root word "fish". Lemmatization expands a word to its inflected forms (e.g. fishing -> fished, fishes, fish, but not fisher).
[talk] [fish]

Synonym Expansion
Maps words based on a thesaurus (synonyms, acronyms, hypernyms, business rules, etc.). For example: talk -> chat, IBM -> "big blue", trip -> journey.
[talk] [big] [blue] [fish] [journey]

(An example Solr analysis chain wiring these stages together is sketched below.)
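As a concrete illustration, these stages map onto the analysis chain of a Solr field type. The snippet below is a minimal sketch of such a chain in schema.xml; the field-type name and the stopwords.txt / synonyms.txt resources are placeholders, not the configuration used at Bloomberg (newer Solr versions would typically use SynonymGraphFilterFactory instead of SynonymFilterFactory).

<!-- Hypothetical field type wiring tokenization, downcasing, stop-word removal,
     synonym expansion and stemming into one analysis chain -->
<fieldType name="text_en_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>       <!-- tokenization -->
    <filter class="solr.LowerCaseFilterFactory"/>             <!-- downcasing -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true" words="stopwords.txt"/>         <!-- stop word removal -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true"
            expand="true"/>                                   <!-- synonym expansion -->
    <filter class="solr.PorterStemFilterFactory"/>            <!-- stemming -->
  </analyzer>
</fieldType>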
12. Designing the Search Index
Designing a good search application also involves many aspects of user interaction that directly influence indexing design:
• Data Type and Data Distribution
• Server-side parameters
• Networking
• Client-side parameters
• Query patterns
14. Data and Distribution of Tokens
Common types of data that we index in a search index
• Textual data (human generated), e.g. messages, news, blogs
• Textual data (machine generated), e.g. logs, tickets
• Numerical data
• Geospatial data
How does this affect search index design?
• Query speed and indexing speed depend on the size of an index
• Size is dependent on:
  • Number of documents in the index
  • Average size of each document
  • Distribution of tokens
  • Index features, e.g. faceting, highlighting
15. Server-side Factors
• Ratio of CPUs to the number of Solr cores running
  • 2 Solr indices per CPU or thread
• Disk space
  • Disk space for the Solr index × 2 (headroom for merge cycles)
• Memory
  • JVM heap
  • Off-heap
  • DocValues (see the schema sketch below)
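For fields that are faceted, sorted or grouped on, DocValues keep that data in column-oriented files that the OS page cache maps off-heap, relieving JVM heap pressure. A minimal sketch follows; the field names are hypothetical and the field types vary by Solr version.

<!-- Hypothetical schema fields: facet/sort data stored as DocValues (off-heap) -->
<field name="timestamp" type="tdate"  indexed="true" stored="true" docValues="true"/>
<field name="category"  type="string" indexed="true" stored="true" docValues="true"/>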
16. Networking
Cluster design considerations:
• Should a cluster span data centers?
  • Latency between data centers
  • Reliability and availability SLAs
• Where does your ZooKeeper ensemble live?
  • How many election members?
  • Consider observers to scale ZooKeeper (a minimal zoo.cfg sketch follows below)
  • Dynamically promote an observer to an election member
Manage concurrent connections on the server
Monitor network latencies for QoS guarantees
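ZooKeeper observers follow the ensemble and serve reads but do not vote, so they add capacity without enlarging the election quorum. A minimal zoo.cfg sketch, with placeholder host names and ports:

# On every node: three voting members plus one observer
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888:observer

# Additionally, in the observer node's own zoo.cfg:
peerType=observer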
17. Client-side Factors
• Managing connections and reusing connections
• Which format to use for indexing data
  • javabin
  • csv
  • json
  • xml
• How many simultaneous threads to use (a SolrJ sketch follows below)
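A minimal SolrJ sketch along these lines, assuming SolrJ 7/8 (the URL, collection and field names are placeholders): ConcurrentUpdateSolrClient buffers documents in a bounded queue and streams batched updates over reused connections from a pool of background threads, using the binary javabin format by default.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexerClient {
    public static void main(String[] args) throws SolrServerException, IOException {
        ConcurrentUpdateSolrClient client =
                new ConcurrentUpdateSolrClient.Builder("http://solr-host:8983/solr/logs")
                        .withQueueSize(10_000)   // documents buffered client-side
                        .withThreadCount(4)      // simultaneous sender threads
                        .build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("body_txt", "we were talking about ibm during the fishing trip");
        client.add(doc);                 // queued; sent in batches by background threads

        client.blockUntilFinished();     // drain the queue before shutting down
        client.close();
    }
}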
18. Experiments with NRT Indexing
It’s not always efficient to send a single document to Solr for indexing
How do you decide how many documents to send?
Collector: a buffer that collects Solr update documents
• Time Triggers (T)
  • Time-based collector on the client side to batch document payloads to Solr
• Document Size Triggers (S)
  • Document-size-based collector on the client side to batch document payloads to Solr
• Document Number Triggers (N)
  • Document-count-based collector on the client side to batch document payloads to Solr
All three collectors are used simultaneously, in order of priority. The lower-priority collectors act as cut-off backstops that guard against overflows. (A simplified sketch of such a collector follows below.)
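A simplified sketch of such a collector (illustrative only, not the production code): documents accumulate in a buffer and a flush happens when whichever trigger fires first, the time window (T), the accumulated payload size (S) or the document count (N). A real implementation would also flush from a timer thread so the time trigger fires even when no new documents arrive.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

/** Simplified client-side collector with time (T), size (S) and count (N) triggers. */
public class BatchCollector {
    private final SolrClient solr;
    private final long maxWindowMs;   // time trigger (T)
    private final long maxBytes;      // document size trigger (S)
    private final int  maxDocs;       // document number trigger (N)

    private final List<SolrInputDocument> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private long windowStart = System.currentTimeMillis();

    public BatchCollector(SolrClient solr, long maxWindowMs, long maxBytes, int maxDocs) {
        this.solr = solr;
        this.maxWindowMs = maxWindowMs;
        this.maxBytes = maxBytes;
        this.maxDocs = maxDocs;
    }

    public synchronized void collect(SolrInputDocument doc, long approxSizeBytes)
            throws SolrServerException, IOException {
        buffer.add(doc);
        bufferedBytes += approxSizeBytes;
        boolean flush = System.currentTimeMillis() - windowStart >= maxWindowMs  // T
                || bufferedBytes >= maxBytes                                     // S (cut-off backstop)
                || buffer.size() >= maxDocs;                                     // N (cut-off backstop)
        if (flush) {
            flush();
        }
    }

    public synchronized void flush() throws SolrServerException, IOException {
        if (!buffer.isEmpty()) {
            solr.add(buffer);            // one batched update request to Solr
            buffer.clear();
            bufferedBytes = 0;
        }
        windowStart = System.currentTimeMillis();
    }
}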
20. Benchmarking Setup
• Client application sending data to a 4-way replicated SolrCloud
• 5-node ZooKeeper ensemble
• All tests done with a similar dataset (machine-generated text)
• We synthesize a high-throughput ingest stream, which serves as our input
• Soft commits set at 1 sec (the corresponding solrconfig.xml settings are sketched below)
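The 1-second soft commit corresponds to solrconfig.xml settings roughly like the sketch below; the hard-commit interval shown is an assumption for illustration, not the benchmark's actual value.

<!-- Soft commits make new documents visible to searchers every second (NRT) -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

<!-- Hard commits flush to stable storage without opening a new searcher -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>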
21. Benchmarking : Time Limit Tests
[Chart: ingestion throughput (docs/sec) vs. time-trigger collection window in ms]
22. Benchmarking : Document Limit Tests
[Chart: ingestion throughput (docs/sec) vs. document-number-trigger collection window in number of documents]
24. Observations
• On average we observed a 5x-7x increase in ingestion throughput
• The optimization parameters depend on constantly changing factors
• The tuning variables need to be constantly adjusted for best performance
• How do we put this to use?
26. PID Controller
Proportional term (P) – present
Output proportional to the current error value
Integral term (I) – past
Sum of the instantaneous error over time; accounts for the accumulated offset that should have been corrected previously
Derivative term (D) – future
Calculated from the slope of the error over time, i.e. its rate of change
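In the standard textbook form, the controller output combines the three terms as
u(t) = Kp·e(t) + Ki·∫₀ᵗ e(τ) dτ + Kd·de(t)/dt,
where e(t) is the error between the setpoint and the measured process variable, and Kp, Ki, Kd are the tuning gains.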
27. PID implementation in the indexer
The client indexer process sends batched updates to SolrCloud via its indexing threads. A sampling thread reads the Solr responses and measures the process variable: docs/sec. A PID controller implementation compares it to the setpoint and computes the control variable, which adjusts one of the triggers, e.g. Time (T), used by the indexing threads. (A simplified sketch of this control loop follows below.)
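A simplified sketch of this control loop (the class name, gains and bounds are illustrative assumptions, not the production implementation): the sampling thread measures docs/sec from recent Solr responses, a discrete PID update computes a correction, and the correction nudges the time trigger (T) handed to the indexing threads.

/** Simplified discrete PID loop adjusting the collector's time trigger (illustrative only). */
public class PidIndexTuner implements Runnable {
    private final double kp = 0.5, ki = 0.1, kd = 0.05;  // gains are assumptions; they need tuning
    private final double targetDocsPerSec;               // setpoint
    private final long sampleIntervalMs = 1_000;

    private double integral = 0;
    private double previousError = 0;
    private volatile long timeTriggerMs = 500;            // control variable (T)

    public PidIndexTuner(double targetDocsPerSec) {
        this.targetDocsPerSec = targetDocsPerSec;
    }

    /** Fold one throughput sample into the controller and update the trigger. */
    void adjust(double measuredDocsPerSec) {
        double dt = sampleIntervalMs / 1000.0;
        double error = targetDocsPerSec - measuredDocsPerSec;  // P: present
        integral += error * dt;                                // I: past
        double derivative = (error - previousError) / dt;      // D: future
        previousError = error;

        double output = kp * error + ki * integral + kd * derivative;
        // Assumes larger batch windows raise throughput, so a positive error widens T;
        // the trigger is clamped to sane bounds.
        timeTriggerMs = Math.max(50L, Math.min(5_000L, timeTriggerMs + Math.round(output)));
    }

    public long currentTimeTriggerMs() { return timeTriggerMs; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(sampleIntervalMs);
                adjust(sampleThroughput());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private double sampleThroughput() {
        // Placeholder: in practice, derived from counts and timings of recent Solr responses.
        return 0;
    }
}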
29. Future work
• Perfect the PID indexer
• Add it to the YCSB benchmarking framework
• Add other server-side parameters to the PID indexer
• Use the PID indexer along with the YCSB framework to size hardware
30. Never Stop Exploring:
Pushing the Limits of Solr
Anirudha Jadhav, Bloomberg LP
QUESTIONS?