SlideShare a Scribd company logo
1 of 94
Optimize Is (Not) Bad For You
Deep Dive Into The Segment Merge Abyss
Rafał Kuć
Sematext Group, Inc.
Agenda
β€’ Segments – where, what & how
β€’ Writing segments
β€’ Modifying segments
β€’ Segment merging – what, where, how, why
β€’ Force merging
β€’ Force merging & SolrCloud
β€’ Performance considerations
β€’ Specialized merge policies
https://github.com/sematext/lr/tree/master/2017/optimize
3
01
Sematext & I
cloud
metrics
logs
&
4
01
Solr Collection Architecture
Zookeeper
5
01
Solr Collection Architecture
Zookeeper
SOLR
SOLR
SOLR
SOLR
6
01
Solr Collection Architecture
Zookeeper
SOLR
shard shard
SOLR
shard shard
SOLR
shard shard
SOLR
shard shard
7
01
Solr Shard Architecture
TLOG
8
01
Solr Shard Architecture
TLOG
Segment Segment Segment
Segment
9
01
Lucene Segment
Segment Info
Field Names
Stored Field Values
Point Values
Term Dictionary
Term Frequency
Term Proximity
Normalization
Per Document Vals
Live Documents
1
01
Inside the Segment – Term Dictionary
TERM DOCID
lucene <1>, <2>
revolution <1>, <2>
washington <1>
boston <2>
_1.tim
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
_1.tip
1
01
Inside the Segment – Doc Values
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID FIELD VALUE
1 Title Lucene Revolution Washington
1 City Washington D.C.
2 Title Lucene Revolution Boston
2 City Boston
_1.dvd
_1.dvm
1
01
Inside the Segment – Stored Fields
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID VALUE
1 Title: Lucene Revolution Washington
City: Washington D.C
2 Title: Lucene Revolution Boston
City: Boston
_1.fdx
_1.fdt
1
01
Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm
1
01
Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm
1
01
Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm
_2.cfs
_2.cfe
1
01
Indexing
1
01
Indexing
1
01
Indexing
1
01
Indexing
level/tier
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Indexing
2
01
Deletes
2
01
Deletes – After Merge
2
01
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
retrieve document
{
"id" : 3,
"tags" : [ "lucene" ],
"awesome" : true
}
3
01
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
apply changes
3
01
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
delete old document
3
01
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
3
01
Atomic Updates – In Place
Works on top of numeric, doc values based fields
Fields need to be not indexed and not stored
Doesn’t require delete/index
Support only inc and set modifers
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
3
01
Atomic Updates – In Place
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
retrieve document
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
3
01
Atomic Updates – In Place
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true,
"views" : 100
}
apply changes
3
01
Atomic Updates – In Place
$ curl -XPOST -H 'Content-Type: application/json'
'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true,
"views" : 100
}
update doc values
3
01
Search – Importance of Segments
Immutable – write once read many
3
01
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
3
01
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
4
01
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
Fewer segments – smaller shard size
4
01
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
Fewer segments – smaller shard size
Rapid segment changes – worse I/O cache usage
4
01
Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
4
01
Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
4
01
Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
Segment Warmer
<mergedSegmentWarmer
class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />
4
01
Taking Control – Default Indexing Throughput
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
4
01
Taking Control – Default Indexing Throughput
throughput < 5k/sec @ ~14GB
4
01
Taking Control – Max Merged Segment Size
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Lower higher indexing throughput – smaller segments
Higher better search latency (depends) – more merges
4
01
Taking Control – Lowering Max Merged Size
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">512</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
4
01
Taking Control – Lowering Max Segment Size
throughput < 5k/sec @ ~15.5GB
11% throughput increase
5
01
Taking Control – Merge At Once
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Lower better search latency (depends)
Higher higher indexing throughput
5
01
Taking Control – Lowering Merge At Once
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">2</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
5
01
Taking Control – Lowering Merge At Once
throughput < 5k/sec @ ~13GB
8% throughput decrease
5
01
Taking Control – Merge At Once Explicit
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Controls number of segments merged at once during force merge
5
01
Taking Control – Segments Per Tier
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Lower value means more merging, but less segments
Along with maxMergeAtOnce can smoothen I/O spikes
For better indexing throughput set maxMergeAtOnce <
segmentsPerTier
5
01
Taking Control – Combined Together
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">30</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">30</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">512</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
5
01
Taking Control – Combined Together
throughput < 5k/sec @ ~15GB
but look at read difference
5
01
Taking Control – Default vs Combined Read/Write
default settings
5
01
Taking Control – Default vs Combined Read/Write
default settings combined changes settings
5
01
Taking Control – Reclaim Deletes Weight
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Controls importance of merging segments with deleted documents
Increase to put priority on merging segments with deleted documents
6
01
Taking Control – No CFS Ratio
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="maxMergeAtOnceExplicit">30</int>
<int name="segmentsPerTier">10</int>
<int name="floorSegmentMB">2048</int>
<int name="maxMergedSegmentMB">5120</int>
<double name="noCFSRatio">0.1</double>
<int name="maxCFSSegmentSizeMB">2048</int>
<double name="reclaimDeletesWeight">2.0</double>
<double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Controls compound file system segments ratio
To completely disable CFS set to 0.0
6
01
Taking Control – Merge Scheduler
Controls maximum number of concurrent merges
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
<int name="maxMergeCount">4</int>
<int name="maxThreadCount">4</int>
</mergeScheduler>
6
01
Taking Control – Merge Scheduler
Controls number of threads dedicated to merging
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
<int name="maxMergeCount">4</int>
<int name="maxThreadCount">4</int>
</mergeScheduler>
6
01
Taking Control – Merge Scheduler
Controls number of threads dedicated to merging
For spinning drives set maxThreadCount to 1
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
<int name="maxMergeCount">4</int>
<int name="maxThreadCount">4</int>
</mergeScheduler>
6
01
Taking Control – Merge Scheduler
Controls number of threads dedicated to merging
For spinning drives set maxThreadCount to 1
For SSD set maxThreadCount to min(4, #CPUs / 2)
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
<int name="maxMergeCount">4</int>
<int name="maxThreadCount">4</int>
</mergeScheduler>
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
Can be very bad or very good – depending on the use case
6
01
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
Can be very bad or very good – depending on the use case
$ curl
'http://solr:8983/solr/lr/update?optimize=true&numSegments=1&waitFlush=false'
7
01
Force Merge – The Good
Improves search speed (fewer segments)
7
01
Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
7
01
Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
Shrinks the index by pruning duplicated data
7
01
Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
Shrinks the index by pruning duplicated data
Reduces number of used files
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
May cause performance issues
7
01
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
May cause performance issues
Will cause temporary increase of disk usage (up to 3x)
7
01
Force Merge – SolrCloud Performance Example
8
01
Force Merge – SolrCloud Performance Example
8
01
Force Merge – Legacy
Index on the master server
Solr Master
Solr Slave
Solr Slave
Solr Slave
index
Documents
8
01
Force Merge – Legacy
Index on the master server
Force merge on the master server
Solr Master
Solr Slave
Solr Slave
Solr Slave
force merge
8
01
Force Merge – Legacy
Index on the master server
Force merge on the master server
Replicate after optimize is done
Solr Master
Solr Slave
Solr Slave
Solr Slave
pull after optimize
8
01
Force Merge – SolrCloud (Solr 7 – pull replicas)
Create collection
Force merge
Solr will do the rest
Solr Solr
Solr Solr
Primary 1
Primary 2 Pull Replica 2
Pull Replica 1
8
01
Force Merge – SolrCloud (NRT, pre 7.0)
Ask yourself if you really need force merge
Solr Solr
Solr Solr
8
01
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Solr Solr
Solr Solr
Primary 1
Primary 2
8
01
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Solr Solr
Solr Solr
Primary 1
Primary 2
DocumentsDocuments
Documents
Documents
8
01
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Force merge
Solr Solr
Solr Solr
Primary 1
Primary 2optimize
8
01
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Force merge
Create replicas
Solr Solr
Solr Solr
Primary 1
Primary 2 Replica 2
Replica 1
9
01
Specialized Merge Policy Example – Sorting
Sorting Merge Policy Factory Example
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">timestamp desc</str>
<str name="wrapper.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
<double name="inner.noCFSRatio">0.1</double>
</mergePolicyFactory>
9
01
Specialized Merge Policy Example – Sorting
Sorting Merge Policy Factory Example
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">timestamp desc</str>
<str name="wrapper.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
<double name="inner.noCFSRatio">0.1</double>
</mergePolicyFactory>
Pre-sorts data during merge for:
- faster range queries
- faster data retrieval
- possibility of early query termination
- convenient for time based data
9
01
http://sematext.com/jobs
You love like we do?
You want to work with ?
Want to work with open source?
You want to do fun stuff?
9
01
Get in touch
RafaΕ‚
rafal.kuc@sematext.com
@kucrafal
http://sematext.com
@sematext http://sematext.com/jobs
Come talk to us
at the booth
Thank You

More Related Content

What's hot

λ°λΈŒμ‹œμŠ€ν„°μ¦ˆ 데이터 레이크 ꡬ좕 이야기 : Data Lake architecture case study (박주홍 데이터 뢄석 및 인프라 νŒ€...
λ°λΈŒμ‹œμŠ€ν„°μ¦ˆ 데이터 레이크 ꡬ좕 이야기 : Data Lake architecture case study (박주홍 데이터 뢄석 및 인프라 νŒ€...λ°λΈŒμ‹œμŠ€ν„°μ¦ˆ 데이터 레이크 ꡬ좕 이야기 : Data Lake architecture case study (박주홍 데이터 뢄석 및 인프라 νŒ€...
λ°λΈŒμ‹œμŠ€ν„°μ¦ˆ 데이터 레이크 ꡬ좕 이야기 : Data Lake architecture case study (박주홍 데이터 뢄석 및 인프라 νŒ€...
Amazon Web Services Korea
Β 
Alfresco search services: Now and Then
Alfresco search services: Now and ThenAlfresco search services: Now and Then
Alfresco search services: Now and Then
Angel Borroy LΓ³pez
Β 

What's hot (20)

Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Β 
Alfresco Content Modelling and Policy Behaviours
Alfresco Content Modelling and Policy BehavioursAlfresco Content Modelling and Policy Behaviours
Alfresco Content Modelling and Policy Behaviours
Β 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2
Β 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
Β 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
Β 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Β 
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby NodeHadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node
Β 
[2D1]Elasticsearch ᄉα…₯α†Όα„‚α…³α†Ό α„Žα…¬α„Œα…₯ᆨᄒα…ͺ
[2D1]Elasticsearch ᄉα…₯α†Όα„‚α…³α†Ό α„Žα…¬α„Œα…₯ᆨᄒα…ͺ[2D1]Elasticsearch ᄉα…₯α†Όα„‚α…³α†Ό α„Žα…¬α„Œα…₯ᆨᄒα…ͺ
[2D1]Elasticsearch ᄉα…₯α†Όα„‚α…³α†Ό α„Žα…¬α„Œα…₯ᆨᄒα…ͺ
Β 
MySQL Index Cookbook
MySQL Index CookbookMySQL Index Cookbook
MySQL Index Cookbook
Β 
λ°λΈŒμ‹œμŠ€ν„°μ¦ˆ 데이터 레이크 ꡬ좕 이야기 : Data Lake architecture case study (박주홍 데이터 뢄석 및 인프라 νŒ€...
λ°λΈŒμ‹œμŠ€ν„°μ¦ˆ 데이터 레이크 ꡬ좕 이야기 : Data Lake architecture case study (박주홍 데이터 뢄석 및 인프라 νŒ€...λ°λΈŒμ‹œμŠ€ν„°μ¦ˆ 데이터 레이크 ꡬ좕 이야기 : Data Lake architecture case study (박주홍 데이터 뢄석 및 인프라 νŒ€...
λ°λΈŒμ‹œμŠ€ν„°μ¦ˆ 데이터 레이크 ꡬ좕 이야기 : Data Lake architecture case study (박주홍 데이터 뢄석 및 인프라 νŒ€...
Β 
Alfresco tuning part1
Alfresco tuning part1Alfresco tuning part1
Alfresco tuning part1
Β 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Β 
Alfresco search services: Now and Then
Alfresco search services: Now and ThenAlfresco search services: Now and Then
Alfresco search services: Now and Then
Β 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Β 
Zdlra copy to cloud
Zdlra copy to cloudZdlra copy to cloud
Zdlra copy to cloud
Β 
Kudu Deep-Dive
Kudu Deep-DiveKudu Deep-Dive
Kudu Deep-Dive
Β 
Jose portillo dev con presentation 1138
Jose portillo   dev con presentation 1138Jose portillo   dev con presentation 1138
Jose portillo dev con presentation 1138
Β 
Technical tips for secure Apache Hadoop cluster #ApacheConAsia #ApacheCon
Technical tips for secure Apache Hadoop cluster #ApacheConAsia #ApacheConTechnical tips for secure Apache Hadoop cluster #ApacheConAsia #ApacheCon
Technical tips for secure Apache Hadoop cluster #ApacheConAsia #ApacheCon
Β 
Cassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patternsCassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patterns
Β 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Β 

Viewers also liked

Viewers also liked (6)

Effective Hive Queries
Effective Hive QueriesEffective Hive Queries
Effective Hive Queries
Β 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
Β 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
Β 
Solr on Docker - the Good, the Bad and the Ugly
Solr on Docker - the Good, the Bad and the UglySolr on Docker - the Good, the Bad and the Ugly
Solr on Docker - the Good, the Bad and the Ugly
Β 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
Β 
How to Run Solr on Docker and Why
How to Run Solr on Docker and WhyHow to Run Solr on Docker and Why
How to Run Solr on Docker and Why
Β 

Similar to Solr Search Engine: Optimize Is (Not) Bad for You

04 data accesstechnologies
04 data accesstechnologies04 data accesstechnologies
04 data accesstechnologies
Bat Programmer
Β 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
Sourcesense
Β 
Letting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search Component
Jay Luker
Β 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
Β 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
Β 

Similar to Solr Search Engine: Optimize Is (Not) Bad for You (20)

Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.
Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.
Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.
Β 
04 data accesstechnologies
04 data accesstechnologies04 data accesstechnologies
04 data accesstechnologies
Β 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
Β 
IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys"
Β 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
Β 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdution
Β 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
Β 
Presto at Tivo, Boston Hadoop Meetup
Presto at Tivo, Boston Hadoop MeetupPresto at Tivo, Boston Hadoop Meetup
Presto at Tivo, Boston Hadoop Meetup
Β 
Letting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search Component
Β 
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...
Β 
Deep Dive On Object Storage: Amazon S3 and Amazon Glacier - AWS PS Summit Can...
Deep Dive On Object Storage: Amazon S3 and Amazon Glacier - AWS PS Summit Can...Deep Dive On Object Storage: Amazon S3 and Amazon Glacier - AWS PS Summit Can...
Deep Dive On Object Storage: Amazon S3 and Amazon Glacier - AWS PS Summit Can...
Β 
Rails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search EngineRails and the Apache SOLR Search Engine
Rails and the Apache SOLR Search Engine
Β 
Rapid prototyping search applications with solr
Rapid prototyping search applications with solrRapid prototyping search applications with solr
Rapid prototyping search applications with solr
Β 
Log Analysis At Scale
Log Analysis At ScaleLog Analysis At Scale
Log Analysis At Scale
Β 
Web analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.comWeb analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.com
Β 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Β 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Β 
Developing on SQL Azure
Developing on SQL AzureDeveloping on SQL Azure
Developing on SQL Azure
Β 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at Vinted
Β 
Adding Search to Relational Databases
Adding Search to Relational DatabasesAdding Search to Relational Databases
Adding Search to Relational Databases
Β 

More from Sematext Group, Inc.

Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at ScaleMetrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Sematext Group, Inc.
Β 

More from Sematext Group, Inc. (20)

Tweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities ExplainedTweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities Explained
Β 
OOPs, OOMs, oh my! Containerizing JVM apps
OOPs, OOMs, oh my! Containerizing JVM appsOOPs, OOMs, oh my! Containerizing JVM apps
OOPs, OOMs, oh my! Containerizing JVM apps
Β 
Is observability good for your brain?
Is observability good for your brain?Is observability good for your brain?
Is observability good for your brain?
Β 
Introducing log analysis to your organization
Introducing log analysis to your organization Introducing log analysis to your organization
Introducing log analysis to your organization
Β 
Monitoring and Log Management for
Monitoring and Log Management forMonitoring and Log Management for
Monitoring and Log Management for
Β 
Introduction to solr
Introduction to solrIntroduction to solr
Introduction to solr
Β 
Building Resilient Log Aggregation Pipeline with Elasticsearch & Kafka
Building Resilient Log Aggregation Pipeline with Elasticsearch & KafkaBuilding Resilient Log Aggregation Pipeline with Elasticsearch & Kafka
Building Resilient Log Aggregation Pipeline with Elasticsearch & Kafka
Β 
Elasticsearch for Logs & Metrics - a deep dive
Elasticsearch for Logs & Metrics - a deep diveElasticsearch for Logs & Metrics - a deep dive
Elasticsearch for Logs & Metrics - a deep dive
Β 
Tuning Solr & Pipeline for Logs
Tuning Solr & Pipeline for LogsTuning Solr & Pipeline for Logs
Tuning Solr & Pipeline for Logs
Β 
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on DockerRunning High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Β 
Top Node.js Metrics to Watch
Top Node.js Metrics to WatchTop Node.js Metrics to Watch
Top Node.js Metrics to Watch
Β 
Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker
Running High Performance and Fault Tolerant Elasticsearch Clusters on DockerRunning High Performance and Fault Tolerant Elasticsearch Clusters on Docker
Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker
Β 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Β 
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
Β 
Docker Logging Webinar
Docker Logging  WebinarDocker Logging  Webinar
Docker Logging Webinar
Β 
Docker Monitoring Webinar
Docker Monitoring  WebinarDocker Monitoring  Webinar
Docker Monitoring Webinar
Β 
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at ScaleMetrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Β 
Tuning Elasticsearch Indexing Pipeline for Logs
Tuning Elasticsearch Indexing Pipeline for LogsTuning Elasticsearch Indexing Pipeline for Logs
Tuning Elasticsearch Indexing Pipeline for Logs
Β 
Solr Anti Patterns
Solr Anti PatternsSolr Anti Patterns
Solr Anti Patterns
Β 
Tuning Solr for Logs
Tuning Solr for LogsTuning Solr for Logs
Tuning Solr for Logs
Β 

Recently uploaded

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
Β 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
Β 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
Β 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
Β 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Β 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
Β 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
Β 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
Β 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Β 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
Β 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Β 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
Β 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Β 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Β 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
Β 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
Β 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Β 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
Β 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
Β 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Β 

Solr Search Engine: Optimize Is (Not) Bad for You

  • 1. Optimize Is (Not) Bad For You Deep Dive Into The Segment Merge Abyss RafaΕ‚ KuΔ‡ Sematext Group, Inc.
  • 2. Agenda β€’ Segments – where, what & how β€’ Writing segments β€’ Modifying segments β€’ Segment merging – what, where, how, why β€’ Force merging β€’ Force merging & SolrCloud β€’ Performance considerations β€’ Specialized merge policies https://github.com/sematext/lr/tree/master/2017/optimize
  • 6. 6 01 Solr Collection Architecture Zookeeper SOLR shard shard SOLR shard shard SOLR shard shard SOLR shard shard
  • 9. 9 01 Lucene Segment Segment Info Field Names Stored Field Values Point Values Term Dictionary Term Frequency Term Proximity Normalization Per Document Vals Live Documents
  • 10. 1 01 Inside the Segment – Term Dictionary TERM DOCID lucene <1>, <2> revolution <1>, <2> washington <1> boston <2> _1.tim Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston _1.tip
  • 11. 1 01 Inside the Segment – Doc Values Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston DOCID FIELD VALUE 1 Title Lucene Revolution Washington 1 City Washington D.C. 2 Title Lucene Revolution Boston 2 City Boston _1.dvd _1.dvm
  • 12. 1 01 Inside the Segment – Stored Fields Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston DOCID VALUE 1 Title: Lucene Revolution Washington City: Washington D.C 2 Title: Lucene Revolution Boston City: Boston _1.fdx _1.fdt
  • 13. 1 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm
  • 14. 1 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm
  • 15. 1 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm _2.cfs _2.cfe
  • 29. 2 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' retrieve document { "id" : 3, "tags" : [ "lucene" ], "awesome" : true }
  • 30. 3 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true } apply changes
  • 31. 3 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true } delete old document
  • 32. 3 01 Atomic Updates $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
  • 33. 3 01 Atomic Updates – In Place Works on top of numeric, doc values based fields Fields need to be not indexed and not stored Doesn’t require delete/index Support only inc and set modifers $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]'
  • 34. 3 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' retrieve document { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
  • 35. 3 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 } apply changes
  • 36. 3 01 Atomic Updates – In Place $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]' { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 } update doc values
  • 37. 3 01 Search – Importance of Segments Immutable – write once read many
  • 38. 3 01 Search – Importance of Segments Immutable – write once read many More segments – slower search speed
  • 39. 3 01 Search – Importance of Segments Immutable – write once read many More segments – slower search speed Fewer segments – faster searches
  • 40. 4 01 Search – Importance of Segments Immutable – write once read many More segments – slower search speed Fewer segments – faster searches Fewer segments – smaller shard size
  • 41. 4 01 Search – Importance of Segments Immutable – write once read many More segments – slower search speed Fewer segments – faster searches Fewer segments – smaller shard size Rapid segment changes – worse I/O cache usage
  • 42. 4 01 Taking Control Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 43. 4 01 Taking Control Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
  • 44. 4 01 Taking Control Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" /> Segment Warmer <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />
  • 45. 4 01 Taking Control – Default Indexing Throughput Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 46. 4 01 Taking Control – Default Indexing Throughput throughput < 5k/sec @ ~14GB
  • 47. 4 01 Taking Control – Max Merged Segment Size Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Lower higher indexing throughput – smaller segments Higher better search latency (depends) – more merges
  • 48. 4 01 Taking Control – Lowering Max Merged Size Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">512</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 49. 4 01 Taking Control – Lowering Max Segment Size throughput < 5k/sec @ ~15.5GB 11% throughput increase
  • 50. 5 01 Taking Control – Merge At Once Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Lower better search latency (depends) Higher higher indexing throughput
  • 51. 5 01 Taking Control – Lowering Merge At Once Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">2</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 52. 5 01 Taking Control – Lowering Merge At Once throughput < 5k/sec @ ~13GB 8% throughput decrease
  • 53. 5 01 Taking Control – Merge At Once Explicit Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Controls number of segments merged at once during force merge
  • 54. 5 01 Taking Control – Segments Per Tier Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Lower value means more merging, but less segments Along with maxMergeAtOnce can smoothen I/O spikes For better indexing throughput set maxMergeAtOnce < segmentsPerTier
  • 55. 5 01 Taking Control – Combined Together Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">30</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">30</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">512</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory>
  • 56. 5 01 Taking Control – Combined Together throughput < 5k/sec @ ~15GB but look at read difference
  • 57. 5 01 Taking Control – Default vs Combined Read/Write default settings
  • 58. 5 01 Taking Control – Default vs Combined Read/Write default settings combined changes settings
  • 59. 5 01 Taking Control – Reclaim Deletes Weight Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Controls importance of merging segments with deleted documents Increase to put priority on merging segments with deleted documents
  • 60. 6 01 Taking Control – No CFS Ratio Merge Policy Factory <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory"> <int name="maxMergeAtOnce">10</int> <int name="maxMergeAtOnceExplicit">30</int> <int name="segmentsPerTier">10</int> <int name="floorSegmentMB">2048</int> <int name="maxMergedSegmentMB">5120</int> <double name="noCFSRatio">0.1</double> <int name="maxCFSSegmentSizeMB">2048</int> <double name="reclaimDeletesWeight">2.0</double> <double name="forceMergeDeletesPctAllowed">10.0</double> </mergePolicyFactory> Controls compound file system segments ratio To completely disable CFS set to 0.0
  • 61. 6 01 Taking Control – Merge Scheduler Controls maximum number of concurrent merges Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"> <int name="maxMergeCount">4</int> <int name="maxThreadCount">4</int> </mergeScheduler>
  • 62. 6 01 Taking Control – Merge Scheduler Controls number of threads dedicated to merging Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"> <int name="maxMergeCount">4</int> <int name="maxThreadCount">4</int> </mergeScheduler>
  • 63. 6 01 Taking Control – Merge Scheduler Controls number of threads dedicated to merging For spinning drives set maxThreadCount to 1 Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"> <int name="maxMergeCount">4</int> <int name="maxThreadCount">4</int> </mergeScheduler>
  • 64. 6 01 Taking Control – Merge Scheduler Controls number of threads dedicated to merging For spinning drives set maxThreadCount to 1 For SSD set maxThreadCount to min(4, #CPUs / 2) Merge Scheduler <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"> <int name="maxMergeCount">4</int> <int name="maxThreadCount">4</int> </mergeScheduler>
  • 65. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive
  • 66. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive Desired number of segments can be specified
  • 67. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive Desired number of segments can be specified Done on all shards at the same time (by default)
  • 68. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive Desired number of segments can be specified Done on all shards at the same time (by default) Can be very bad or very good – depending on the use case
  • 69. 6 01 Optimize aka Force Merge Forces segment merge – usually very expensive Desired number of segments can be specified Done on all shards at the same time (by default) Can be very bad or very good – depending on the use case $ curl 'http://solr:8983/solr/lr/update?optimize=true&numSegments=1&waitFlush=false'
  • 70. 7 01 Force Merge – The Good Improves search speed (fewer segments)
  • 71. 7 01 Force Merge – The Good Improves search speed (fewer segments) Removes deleted documents
  • 72. 7 01 Force Merge – The Good Improves search speed (fewer segments) Removes deleted documents Shrinks the index by pruning duplicated data
  • 73. 7 01 Force Merge – The Good Improves search speed (fewer segments) Removes deleted documents Shrinks the index by pruning duplicated data Reduces number of used files
  • 74. 7 01 Force Merge – The Bad Invalidates operating system I/O cache
  • 75. 7 01 Force Merge – The Bad Invalidates operating system I/O cache Very expensive to perform – rewrites all segments
  • 76. 7 01 Force Merge – The Bad Invalidates operating system I/O cache Very expensive to perform – rewrites all segments Not efficient on changing data
  • 77. 7 01 Force Merge – The Bad Invalidates operating system I/O cache Very expensive to perform – rewrites all segments Not efficient on changing data May cause performance issues
  • 78. 7 01 Force Merge – The Bad Invalidates operating system I/O cache Very expensive to perform – rewrites all segments Not efficient on changing data May cause performance issues Will cause temporary increase of disk usage (up to 3x)
  • 79. 7 01 Force Merge – SolrCloud Performance Example
  • 80. 8 01 Force Merge – SolrCloud Performance Example
  • 81. 8 01 Force Merge – Legacy Index on the master server Solr Master Solr Slave Solr Slave Solr Slave index Documents
  • 82. 8 01 Force Merge – Legacy Index on the master server Force merge on the master server Solr Master Solr Slave Solr Slave Solr Slave force merge
  • 83. 8 01 Force Merge – Legacy Index on the master server Force merge on the master server Replicate after optimize is done Solr Master Solr Slave Solr Slave Solr Slave pull after optimize
  • 84. 8 01 Force Merge – SolrCloud (Solr 7 – pull replicas) Create collection Force merge Solr will do the rest Solr Solr Solr Solr Primary 1 Primary 2 Pull Replica 2 Pull Replica 1
  • 85. 8 01 Force Merge – SolrCloud (NRT, pre 7.0) Ask yourself if you really need force merge Solr Solr Solr Solr
  • 86. 8 01 Force Merge – SolrCloud (NRT replicas, pre 7.0) Ask yourself if you really need force merge Create collection on part of the nodes Solr Solr Solr Solr Primary 1 Primary 2
  • 87. 8 01 Force Merge – SolrCloud (NRT replicas, pre 7.0) Ask yourself if you really need force merge Create collection on part of the nodes Index Solr Solr Solr Solr Primary 1 Primary 2 DocumentsDocuments Documents Documents
  • 88. 8 01 Force Merge – SolrCloud (NRT replicas, pre 7.0) Ask yourself if you really need force merge Create collection on part of the nodes Index Force merge Solr Solr Solr Solr Primary 1 Primary 2optimize
  • 89. 8 01 Force Merge – SolrCloud (NRT replicas, pre 7.0) Ask yourself if you really need force merge Create collection on part of the nodes Index Force merge Create replicas Solr Solr Solr Solr Primary 1 Primary 2 Replica 2 Replica 1
  • 90. 9 01 Specialized Merge Policy Example – Sorting Sorting Merge Policy Factory Example <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory"> <str name="sort">timestamp desc</str> <str name="wrapper.prefix">inner</str> <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str> <int name="inner.maxMergeAtOnce">10</int> <int name="inner.segmentsPerTier">10</int> <double name="inner.noCFSRatio">0.1</double> </mergePolicyFactory>
  • 91. 9 01 Specialized Merge Policy Example – Sorting Sorting Merge Policy Factory Example <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory"> <str name="sort">timestamp desc</str> <str name="wrapper.prefix">inner</str> <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str> <int name="inner.maxMergeAtOnce">10</int> <int name="inner.segmentsPerTier">10</int> <double name="inner.noCFSRatio">0.1</double> </mergePolicyFactory> Pre-sorts data during merge for: - faster range queries - faster data retrieval - possibility of early query termination - convenient for time based data
  • 92. 9 01 http://sematext.com/jobs You love like we do? You want to work with ? Want to work with open source? You want to do fun stuff?