Presentation given at the Hippo meetup on enterprise search, held in Amsterdam. The talk started with a general introduction to search with Lucene, then covered scaling with Solr and the distributed-systems problems that elasticsearch addresses.
Hippo meetup: enterprise search with Solr and elasticsearch
1. 15th January 2013 – Hippo meetup
Luca Cavanna
Software developer & Search consultant at Trifork Amsterdam
luca.cavanna@trifork.nl - @lucacavanna
2. Trifork (aka Jteam/Dutchworks/Orange11)
Focus areas:
– Big data & Search
– Mobile
– Custom solutions
– Knowledge (GOTO Amsterdam)
● Hippo partner
● Hippo related search projects:
– uva.nl
– working on rijksoverheid.nl
3. Agenda
● Search introduction
– Lucene foundation
– Why do we need Solr or elasticsearch?
● Scaling with Solr
● Elasticsearch distributed nature
● Elasticsearch features
4. Apache Lucene
● High-performance, full-featured text search engine
library written entirely in Java
● It indexes documents as collections of fields
● A field is a string based key-value pair
● What data structure does it use under the hood?
5. Inverted index
Documents:
1 The old night keeper keeps the keep in the town
2 In the big old house in the big old gown.
3 The house in the town had the big old keep
4 Where the old night keeper never did sleep.
5 The night keeper keeps the keep in the night
6 And keeps in the dark and sleeps in the light.

term    freq  posting list
and     1     6
big     2     2 3
dark    1     6
did     1     4
gown    1     2
had     1     3
house   2     2 3
in      5     1 2 3 5 6
keep    3     1 3 5
keeper  3     1 4 5
keeps   3     1 5 6
light   1     6
never   1     4
night   3     1 4 5
old     4     1 2 3 4
sleep   1     4
sleeps  1     6
the     6     1 2 3 4 5 6
town    2     1 3
where   1     4
6. Inverted index
● Indexing
– Text analysis
● Tokenization, lowercasing and more
● The inverted index can contain more data
– Term offsets and more
● The inverted index itself doesn't contain the text for
displaying the search results
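The table on slide 5 can be reproduced with a few lines of plain Java. This is only a toy sketch of the idea, not how Lucene implements it: the class name `ToyInvertedIndex` and its methods are invented for illustration, and the "analysis" step is assumed to be nothing more than splitting on non-alphanumeric characters and lowercasing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical class, not a Lucene API). */
public class ToyInvertedIndex {
    // term -> sorted ids of the documents containing it (the posting list)
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Text analysis (tokenize + lowercase), then record the doc id per term. */
    public void add(int docId, String text) {
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            String term = token.toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Sorted ids of the documents that contain the term (empty if none). */
    public List<Integer> postingList(String term) {
        TreeSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }

    /** The "freq" column: number of documents that contain the term. */
    public int frequency(String term) {
        return postingList(term).size();
    }

    /** Builds the index for the six example lines from slide 5. */
    public static ToyInvertedIndex ofExample() {
        String[] lines = {
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light."
        };
        ToyInvertedIndex index = new ToyInvertedIndex();
        for (int i = 0; i < lines.length; i++) {
            index.add(i + 1, lines[i]); // document ids start at 1, as on the slide
        }
        return index;
    }
}
```

Looking up a term is a single map access, which is why search does not need to scan the documents. Note that the structure holds only terms and ids, so the original text has to be stored separately if it must be displayed.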
7. Indexing
● Lucene writes indexes as segments
● Segments are not modifiable: Write-Once
● Each segment is a searchable mini index
● Each segment contains
– Inverted index
– Stored fields
– ...and more
8. Indexing: the commit operation
● Documents are searchable only after a commit!
● Commit also gives durability
● The most expensive operation in Lucene!!!
9. Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)
● With the Lucene near-real time API you don't need a
commit to make new documents searchable
● Less expensive than commit
● Doesn't guarantee durability though
● Exposed as soft commit in Solr 4.0
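The difference between a commit and a near-real-time reopen can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not Lucene code: `ToySegmentIndex`, `refresh()` and `crash()` are made-up names, a "segment" is just an unmodifiable list of strings, and durability is simulated with a counter instead of an fsync.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy model of write-once segments, commit and NRT reopen (hypothetical). */
public class ToySegmentIndex {
    // Searchable, write-once segments; each one is a mini index.
    private final List<List<String>> segments = new ArrayList<>();
    // Documents added since the last refresh or commit: not searchable yet.
    private final List<String> buffer = new ArrayList<>();
    // How many segments a commit has made durable so far.
    private int durableSegments = 0;

    public void addDocument(String text) {
        buffer.add(text);
    }

    /** Near-real-time reopen: buffered docs become searchable, not durable. */
    public void refresh() {
        if (!buffer.isEmpty()) {
            segments.add(Collections.unmodifiableList(new ArrayList<>(buffer)));
            buffer.clear();
        }
    }

    /** Commit: everything becomes searchable AND durable (the costly step). */
    public void commit() {
        refresh();
        durableSegments = segments.size(); // stands in for syncing files to disk
    }

    /** Counts matching documents across all searchable segments. */
    public int search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    /** Simulates a crash: only committed segments survive. */
    public void crash() {
        buffer.clear();
        while (segments.size() > durableSegments) {
            segments.remove(segments.size() - 1);
        }
    }
}
```

In this model a document added after the last commit is visible after `refresh()` but disappears after `crash()`, which mirrors the trade-off described on the slide.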
10. Lucene code example – indexing data
// Configure the writer for Lucene 4.0 with the standard analyzer
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
        new StandardAnalyzer(Version.LUCENE_40));
Directory directory = FSDirectory.open(new File("data"));
IndexWriter writer = new IndexWriter(directory, config);

Document document = new Document();

// id: indexed and stored, but not tokenized (kept as a single exact term)
FieldType idFieldType = new FieldType();
idFieldType.setIndexed(true);
idFieldType.setStored(true);
idFieldType.setTokenized(false);
document.add(new Field("id", "id-1", idFieldType));

// title: indexed (analyzed) and stored, so it can be shown in the results
FieldType titleFieldType = new FieldType();
titleFieldType.setIndexed(true);
titleFieldType.setStored(true);
document.add(new Field("title", "This is the title", titleFieldType));

// description: indexed only, so it is searchable but not returned
FieldType descriptionFieldType = new FieldType();
descriptionFieldType.setIndexed(true);
document.add(new Field("description", "This is the description", descriptionFieldType));

writer.addDocument(document);
// close() commits the pending changes before releasing the index
writer.close();
11. Lucene code example – querying and showing results
// Parse the user's query text (queryAsString) against the "title" field
QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
        new StandardAnalyzer(Version.LUCENE_40));
Query query = queryParser.parse(queryAsString);

Directory directory = FSDirectory.open(new File("data"));
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

// Retrieve the top 10 hits
TopDocs topDocs = indexSearcher.search(query, 10);
System.out.println("Total hits: " + topDocs.totalHits);
for (ScoreDoc hit : topDocs.scoreDocs) {
    // Load the stored fields of each hit and print them
    Document document = indexSearcher.doc(hit.doc);
    for (IndexableField field : document) {
        System.out.println(field.name() + ": " + field.stringValue());
    }
}
12. What's missing?
● A common way to represent documents
● Interface to send documents to (HTTP)
● A way to represent queries
● Interface to send queries to (HTTP)
● Configuration
● Caching
● Distributed infrastructure
● And more....
14. Scaling – why?
‣ The more concurrent searches you run, the slower they
get
‣ Indexing and searching on the same machine will
substantially harm search performance
‣ Segment merging can be a CPU- and I/O-intensive operation
‣ Disk cache invalidation
‣ Failover
16. Solr replication (pull approach)
• Master-slave based solution
• Single machine for indexing data (master)
• Multiple machines for querying (slaves)
• Master is not aware of the slaves
• Slave is aware of the master
• Load balancer responsible for balancing the query
requests
• What about real-time search? No way!
17. SolrCloud
• A set of new distributed capabilities in Solr
• uses Apache Zookeeper as a system of record for
the cluster state, for central configuration, and for
leader election
• Whatever server (shard) you send data to:
• the documents get distributed over the shards
• A shard can be a leader or a replica and contains a
subset of the data
• Easily scale out by adding new Solr nodes
18. elasticsearch
● Distributed search engine built on top of Lucene
● Apache 2 license
● Written in Java
● RESTful
● Created and mainly developed by Shay Banon
● A company behind it: elasticsearch.com
● Regular releases
– Latest release 0.20.2
19. elasticsearch
● Schemaless
– Uses defaults and automatic type guessing
– Custom mappings may be defined if needed
● JSON oriented
● Multi tenancy
– Multiple indexes per node, multiple types per index
● Designed to be distributed from the beginning
● Almost everything is available as API (including
configuration)
● Wide range of administration APIs
20. elasticsearch distributed terminology
● Node: a running instance of elasticsearch which belongs
to a cluster (usually one node per server)
● Cluster: one or more nodes with the same cluster name
● Shard: a single Lucene instance. A low-level worker unit
managed by elasticsearch. An index is split into one or
more shards.
● Index: a logical namespace which points to one or more
shards
– Your code won't deal directly with a shard, only with
an index
– But an index is composed of multiple Lucene indexes
(one per shard)
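How a document ends up in one particular shard can be sketched with simple hash-based routing. This is an illustration only: `ShardRouter` is a hypothetical class, and using `String.hashCode()` modulo the shard count is an assumption standing in for whatever hash function the real engines use.

```java
/** Toy hash-based document routing (hypothetical, not an elasticsearch API). */
public class ShardRouter {
    private final int numberOfShards;

    public ShardRouter(int numberOfShards) {
        if (numberOfShards < 1) {
            throw new IllegalArgumentException("an index needs at least one shard");
        }
        this.numberOfShards = numberOfShards;
    }

    /** Maps a document id to a shard number in the range [0, numberOfShards). */
    public int shardFor(String documentId) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(documentId.hashCode(), numberOfShards);
    }
}
```

The client only ever names the index and the document id; the same id always maps to the same shard, so a later get or update for that id can be sent straight to the right place. In this sketch, changing the shard count changes the mapping for existing ids.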
21. elasticsearch distributed terminology
● More shards:
– improve indexing performance
– increase data distribution (depends on # of nodes)
– Watch out: each shard has a cost as well!
● More replicas:
– improve failover (fault tolerance)
– improve querying performance
22. Transaction Log
• Indexed docs are fully persistent
• No need for a Lucene IndexWriter#commit
• Managed using a transaction log / WAL
• Full single node durability (survives kill -9)
• Utilized when doing hot relocation of shards
• Periodically “flushed” (calling IW#commit)
• Durability and real time search together!
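The write-ahead idea can be sketched in plain Java. This is a toy model under stated assumptions, not elasticsearch's implementation: `ToyTranslog`, the two file names and the one-document-per-line format are all invented for illustration, and documents are assumed to contain no line breaks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Toy shard with a write-ahead log (hypothetical class and file layout). */
public class ToyTranslog {
    private final Path commitFile; // stands in for committed Lucene segments
    private final Path logFile;    // the transaction log
    private final List<String> docs = new ArrayList<>();
    private int committedCount;

    /** Opens the shard: loads committed docs, then replays the log. */
    public ToyTranslog(Path dir) {
        commitFile = dir.resolve("committed.txt");
        logFile = dir.resolve("translog.txt");
        try {
            Files.createDirectories(dir);
            if (Files.exists(commitFile)) {
                docs.addAll(Files.readAllLines(commitFile, StandardCharsets.UTF_8));
            }
            committedCount = docs.size();
            if (Files.exists(logFile)) {
                docs.addAll(Files.readAllLines(logFile, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends to the log (synced to disk) before updating the in-memory state. */
    public void index(String doc) {
        try {
            Files.write(logFile, (doc + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                    StandardOpenOption.SYNC);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        docs.add(doc);
    }

    /** The periodic "flush": persist pending docs, then empty the log. */
    public void flush() {
        try {
            Files.write(commitFile, docs.subList(committedCount, docs.size()),
                    StandardCharsets.UTF_8, StandardOpenOption.CREATE,
                    StandardOpenOption.APPEND, StandardOpenOption.SYNC);
            Files.write(logFile, new byte[0], StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING);
            committedCount = docs.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public List<String> documents() {
        return new ArrayList<>(docs);
    }
}
```

Opening a second `ToyTranslog` on the same directory plays the role of a restart after the process was killed: documents indexed after the last flush come back from the log. A real implementation also has to handle a crash between the two writes in `flush()`, which this sketch ignores.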
38. Indexing (Push) - ElasticSearch
• Documents added through push requests
• Full JSON Object representation of Documents supported
• Embedded objects
• First-class parent/child relationships and versioning
• Near Realtime index refreshing available
• Realtime get supported
Example document:
{
  "name": "Luca Cavanna",
  "location": {
    "city": "Amsterdam",
    "country": "The Netherlands"
  }
}
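Pushing the document above to elasticsearch is a plain HTTP request, so it can be sketched with the JDK's own HTTP client (Java 11 or later) rather than a dedicated client library. The index name `people`, the type `person`, the id `1` and the `localhost:9200` address are assumptions for illustration; the `PushIndexingExample` class and its methods are hypothetical helpers, not part of any elasticsearch API.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch: index a JSON document with PUT /{index}/{type}/{id}. */
public class PushIndexingExample {
    // The example document from the slide, with an embedded "location" object.
    public static final String DOCUMENT =
            "{\n"
            + "  \"name\": \"Luca Cavanna\",\n"
            + "  \"location\": {\n"
            + "    \"city\": \"Amsterdam\",\n"
            + "    \"country\": \"The Netherlands\"\n"
            + "  }\n"
            + "}";

    /** Builds the index request without sending it. */
    public static HttpRequest indexRequest(String baseUrl, String index,
                                           String type, String id, String json) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/" + type + "/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    /** Sends a request and returns the response body as text. */
    public static String send(HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With a node running locally, `send(indexRequest("http://localhost:9200", "people", "person", "1", DOCUMENT))` would index the document. No mapping has to be defined first, because field types are guessed from the JSON.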
39. Indexing (Pull) - ElasticSearch
• Data flows from sources using ‘Rivers’
• Continues to add data as it ‘flows’
• Can be added, removed, configured dynamically
• Out-of-the-box support for CouchDB, Twitter (implemented by the es
team)
• Community implementations for DBs, other NoSQL and Solr
40. Searching - ElasticSearch
• Search request in Request Body
• Powerful and extensible Query DSL
• Separation of Query and Filters
• Named Filters allowing tracking of which Documents matched which
Filters
• By default storing the source of each document (_source field)
• Catch all feature enabled by default (_all field)
• Sorting of results
• Highlighting, Faceting, Boosting...and more
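A search request with the query and the filter kept separate might look like the sketch below, again using the JDK HTTP client (Java 11 or later). The index name `people`, the field names and the `localhost:9200` address are assumptions carried over from the example document, and `SearchRequestExample` is a hypothetical helper. The `filtered` query reflects the Query DSL of the elasticsearch releases current at the time of the talk; later versions express the same thing differently.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Sketch: a search request whose body separates the query from the filter. */
public class SearchRequestExample {
    // Full-text query on "name", restricted by an exact-term filter on the city.
    // The filter value is lowercase because the default analysis lowercases terms.
    public static final String SEARCH_BODY =
            "{\n"
            + "  \"query\": {\n"
            + "    \"filtered\": {\n"
            + "      \"query\":  { \"match\": { \"name\": \"luca\" } },\n"
            + "      \"filter\": { \"term\": { \"location.city\": \"amsterdam\" } }\n"
            + "    }\n"
            + "  },\n"
            + "  \"size\": 10\n"
            + "}";

    /** Builds a POST {baseUrl}/{index}/_search request carrying the body above. */
    public static HttpRequest searchRequest(String baseUrl, String index) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(SEARCH_BODY))
                .build();
    }
}
```

The query part contributes to relevance scoring, while the filter only includes or excludes documents, which is what makes filters good candidates for caching.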
42. Thanks
There would be a lot more to say:
• Query DSL
• Scripting module (pluggable implementation)
• Percolator
• Running it embedded
Check them out yourself if you are interested!
Questions?