Nested documents and parent/child relationships allow complex queries in ElasticSearch that mimic joins. Nested documents store child documents within parent documents and allow efficient subqueries on the child fields. Parent/child relationships index child documents separately but link them to parent documents, allowing independent updates while still combining parent and child fields in queries. Both approaches allow join-like queries without the overhead of relational joins.
1. Nested & Parent/Child Docs
hidden gems in ElasticSearch
Anne Veling | ElasticSearch NL Meetup | February 26, 2013
2. agenda
Refworks Flow
Reference Manager for Researchers
Use of ElasticSearch in Flow
Use Case 1: Nested documents
Use Case 2: Parent/Child relations
Lessons Learned
9. Reference Canonicalization
We built a large Citation Authority index in ElasticSearch
With full, deduped metadata for a large portion of English scientific research
In the Reference Edit screen
Try to find high quality matches to a large index of canonical references of scientific articles
Based on known fields
Title, possibly partial and incorrect
Author(s)
Other identifying fields: journal, year, …
11. problem
Searching on a sub-document
Searching for all documents where
authors.lastName: “Russell”
authors.firstNames: “G”
Also matches documents by
“Jack Russell and Frederickson, G”
We need a sub-document JOIN query…
Combined with other information on the parent document (title)
Oh noes! We're using a NoSQL database, so we can't…
Can't we?
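The false match above can be reproduced without a search engine. The following is a minimal sketch, assuming that without sub-document support each field of an object array is indexed as one flat bag of values per document; the helper names (`flatten`, `flat_match`, `per_author_match`) and the sample title are hypothetical, and exact string equality stands in for real text analysis.

```python
def flatten(doc):
    """Mimic flat indexing of an object array: one bag of values per field path."""
    flat = {"title": [doc["title"]]}
    for author in doc["authors"]:
        for key, value in author.items():
            flat.setdefault("authors." + key, []).append(value)
    return flat


def flat_match(doc, last_name, first_names):
    """Each clause is checked against the whole document, not against one author."""
    flat = flatten(doc)
    return (last_name in flat.get("authors.lastName", [])
            and first_names in flat.get("authors.firstNames", []))


def per_author_match(doc, last_name, first_names):
    """What we actually want: both clauses must hold for the same author."""
    return any(a["lastName"] == last_name and a["firstNames"] == first_names
               for a in doc["authors"])


# Hypothetical sample document with the two authors from the slide.
doc = {
    "title": "Example article",
    "authors": [
        {"firstNames": "Jack", "lastName": "Russell"},
        {"firstNames": "G", "lastName": "Frederickson"},
    ],
}

print(flat_match(doc, "Russell", "G"))        # flat view: matches
print(per_author_match(doc, "Russell", "G"))  # per-author view: does not match
```

The flat view matches because “Russell” and “G” both occur somewhere in the document, even though no single author carries both values.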
12. Lucene block indexing
(Diagram: query, term, term, lucene documents)
Save “children” documents always right before their “parent” document
Requires you to write
BlockJoinQuery
ParentsFilter
ChildQuery
ToParentBlockJoinQuery
This means: all children (and parent!) need to be reindexed upon any change in them…
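The block layout can be illustrated with a small in-memory model. This is only a sketch of the idea, not Lucene's implementation: it assumes documents get consecutive IDs, children are written directly before their parent, and a sorted list of parent IDs plays the role of the parents filter. The function names `index_blocks` and `join_to_parents` and the sample articles are hypothetical.

```python
from bisect import bisect_left


def index_blocks(articles):
    """Write each article as a block: all author children first, then the parent."""
    docs, parent_ids = [], []
    for article in articles:
        for author in article["authors"]:
            docs.append({"kind": "child", **author})
        parent_ids.append(len(docs))  # doc ID the parent is about to receive
        docs.append({"kind": "parent", "title": article["title"]})
    return docs, parent_ids


def join_to_parents(docs, parent_ids, child_matches):
    """Map every matching child to the first parent ID that follows it."""
    matched = []
    for doc_id, doc in enumerate(docs):
        if doc["kind"] == "child" and child_matches(doc):
            parent_id = parent_ids[bisect_left(parent_ids, doc_id)]
            if not matched or matched[-1] != parent_id:
                matched.append(parent_id)
    return [docs[p]["title"] for p in matched]


# Hypothetical sample data.
articles = [
    {"title": "A", "authors": [{"firstNames": "Jack", "lastName": "Russell"},
                               {"firstNames": "G", "lastName": "Frederickson"}]},
    {"title": "B", "authors": [{"firstNames": "G", "lastName": "Russell"}]},
]
docs, parent_ids = index_blocks(articles)


def is_g_russell(child):
    return child["lastName"] == "Russell" and child["firstNames"] == "G"


print(join_to_parents(docs, parent_ids, is_g_russell))
```

Because a parent is found purely by position, changing one child means rewriting the whole block, which is the reindexing cost the slide mentions.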
17. “nested”
Just setting the subdocument type to “nested” in mapping
Combine parent-query with “nested” query that specifies the path
Complex subcombination JOIN operations
Automatic hiding of “nested” subdocuments
This will increase your index size
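As a rough illustration of the two steps above, the request bodies might be built like this. It is a sketch under assumptions: the type name `article` and the helper names are hypothetical, the `string` field type reflects ElasticSearch versions from around the time of this talk, and `match` is used where the slides mention the older “text” query.

```python
import json


def nested_mapping():
    """Mapping body that marks the authors sub-documents as "nested"."""
    return {
        "article": {  # hypothetical type name
            "properties": {
                "title": {"type": "string"},
                "authors": {
                    "type": "nested",
                    "properties": {
                        "lastName": {"type": "string"},
                        "firstNames": {"type": "string"},
                    },
                },
            }
        }
    }


def nested_author_query(title, last_name, first_names):
    """Parent clause on title combined with a nested clause on one author."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"title": title}},
                    {
                        "nested": {
                            "path": "authors",
                            "query": {
                                "bool": {
                                    "must": [
                                        {"match": {"authors.lastName": last_name}},
                                        {"match": {"authors.firstNames": first_names}},
                                    ]
                                }
                            },
                        }
                    },
                ]
            }
        }
    }


print(json.dumps(nested_author_query("market elasticity", "Russell", "G"), indent=2))
```

Both author clauses sit inside the same “nested” query, so they have to hold for a single author sub-document rather than for the article as a whole.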
18. “nested”
Efficient!
ElasticSearch handles document updates
Child-whereclauses handled INSIDE parent query docEnum
Children are sharded with their parents => locality!
Facet counts (on parent) still correct!
Limitations
Combinations of nested subdocuments with other queries
Like “dis_max”, or “text”
No automatic recognition of “authors.lastName” in other queries as a “nested” subquery
22. problem
How to index both Doc metadata and Pages text
Doc in Flow app
Pages only in PDF pipeline and on S3
Docs updated frequently, on the Flow app
Reindex Page would require download of text content from S3…
Nested Docs?
No; too slow for updates here…
23. solution
Parent/Child documents in ElasticSearch!
Declare the parent type in the child type's mapping
To index a child, specify the parent ID
Stored as “_parent” field on the child
Query
Combine parent query with “has_child” child-query
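The three steps above (mapping, indexing with a parent ID, querying) could look roughly as follows. This is a sketch assuming the `_parent` mapping and `parent` URL parameter of ElasticSearch versions from around the time of this talk; the index name `flow`, the type names `doc` and `page`, the field names `title` and `text`, and all helper names are hypothetical.

```python
from urllib.parse import quote


def page_mapping(parent_type="doc"):
    """Mapping body for the child type: it names its parent type."""
    return {"page": {"_parent": {"type": parent_type}}}


def child_index_path(index, child_type, child_id, parent_id):
    """URL path for indexing a child; the parent ID travels as a query parameter."""
    return "/{}/{}/{}?parent={}".format(
        index, child_type, quote(child_id, safe=""), quote(parent_id, safe=""))


def docs_with_matching_pages(doc_title, page_text):
    """Parent clause on Doc metadata combined with a has_child clause on Pages."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"title": doc_title}},
                    {"has_child": {
                        "type": "page",
                        "query": {"match": {"text": page_text}},
                    }},
                ]
            }
        }
    }


print(child_index_path("flow", "page", "p1", "doc42"))
```

A Doc can then be reindexed on its own, without touching the Page documents that point to it.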
26. NOT SO SURE WHO IS PARENT, WHO IS CHILD
IN PARENT-CHILD RELATIONS
27. conclusions
Parent/Child 'remote key' solution in ElasticSearch
Easy connection of two types of documents with separate update cycles
Complex JOIN queries possible, combining parent & child fields
Slower than “nested”
Locality principle: Children always sharded with parent
Limitations
has_child filter returns only parents, cannot return child data
But: has_parent filter
ElasticSearch caches parent-child ID table in heap…
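For the opposite direction mentioned above, a has_parent clause returns children whose parent matches. A minimal sketch, shown in query form rather than filter form and using the same hypothetical `doc` type and `title` field:

```python
def pages_of_matching_docs(doc_title):
    """Return child (Page) hits whose parent Doc matches the title clause."""
    return {
        "query": {
            "has_parent": {
                "parent_type": "doc",  # hypothetical parent type name
                "query": {"match": {"title": doc_title}},
            }
        }
    }


print(pages_of_matching_docs("market elasticity"))
```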
28. conclusions
Complex join-style queries can be done with ElasticSearch
Easily
Efficiently
Use “nested” types
If data can be duplicated
Very efficient
Use “parent/child” types
For real independently updateable documents
SELECT * FROM ARTICLES
LEFT JOIN AUTHORS ON AUTHORS.ARTICLEID = ARTICLES.ID
WHERE ARTICLES.TITLE MATCHES "market elasticity"
AND AUTHORS.LASTNAME MATCHES "Russell"
AND AUTHORS.FIRSTNAME MATCHES "G"
29. conclusions
ElasticSearch rocks
Hides the complex mapping from JSON documents to Lucene's key/value model
Allows you to easily use more of Lucene greatness
So you can focus on actual queries and use cases
NoSql does not mean NoJoins
It just forces you to model in such a way that joins will be efficient