Nested and Parent/Child Docs in ElasticSearch

Nested & Parent/Child Docs
hidden gems in ElasticSearch

Anne Veling | ElasticSearch NL Meetup | February 26, 2013

agenda
Refworks Flow
Reference Manager for Researchers

Use of ElasticSearch in Flow

Use Case 1: Nested documents

Use Case 2: Parent/Child relations

Lessons Learned

introduction
Anne Veling, @anneveling

Self-employed contractor
Software Architect
Agile process management
Performance optimization
Lucene/SOLR/ElasticSearch implementations & training

architecture

[Architecture diagram: Flow, Mongo, Elastic Search, Citation Authority, PDF Pipeline]

Citation Canonicalization
Use Case 1

Reference Canonicalization
We built a large Citation Authority index in ElasticSearch
With full, deduped metadata for a large portion of English
scientific research

In the Reference Edit screen
Try to find high quality matches to a large index of canonical
references of scientific articles
Based on known fields
Title, possibly partial and incorrect
Author(s)
Other identifying fields: journal, year, …

{
  "query": {
    "bool": {
      "must": [
        {
          "text": {
            "title": "market elasticity"
          }
        },
        {
          "text": {
            "authors.lastName": "Russell"
          }
        },
        {
          "text": {
            "authors.firstNames": "G"
          }
        }
      ]
    }
  }
}

problem
Searching on a sub-document
Searching for all documents where
authors.lastName: “Russell”
authors.firstNames: “G”
Also matches documents by
“Jack Russell and Frederickson, G”
We need a sub-document JOIN query…
Combined with other information on the parent document (title)
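
The false match above can be reproduced with a small sketch. It assumes that a plain (non-nested) object array is flattened into one multi-valued field per property, and it uses exact string equality in place of real text analysis; the helper names `flatten`, `matches_flat` and `matches_per_author` are made up for illustration.

```python
from typing import Any

def flatten(doc: dict[str, Any], prefix: str = "") -> dict[str, list[Any]]:
    """Flatten a JSON-like document into field path -> list of values."""
    fields: dict[str, list[Any]] = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Sub-objects lose their boundaries: values are merged per path.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields.setdefault(sub_path, []).extend(sub_values)
            else:
                fields.setdefault(path, []).append(item)
    return fields

doc = {
    "title": "Market elasticity",
    "authors": [
        {"lastName": "Russell", "firstNames": "Jack"},
        {"lastName": "Frederickson", "firstNames": "G"},
    ],
}
flat = flatten(doc)

def matches_flat(fields, last_name, first_names):
    """Both values exist somewhere in the document, in any author."""
    return (last_name in fields.get("authors.lastName", [])
            and first_names in fields.get("authors.firstNames", []))

def matches_per_author(source, last_name, first_names):
    """Both values must belong to the same author sub-document."""
    return any(a.get("lastName") == last_name and a.get("firstNames") == first_names
               for a in source["authors"])

false_positive = matches_flat(flat, "Russell", "G")       # matches across authors
per_author = matches_per_author(doc, "Russell", "G")      # no single author matches
```

The per-author check is the behaviour the sub-document JOIN is supposed to deliver.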

Oh noes! We're using a NoSQL database, so we can't…
Can't we?

Lucene block indexing
[Diagram: query, terms, documents, Lucene documents]

Save “children” documents always right before their “parent” document
Requires you to write
BlockJoinQuery
ParentsFilter
ChildQuery
ToParentBlockJoinQuery

This means: all children (and the parent!) need to be reindexed upon any
change in them…
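
The block layout can be illustrated with a toy model. This is a plain-Python sketch of the idea only, not the Lucene API: `index`, `add_block` and `to_parent_block_join` are hypothetical names, and a list position stands in for a Lucene document id.

```python
# Flat list of "Lucene documents"; the position in the list is the doc id.
index: list[dict] = []

def add_block(parent: dict, children: list[dict]) -> None:
    """Write the children first and the parent last, as one contiguous block."""
    for child in children:
        index.append({"kind": "child", **child})
    index.append({"kind": "parent", **parent})

def to_parent_block_join(child_matches) -> list[int]:
    """Return ids of parents with at least one matching child in their block."""
    hits = []
    block_has_match = False
    for doc_id, doc in enumerate(index):
        if doc["kind"] == "child":
            block_has_match = block_has_match or child_matches(doc)
        else:
            # A parent closes the block: report it and reset for the next block.
            if block_has_match:
                hits.append(doc_id)
            block_has_match = False
    return hits

add_block({"title": "Market elasticity"},
          [{"lastName": "Russell", "firstNames": "Jack"},
           {"lastName": "Frederickson", "firstNames": "G"}])
add_block({"title": "Price theory"},
          [{"lastName": "Russell", "firstNames": "G"}])

hits = to_parent_block_join(
    lambda c: c["lastName"] == "Russell" and c["firstNames"] == "G")
```

Because the join relies on children sitting directly before their parent, changing any document in a block means rewriting the whole block.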

mapping
authors: {
  properties: {
    rawName: {
      analyzer: "caName",
      type: "string"
    },
    lastName: {
      type: "string"
    },
    firstNames: {
      null_value: "__NONAME",
      type: "string"
    }
  },
  type: "nested"
},
title: {
  analyzer: "caText",
  type: "string"
}

query

(title:"market elasticity") AND (
  authors: (
    (lastName:"Russell") AND (
      (firstNames:"G") OR
      (firstNames:"g*") OR
      (lastName:"Russell" AND NOT(firstNames))
    )
  )
)

{
  "bool" : {
    "must" : [ {
      "text" : {
        "title" : {
          "query" : "market elasticity",
          "type" : "phrase",
          "slop" : 2
        }
      }
    }, {
      "nested" : {
        "query" : {
          "bool" : {
            "should" : [ {
              "bool" : {
                "must" : [ {
                  "text" : {
                    "lastName" : {
                      "query" : "Russell",
                      "type" : "boolean"
                    }
                  }
                }, {
                  "bool" : {
                    "must" : {
                      "bool" : {
                        "should" : [ {
                          "text" : {
                            "firstNames" : {
                              "query" : "G",
                              "type" : "boolean"
                            }
                          }
                        }, {
                          "prefix" : {
                            "firstNames" : "g"
                          }
                        } ]
                      }
                    }
                  }
                } ]
              }
            }, {
              "filtered" : {
                "query" : {
                  "text" : {
                    "lastName" : {
                      "query" : "Russell",
                      "type" : "boolean",
                      "operator" : "AND"
                    }
                  }
                },
                "filter" : {
                  "bool" : {
                    "must" : {
                      "missing" : {
                        "field" : "firstNames"
                      }
                    }
                  }
                }
              }
            } ]
          }
        },
        "path" : "authors"
      }
    } ]
  }
},

“nested”
Just setting the subdocument type to “nested” in mapping

Combine parent-query with “nested” query that specifies
the path

Complex subcombination JOIN operations

Automatic hiding of “nested” subdocuments
This will increase your index size
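
As a sketch of combining a parent query with a “nested” query, the snippet below builds the request body as plain Python dictionaries. The helpers `text_query` and `nested_query` are hypothetical, the `text` query type is kept from the slides, and the fully qualified `authors.` field names inside the nested clause are an assumption.

```python
import json

def text_query(field: str, value: str) -> dict:
    """A single-field "text" query, as used throughout these slides."""
    return {"text": {field: value}}

def nested_query(path: str, inner: dict) -> dict:
    """Wrap a query so it is evaluated per sub-document under `path`."""
    return {"nested": {"path": path, "query": inner}}

query = {
    "bool": {
        "must": [
            # Clause on the parent document
            text_query("title", "market elasticity"),
            # Both author clauses must hold for the SAME author sub-document
            nested_query("authors", {
                "bool": {
                    "must": [
                        text_query("authors.lastName", "Russell"),
                        text_query("authors.firstNames", "G"),
                    ]
                }
            }),
        ]
    }
}

body = json.dumps({"query": query}, indent=2)
print(body)
```

The only structural difference from the first query in this deck is the `nested` wrapper with its `path`.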

“nested”
Efficient!
ElasticSearch handles document updates
Child where-clauses are handled INSIDE the parent query's docEnum
Children are sharded with their parents => locality!

Facet counts (on parent) still correct!

Limitations
Combining nested subdocuments with other queries
Like “dis_max”, or “text”
No automatic rewriting of “authors.lastName” in other queries
into a “nested” subquery

Multipage Indexing
Use Case 2

architecture

[Architecture diagram: Flow, Mongo, Elastic Search, Citation Authority, PDF Pipeline, S3, with “doc” and “page” items]

problem
How to index both Doc metadata and Pages text
Doc in Flow app
Pages only in PDF pipeline and on S3
Docs updated frequently, on the Flow app
Reindexing a Page would require downloading its text content from S3…

Nested Docs?
No; too slow for updates here…

solution
Parent/Child documents in ElasticSearch!

Declare the parent type in the child type's mapping
To index a child, specify the parent ID
Stored as “_parent” field on the child

Query
Combine parent query with “has_child” child-query
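
The idea can be modelled in a few lines of plain Python. This is a conceptual sketch, not how ElasticSearch implements it: `index_item`, `index_page` and `has_child` are hypothetical helpers, and whitespace tokenization stands in for a real analyzer.

```python
from collections import defaultdict

# Parents ("item") and children ("itemtext") are kept separately,
# so each side can be reindexed without touching the other.
items: dict[str, dict] = {}
pages_by_parent: dict[str, list[dict]] = defaultdict(list)

def index_item(item_id: str, source: dict) -> None:
    """Index (or overwrite) a parent document."""
    items[item_id] = source

def index_page(parent_id: str, source: dict) -> None:
    """Index a child document; the parent ID plays the role of "_parent"."""
    pages_by_parent[parent_id].append(source)

def has_child(child_matches) -> set[str]:
    """Return the IDs of parents that have at least one matching child."""
    return {parent_id
            for parent_id, pages in pages_by_parent.items()
            if parent_id in items and any(child_matches(p) for p in pages)}

index_item("item-1", {"title": "Untitled"})
index_item("item-2", {"title": "Price theory"})
index_page("item-1", {"text": "demand elasticity in markets"})
index_page("item-2", {"text": "a history of prices"})

# A frequent metadata update touches only the parent; no page text is needed.
index_item("item-1", {"title": "Market elasticity"})

hits = has_child(lambda page: "elasticity" in page["text"].split())
```

Note that the result contains parent IDs only, which mirrors the “returns only parents” behaviour of has_child.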

itemtext: {
  properties: {
    text: {
      analyzer: "pqdText",
      type: "string"
    }
  },
  _parent: {
    type: "item"
  }
}

{
  "bool" : {
    "must" : [ {
      "bool" : {
        "should" : [ {
          "query_string" : {
            "query" : "elasticity",
            "fields" : [
              "item.reference.title^2.0",
              "item.reference.authors.lastName^1.5",
              "item.reference.authors.firstNames",
              "item.reference.authors.rawName",
              "item.reference.contributors.lastName",
              "item.reference.contributors.firstNames",
              "item.reference.contributors.rawName",
              "item.reference.abstr",
              "item.reference.publication.title^1.5",
              "item.reference.publication.issn",
              "item.reference.publication.isbn",
              "item.reference.publication.abbrev",
              "item.reference.series.editors.lastName",
              "item.reference.series.editors.firstNames",
              "item.reference.series.rawName",
              "item.reference.series.title",
              "item.reference.publisher.name",
              "item.reference.publisher.location",
              "item.reference.publisher.department",
              "item.reference.userNotes",
              "item.annotations.note^0.5" ],
            "use_dis_max" : true,
            "default_operator" : "and"
          }
        }, {
          "has_child" : {
            "query" : {
              "text" : {
                "text" : {
                  "query" : "elasticity",
                  "type" : "boolean",
                  "operator" : "AND"
                }
              }
            },
            "type" : "itemtext",
            "boost" : 0.1
          }
        } ]
      }
    }, {
      "term" : {
        "userId" : "user:50a3bd090364f635f24c713c"
      }
    } ]
  }
}

NOT SO SURE WHO IS PARENT, WHO IS C

IN PARENT-CHILD RELATIONS

conclusions
Parent/Child 'remote key' solution in ElasticSearch
Easy connection of two types of documents with
separate update cycles
Complex JOIN queries possible, combining parent & child
fields
Slower than “nested”
Locality principle: Children always sharded with parent
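
The locality principle can be sketched as routing on the parent ID. This is an illustration under assumptions: `NUM_SHARDS` is an arbitrary value, and `zlib.crc32` is a stand-in hash, not the function ElasticSearch actually uses.

```python
import zlib

NUM_SHARDS = 5  # assumed shard count, for illustration only

def shard_for(routing_key: str) -> int:
    """Map a routing key to a shard number (stand-in hash function)."""
    return zlib.crc32(routing_key.encode("utf-8")) % NUM_SHARDS

def shard_for_parent(parent_id: str) -> int:
    # A parent is routed by its own ID.
    return shard_for(parent_id)

def shard_for_child(child_id: str, parent_id: str) -> int:
    # A child is routed by its PARENT's ID; its own ID is ignored,
    # so it always lands on the same shard as its parent.
    return shard_for(parent_id)

same_shard = shard_for_child("page-7", "item-1") == shard_for_parent("item-1")
```

With parent and children on one shard, the join never has to cross shard boundaries.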

Limitations
has_child filter returns only parents, cannot return child data
But: has_parent filter
ElasticSearch caches the parent-child ID table in heap…

conclusions
Complex join-style queries can be done with ElasticSearch
Easily
Efficiently

SELECT * FROM ARTICLES
LEFT JOIN AUTHORS ON
  AUTHORS.ARTICLEID = ARTICLES.ID
WHERE
  ARTICLES.TITLE MATCHES "market elasticity"
  AND
  AUTHORS.LASTNAME MATCHES "Russell"
  AND
  AUTHORS.FIRSTNAME MATCHES "G"

Use “nested” types
If data can be duplicated
Very efficient

Use “parent/child” types
For real independently
updateable documents

conclusions
ElasticSearch rocks
Hides the complex mapping from JSON documents to Lucene's key/value model
Allows you to easily use more of Lucene greatness
So you can focus on actual queries and use cases

NoSql does not mean NoJoins
It just forces you to model your data in such a way that joins will be efficient

ElasticSearch “nested” types:
the best thing since sliced bread

thank you
anne@beyondtrees.com
@anneveling
