Schemaless Solr and the Solr Schema REST API


Steve Rowe
Twitter: @steven_a_rowe

Senior Software Engineer, LucidWorks

Who am I?
•  LucidWorks employee
•  Lucene/Solr committer since 2010
•  JFlex committer since 2008
•  Previously at the Center for Natural Language Processing at Syracuse University’s iSchool (School of Information)
•  Twitter: @steven_a_rowe

Schemaless Solr
•  As of version 4.4, Solr can operate in schemaless mode:
    –  No need to pre-configure fields in the schema
    –  As documents are indexed, previously unknown fields are automatically added to the schema
    –  Field types are auto-detected from a limited set of basic types:
        •  Long, Double, Boolean, Date, Text (default)
        •  All are multi-valued
    –  Works in standalone Solr and SolrCloud
•  Solr features used to implement schemaless mode:
    –  Managed schema
        •  Required for runtime schema modification
    –  Field value class guessing
        •  Parsers attempt to detect the Java class of String-valued field content
    –  Automatic schema field addition
        •  Java class(es) mapped to schema field type

The slide about the nature and utility of schemalessness
•  “Schemaless” does not mean that there is no schema
•  Search applications need schemas to support non-trivial document models
    –  No schema needed when there is only one field, or only one field type, i.e. all fields share:
        •  Document & query processing, including analysis
        •  Index features & format
        •  Similarity implementation
        •  (etc.)
    –  Otherwise, search apps need to manage per-field processing configuration (i.e. a schema) to consistently index documents and effectively serve queries
•  So what does “schemaless” mean for Solr?
    –  No up-front schema configuration required
    –  Schema discovery: document structure is either not fixed or not fully known

Dynamic fields
•  Convention over configuration
•  Glob-like patterns match field names with field types

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<fieldType name="int" class="solr.TrieIntField"
           precisionStep="0" positionIncrementGap="0"/>

•  Dynamic fields solve the problem of assigning field types to unknown fields by inferring a field’s type from its name
•  By contrast, Solr’s schemaless mode infers an unknown field’s type from its value or values
•  These two approaches are complementary
•  The Solr schemaless example defines a number of dynamic fields, including the *_i → int mapping above
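
The name-based inference described above can be illustrated with a minimal Python sketch. It is not Solr's implementation, and Solr's real matching rules may differ in detail. Only the `*_i` → `int` entry comes from the slide; the other two patterns and the function name `dynamic_field_type` are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern table. Only the "*_i" -> "int" entry appears in the
# slide; the other two entries are illustrative assumptions.
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_s", "string"),
    ("*_t", "text_general"),
]


def dynamic_field_type(field_name, patterns=DYNAMIC_FIELDS):
    """Return the type of the first pattern that matches field_name, else None."""
    for pattern, field_type in patterns:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


print(dynamic_field_type("sequence_i"))  # matched by the "*_i" pattern
print(dynamic_field_type("price"))       # no pattern matches this name
```

A name such as `price` matches no pattern, which is the case where value-based guessing in schemaless mode takes over.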

Schemaless mode example
From example/example-schemaless/solr/collection1/conf/schema.xml:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true"/>

From example/exampledocs/books.csv:

id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
0441385532,book,Jhereg,7.95,false,Steven Brust,Vlad Taltos,1,fantasy
...

$ cd example && java -Dsolr.solr.home=example-schemaless/solr -jar start.jar

$ cd exampledocs && java -Dtype=text/csv -jar post.jar books.csv

SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type text/csv..
POSTing file books.csv
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..
Time spent: 0:00:00.147

Schemaless mode example
$ curl http://localhost:8983/solr/schema/fields

{ "fields":[
    { "name":"_version_", "type":"long", "indexed":true, "stored":true },
    { "name":"author", "type":"text_general" },
    { "name":"cat", … },
    { "name":"id", "type":"string", "multiValued":false, "indexed":true,
      "required":true, "stored":true, "uniqueKey":true },
    { "name":"inStock", "type":"booleans" },
    { "name":"name", … },
    { "name":"price", "type":"tdoubles" }]}

Indexed document from books.csv:

id          cat   name    price  inStock  author        series_t     sequence_i  genre_s
0441385532  book  Jhereg  7.95   false    Steven Brust  Vlad Taltos  1           fantasy

From example/example-schemaless/solr/collection1/conf/schema.xml:

<fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
<fieldType name="tdoubles" class="solr.TrieDoubleField" precisionStep="8"
           positionIncrementGap="0" multiValued="true"/>

Managed schema
•  The schema resource is managed by Solr, rather than hand edited
•  On first startup, Solr auto-converts schema.xml to managed-schema
•  Managed schema format is currently XML, but may change in the future
•  XML comments don’t survive the conversion
•  mutable=true enables runtime schema modification
    –  Automatic schema field addition
    –  Schema REST API

From example/example-schemaless/solr/collection1/conf/solrconfig.xml:

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

conf/ before startup
currency.xml
elevate.xml
lang/
protwords.txt
schema.xml
solrconfig.xml
stopwords.txt
synonyms.txt

conf/ after startup
currency.xml
elevate.xml
lang/
managed-schema
protwords.txt
schema.xml.bak
solrconfig.xml
stopwords.txt
synonyms.txt

Field value class guessing
•  Unknown fields’ String-typed values are speculatively parsed
    –  Cascading parsers attempt to recognize field values
    –  On failure, the next one is tried
    –  First successful parse wins
•  Reconfigurable
    –  Integer parser could be swapped in for the Long parser, etc.
    –  Numeric parsers can take a locale for java.text.NumberFormat
    –  Date parser, implemented using Joda-Time, can be configured with other patterns, a locale, and/or a default time zone

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <arr name="format">
      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
      <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
      <str>yyyy-MM-dd'T'HH:mm:ss</str>
      <str>yyyy-MM-dd'T'HH:mmZ</str>
      <str>yyyy-MM-dd'T'HH:mm</str>
      <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
      <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
      <str>yyyy-MM-dd HH:mm:ss.SSS</str>
      <str>yyyy-MM-dd HH:mm:ss,SSS</str>
      <str>yyyy-MM-dd HH:mm:ssZ</str>
      <str>yyyy-MM-dd HH:mm:ss</str>
      <str>yyyy-MM-dd HH:mmZ</str>
      <str>yyyy-MM-dd HH:mm</str>
      <str>yyyy-MM-dd</str>
    </arr>
  </processor>
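
The cascade described on this slide (try each parser in turn, first successful parse wins) can be sketched in Python as follows. This is an illustration of the idea only, not Solr's code: the parser functions, the three date formats, and the name `guess_value` are assumptions, and Python's `int`, `float`, and `datetime` stand in for the Java classes. It also works on one value at a time, which may not match how the real processors treat multi-valued fields.

```python
from datetime import datetime


def parse_boolean(text):
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise ValueError(text)


def parse_long(text):
    return int(text.strip())


def parse_double(text):
    return float(text.strip())


def parse_date(text, formats=("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")):
    # Illustrative subset of formats; the configuration above lists many more.
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(text)


# Same order as the processors in the chain above.
PARSERS = [parse_boolean, parse_long, parse_double, parse_date]


def guess_value(text):
    """Return the result of the first parser that accepts text, else text itself."""
    for parser in PARSERS:
        try:
            return parser(text)
        except ValueError:
            continue
    return text


for raw in ["false", "1", "7.95", "2014-03-08", "Jhereg"]:
    value = guess_value(raw)
    print(raw, "->", type(value).__name__)
```

Order matters: placing the double parser before the long parser would turn "1" into a floating-point value.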

Automatic schema field addition
•  Field value classes are mapped to field types
•  First match wins
•  If none of the typeMapping entries match, the default field type is assigned
•  If a multi-valued field contains a mix of value classes, the first mapping that matches all values’ classes wins
•  The new field is added to the schema with the mapped field type
•  Reconfigurable

<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">text_general</str>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">tdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str>
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">tlongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdoubles</str>
  </lst>
</processor>
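
The mapping rules above (first match wins, a mapping must cover every value of a multi-valued field, otherwise the default applies) can be sketched in Python. Python types stand in for the Java value classes, and the names `TYPE_MAPPINGS`, `is_instance_of`, and `field_type_for` are assumptions for illustration, not Solr code.

```python
from datetime import datetime

DEFAULT_FIELD_TYPE = "text_general"

# Python stand-ins for the Java value classes in the configuration above,
# in the same order. (int, float) is a rough analog of java.lang.Number.
TYPE_MAPPINGS = [
    ((bool,), "booleans"),
    ((datetime,), "tdates"),
    ((int,), "tlongs"),
    ((int, float), "tdoubles"),
]


def is_instance_of(value, classes):
    """Like isinstance(), except that a bool only ever counts as a bool.

    Needed because Python's bool is a subclass of int, whereas
    java.lang.Boolean is not a java.lang.Number.
    """
    if isinstance(value, bool):
        return bool in classes
    return isinstance(value, classes)


def field_type_for(values):
    """Return the first mapped type whose classes cover every value.

    values is assumed to be a non-empty list of already-parsed field values.
    """
    for classes, field_type in TYPE_MAPPINGS:
        if all(is_instance_of(v, classes) for v in values):
            return field_type
    return DEFAULT_FIELD_TYPE


print(field_type_for([False]))      # single boolean value
print(field_type_for([1, 7.95]))    # mixed integer and floating-point values
print(field_type_for(["Jhereg"]))   # nothing matches, so the default applies
```

With the mixed list `[1, 7.95]`, the `tlongs` mapping is skipped because it does not cover `7.95`, and the broader numeric mapping wins.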

Schemaless mode limitations
•  Automatically adding new schema fields in production may not be a good idea
    –  Unwanted fields, e.g. field name typos, won’t trigger an error
•  First instance wins: field type detection can’t know about the full range of a field’s values
•  Wasted space: e.g. Longs are always used, when Integers might suffice
•  Limited gamut of detectable field types
•  Single analysis specification for text fields
•  Single processing model for all fields

Schema REST API: read-only
•  Each element of the schema is individually readable via the Schema REST API
•  Output format can be JSON or XML (wt request param)
•  Read-only elements:
    –  The entire schema
        •  In addition to JSON and XML output formats, output can also be in schema.xml format (?wt=schema.xml)
    –  All fields, or a specified set of them
    –  All dynamic fields, or a specified set of them
    –  All field types, or a specific one
    –  All copy field directives
    –  The schema name, version, uniqueKey, and default query operator
    –  The global similarity
•  Managed schema is not required to use the read-only schema REST API.

Schema REST API: read-only examples
$ SOLR=http://localhost:8983/solr/collection1

$ curl $SOLR/schema/dynamicfields/*_i

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "dynamicField":{
    "name":"*_i",
    "type":"int",
    "indexed":true,
    "stored":true}}

$ curl $SOLR/schema/uniquekey?wt=xml

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <str name="uniqueKey">id</str>
</response>

•  Schema REST API URLs employ the downcased form of all schema elements, but the responses use the same casing as schema.xml.
•  For full details on the Solr Schema REST API, see the Schema API section of the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Schema+API

Schema REST API: runtime schema modification
•  To enable schema modification via the schema REST API, the schema must be managed, and must be configured as mutable.
•  Schema modifications possible as of Solr 4.4:
    –  Fields may be added
        •  Copy field directives may optionally be added at the same time
    –  Copy field directives may be added
•  Works under both standalone Solr and SolrCloud
    –  Under SolrCloud, conflicting simultaneous requests are detected using a form of optimistic concurrency and automatically retried
•  Core/collection reload not required for schema modifications that are compatible with previously indexed documents
    –  Generally additions are not sources of schema incompatibility
•  Schema incompatibility-inducing operations will require core/collection reload:
    –  Modifying or removing (dynamic) fields or copy field directives
    –  Modifying all other schema elements
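
The slide does not say how Solr implements its conflict detection. The following Python sketch only shows the general shape of optimistic concurrency with automatic retry, using an in-memory stand-in for the shared schema. Every name here (`VersionedStore`, `VersionConflict`, `add_field`, the `before_write` hook) is hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """In-memory stand-in for a shared, versioned schema resource."""

    def __init__(self, fields=()):
        self.version = 0
        self.fields = tuple(fields)

    def read(self):
        return self.version, self.fields

    def write(self, expected_version, fields):
        # The write succeeds only if nobody else has written in the meantime.
        if expected_version != self.version:
            raise VersionConflict()
        self.fields = tuple(fields)
        self.version += 1


def add_field(store, name, max_retries=5, before_write=None):
    """Add a field name, re-reading and retrying when a conflict is detected."""
    for _ in range(max_retries):
        version, fields = store.read()
        if name in fields:
            return fields
        if before_write is not None:
            before_write()  # hook used below to simulate a concurrent writer
        try:
            store.write(version, fields + (name,))
            return store.read()[1]
        except VersionConflict:
            continue  # someone else changed the schema; start over
    raise RuntimeError("gave up after %d conflicting attempts" % max_retries)


store = VersionedStore(["id"])


def interfere_once():
    # Simulate another client adding a field just before our first write.
    if store.version == 0:
        store.write(0, store.fields + ("other",))


print(add_field(store, "claimid", before_write=interfere_once))
print(store.version)
```

The first attempt fails because the simulated concurrent writer bumps the version; the second attempt re-reads the current state and succeeds, so both additions survive.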

Schema REST API: add field example
$ SOLR=http://localhost:8983/solr/collection1

$ curl $SOLR/schema/fields/claimid -X PUT -H 'Content-type: application/json' --data-binary '
{
  "type":"string",
  "stored":true,
  "copyFields": [
    "claims",
    "all"
  ]
}'

•  The copyField destinations “claims” and “all” must already exist in the schema.
•  For full details on the Solr Schema REST API, see the Schema API section of the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Schema+API
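
For programmatic use, the same add-field request can be built with Python's standard library. The sketch below only constructs the request and prints it, so it runs without a Solr server. The base URL, field name, and JSON body are taken from the curl example above, while the helper name `build_add_field_request` is an assumption for illustration.

```python
import json
from urllib.request import Request

SOLR = "http://localhost:8983/solr/collection1"


def build_add_field_request(name, field_type, stored=True, copy_fields=()):
    """Build (but do not send) a PUT request that adds one field to the schema."""
    body = {"type": field_type, "stored": stored}
    if copy_fields:
        body["copyFields"] = list(copy_fields)
    return Request(
        url=f"{SOLR}/schema/fields/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )


request = build_add_field_request("claimid", "string", copy_fields=["claims", "all"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. That requires a running Solr instance whose schema is managed and mutable, and the copyField destinations must already exist.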

Schema REST API TODOs
•  https://issues.apache.org/jira/browse/SOLR-4898 is the umbrella JIRA issue under which further schema REST API work will be done, including:
    –  adding dynamic fields
    –  adding field types
    –  enabling wholesale replacement by PUTing a new schema
    –  modifying and removing fields, dynamic fields, field types, and copy field directives
    –  modifying all remaining aspects of the schema: Name, Version, Unique Key, Global Similarity, and Default Query Operator

Proposal: Schema Annotations
•  Add arbitrary metadata at the top level of the schema and at each leaf node
•  Allow read/write access to that metadata via the REST API
•  Use cases:
    –  Round-trippable documentation
        •  Conversion to managed schema format drops all comments
    –  Documentable tags
    –  When modifying the schema via REST API, a "last-modified" annotation could be automatically added
    –  User-level arbitrary key/value metadata
•  W3C XML Schema has a similar facility: http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-annotation

Schema Annotation example
<schema name="example" version="1.5">
  <annotation>
    <description element="tag" content="plain-numeric-field-types">
      Plain numeric field types store and index the text value verbatim.
    </description>
    <documentation element="copyField">
      copyField commands copy one field to another at the time a document is
      added to the index. It's used either to index the same field differently,
      or to add multiple fields to the same field for easier/faster searching.
    </documentation>
    <last-modified>2014-03-08T12:14:02Z</last-modified>
    …
  </annotation>
  …

  <fieldType name="pint" class="solr.IntField">
    <annotation>
      <tag>plain-numeric-field-types</tag>
    </annotation>
  </fieldType>
  <fieldType name="plong" class="solr.LongField">
    <annotation>
      <tag>plain-numeric-field-types</tag>
    </annotation>
  </fieldType>
  …
  <copyField source="cat" dest="text">
    <annotation>
      <todo>Copy to the catchall field?</todo>
    </annotation>
  </copyField>
  …
  <field name="text" type="text_general">
    <annotation>
      <description>catchall field</description>
      <visibility>public</visibility>
    </annotation>
  </field>

Summary
•  Schemaless Solr mode enables quick prototyping with minimal setup
•  Schema REST API provides programmatic read/write access to Solr’s schema
•  More elements writeable soon
•  Schema annotations would enable round-trippable documentation, tagging, and arbitrary user-provided metadata
