Solr 1.4 is better than ever! Read this white paper and learn about these new features, including:
* enhanced data import capabilities
* rich document handling
* speedier numeric range queries
* duplicate detection
* Java-based replication and deployment
* smarter handling of index changes
* faster faceting
* streamlined caching
Abstract
Apache Solr is the definitive application development implementation for Lucene, and it is the
leading open source search platform.
Solr 1.3 set a high bar for functionality, extensibility, and performance. As time marches on, Solr
committers and contributors have been hard at work to make a good thing even
better.
This white paper describes the new features and improvements in the latest version, Apache Solr
1.4. In the simplest terms, Solr is now faster and better than before. Central components of Solr
have been improved to cut the time needed for processing queries and indexing documents. The
goal: to provide a powerful, versatile search application server with ever better scalability,
performance and relevancy. New features include streamlined caching, smarter handling of index
changes, faster faceting, enhanced data import capabilities, speedier numeric range queries,
duplicate detection and more.
What’s New in Solr 1.4
A Lucid Imagination Technical White Paper • October 2009 Page iii
Table of Contents
Introduction
Performance Improvements
    Streamlined Caching
    Scalable Concurrent File Access
    Smarter Handling of Index Changes
    Faster Faceting
    Streaming Updates for SolrJ
    What Else Is New for Solr 1.4 Performance
Feature Improvements
    Solr Becomes an Omnivore
    DataImportHandler Enhancements
    Smoother Replication
    More Choices for Logging
    Multiselect Faceting
    Speedier Range Queries
    Duplicate Detection
    New Request Handler Components
    What Else Is New with Solr 1.4 Features
Get Started & Resources
Next Steps
APPENDIX: Choosing Lucene or Solr
Introduction
Apache Solr is the definitive application development implementation for Apache Lucene,
and it is the leading open source search platform. If you imagine Lucene as a high-
performance race car engine, then Solr is all the things that make that engine usable, such
as a chassis, gas pedal, steering wheel, seat, and much more.
Solr makes it easy to develop sophisticated, fast search applications with advanced features
such as faceting. Solr builds on another open source search technology, Lucene, which
provides indexing and search technology, as well as spellchecking, hit highlighting, and
advanced processing capabilities. Both Solr and Lucene are developed at the Apache
Software Foundation.
Lucene currently ranks among the top 15 open source projects and is one of the top 5
Apache projects, with installations at over 4,000 companies. Lucene and Solr downloads
have grown nearly tenfold over the past three years; Solr is the fastest-growing Lucene
subproject. Lucene and Solr offer an attractive alternative to proprietary licensed search and discovery software.[1]
Solr 1.3 set a high bar for functionality, extensibility, and performance. As time marches on,
Solr engineers have been hard at work making a good thing even better. This white paper
describes the new features and
improvements in the latest
version, Solr 1.4. In the
simplest terms, Solr is now
faster and better than before.
Central components of Solr
have been improved to cut the
time needed for processing
queries and indexing
documents. Many new features have been added, all with the goal of providing users with the information they want as fast as possible.

[1] See the Appendix for a discussion of when to choose Lucene or Solr.
Performance Improvements
Solr 1.4 increases Solr’s speed with numerous improvements in key areas. Some of these
enhancements are high-performance replacements for standard off-the-shelf Java platform
components. Much as a car hobbyist replaces stock parts of an engine, the architects and
programmers working on Solr have replaced crucial components to make Solr 1.4 run
faster than ever for many common operations.
Streamlined Caching
Solr caches data from its index as an optimization, because reading from memory is always
faster than reading from the file system. Over the duration of a single faceting request, the
cache might be accessed hundreds or even thousands of times. Previously, the cache
implementation was a synchronized LinkedHashMap from the Java platform API.
Solr 1.4 uses a new class, ConcurrentLRUCache, which is specifically designed to
minimize the overhead of synchronization. Anecdotal evidence suggests that this
implementation can double query throughput in some circumstances.
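In solrconfig.xml, the new implementation is exposed as the cache class solr.FastLRUCache, which is backed by ConcurrentLRUCache. As a hedged sketch (the size values here are illustrative, not recommendations), a filter cache using it can be declared like this:

```xml
<!-- solrconfig.xml: filter cache backed by ConcurrentLRUCache -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>
```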
Scalable Concurrent File Access
In the past, Solr used the Java platform’s RandomAccessFile to read data from index
files. Reading a portion of a file involves calling seek() to find the right part of the file, and
read() to actually retrieve the data.
Multithreaded access to the same file has meant that the seek() and read() pairs must
be synchronized. If the data to be read isn’t already in the operating system cache, things
get worse: the synchronization causes all other reading threads to wait while the data is
retrieved from disk.
The Java Nonblocking Input/Output (NIO) API offers a much better solution. NIO’s
FileChannel includes a read() method that, in essence, performs a seek() and a
read() in a single operation.
public int read(ByteBuffer dst, long position)
Solr 1.4 uses this NIO method (via Lucene’s NIOFSDirectory) to read index files.[2]
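To make the contrast concrete, here is a minimal, self-contained Java sketch of a positional read (illustrative code only, not Solr's implementation): the file offset is passed directly to read(), so no shared file position has to be guarded with synchronization.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalRead {
    // Read `length` bytes starting at `position`, with no seek() call
    // and no lock around a shared file pointer.
    static byte[] readAt(Path file, long position, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Each call is independent; the channel's own position is untouched.
            while (buf.hasRemaining() && ch.read(buf, position + buf.position()) >= 0) {
                // loop until the requested range is filled or EOF is reached
            }
        }
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("nio-demo", ".bin");
        Files.write(tmp, "hello solr index".getBytes());
        System.out.println(new String(readAt(tmp, 6, 4))); // prints "solr"
    }
}
```

Lucene’s NIOFSDirectory applies the same idea to index files; the sketch above only demonstrates the underlying JDK call.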
Smarter Handling of Index Changes
Solr generally keeps a big pile of documents in an existing index. New documents are
periodically added, but usually the number of new documents is small compared with the
size of the index. Solr (via Lucene) stores the index as a collection of segments; as new
documents are added, most of the segments will remain unchanged.
Solr 1.4 is very much aware that, for the most part, index segments don’t change.
Consequently, Solr is much smarter about reusing unchanged segments, which results in
less memory churn, less disk access, and better performance.
[Figure: reopen() builds a new view of the index that reuses the unchanged index segments on disk.]
[2] On Windows, the older RandomAccessFile implementation is used because of a bug in the Windows NIO implementation.
One example is reloading an index. Previously, the entire index was loaded again, which is
expensive in time and resources. Now, Solr 1.4 is smart enough to reuse index segments
that haven’t changed, resulting in a much more efficient reload of a modified index.
This means that adding new documents to an index and making them available comes at a
lower resource cost. The figure above illustrates the mechanism.
Many other optimizations have been made with respect to index segments. The field cache,
for example, is now split so there is one field cache per segment. Again, this results in much
more efficient processing of index updates, because the field caches for every unchanged
segment do not need to be touched.
Faster Faceting
One of Solr’s killer features is faceting, the ability to quickly narrow and drill down into
search results by categories. Solr uses UnInvertedField to keep a mapping between
documents and field values so it can provide faceting information in response to queries.
For multivalued fields, Solr 1.4 includes a new implementation of UnInvertedField that
can be 50 times faster and 5 times smaller than its predecessor. Single value fields still use
either the enum or fieldcache method.
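At query time, the faceting method can be selected with the facet.method request parameter (enum or fc); as a hedged illustration, this forces the field-cache method for one field via a per-field override (the field name is made up):

```
facet=true&facet.field=category&f.category.facet.method=fc
```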
Streaming Updates for SolrJ
SolrJ is the API that Java client applications use to work with Solr. The Solr 1.4 version of
SolrJ includes an optimized implementation, StreamingUpdateSolrServer, which is
useful for indexing many documents at a time.
For bulk updates, consider switching to the new implementation: in one simple test, the number of documents indexed per second jumped from 231 to 25,000.
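A hedged sketch of what bulk indexing with the new class might look like in SolrJ (the URL, queue size, thread count, and field names are illustrative, and a running Solr instance plus the SolrJ libraries are assumed):

```java
// Sketch only: buffers documents on an internal queue and streams
// them to Solr over several background connections.
StreamingUpdateSolrServer server =
    new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);

for (int i = 0; i < 100000; i++) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", Integer.toString(i));
    doc.addField("name", "document " + i);
    server.add(doc);   // returns quickly; writes happen in the background
}
server.commit();       // flush and commit the batch
```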
What Else Is New for Solr 1.4 Performance
In addition to these important performance enhancements in Solr 1.4, there are several
more, including:
• A binary format for updates, much more compact than XML, is now available for SolrJ.
• OmitTermFreqAndPositions can be applied to a field so that Solr does not compute the number of terms and list of positions for that field, which saves time and space for nontext fields.
• Queries that don’t sort by score can eliminate scoring, which speeds up queries.
• Filters now apply before the main query, which makes queries 300% faster in some cases.
• There is a new filter implementation for small result sets, which is smaller and faster.
Feature Improvements
Aside from performance improvements, Solr 1.4 sports a variety of great new features. As
an open source project, Solr 1.4 is largely created by the people who use it, so the new
features are the ones that the community cares about most passionately.
Solr Becomes an Omnivore
Solr can’t give you good results unless you give it good data. Normally you feed Solr XML
documents corresponding to the structure of your schema. This works fine, and if all your
data consists of XML documents, they can be fed directly to Solr or easily transformed to
the correct input.
Of course, reality is always messy. Chances are that many documents you want to include in
your Solr index are in other file formats, like PDF or Microsoft Word. Fortunately, Solr 1.4
knows how to deal with the mess.
Solr 1.4 can now ingest these other types of documents using a feature called Solr Cell.[3] Solr
Cell uses another open source project, Tika, to read documents in a variety of formats and
convert them to an XHTML stream. Solr parses the stream to produce a document, which is
then indexed.
Here are a few of the formats that Tika understands:
• PDF
• OpenDocument (OpenOffice formats)
• Microsoft OLE 2 Compound Document (Word, PowerPoint, Excel, Visio, etc.)
• HTML
• RTF
• gzip
• ZIP
• Java Archive (JAR) files
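As a hedged illustration, a PDF could be sent to Solr Cell with an HTTP client such as curl (the host, document id, and file name are made up; this assumes the extracting handler is registered at /update/extract, as in the example solrconfig.xml):

```
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
     -F "myfile=@invoice.pdf"
```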
DataImportHandler Enhancements
DataImportHandler knows how to index data pulled from relational databases or XML
files. The details of what is indexed and how it happens are configured in solrconfig.xml.
Solr 1.4 contains some extremely useful upgrades to DataImportHandler.
The first is the ability to push data into DataImportHandler. In Solr 1.3, DataImportHandler was pull-only, so the only possible way to push data to Solr was to use the update XML or CSV format, which meant you couldn’t take advantage of any of DataImportHandler’s capabilities. In Solr 1.4, a new component called ContentStreamDataSource allows you to use DataImportHandler’s features for indexing content.
Another powerful enhancement in Solr 1.4 is the ability to listen for import events. All you
need to do is provide an implementation of the EventListener interface and let Solr know about it in solrconfig.xml. When importing begins and ends, your listener will be notified.

[3] The name is based on the acronym Content Extraction Library (CEL). This feature is also known by its more technical name, ExtractingRequestHandler.
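A hedged sketch of what registering import event listeners can look like in the DataImportHandler configuration (the class names are illustrative placeholders; each class implements the EventListener interface):

```xml
<document onImportStart="com.example.ImportStartListener"
          onImportEnd="com.example.ImportEndListener">
  <!-- entity definitions as usual -->
</document>
```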
Solr 1.4 also brings the ability to control error handling in DataImportHandler. For
each entity, you can control what happens when an error occurs via solrconfig.xml. The
choices for error handling are as follows:
• abort: The import is stopped and all changes are rolled back.
• skip: The current document is skipped.
• continue: The import continues as if the error did not occur.
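The choice is made per entity with the onError attribute in the DataImportHandler configuration; as a hedged sketch (the entity name and query are illustrative):

```xml
<entity name="item" query="select * from item" onError="skip">
  <!-- field mappings as usual -->
</entity>
```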
DataImportHandler contains many more enhancements and optimizations in Solr 1.4,
including new data sources, new entity processors, and new transformers.
Smoother Replication
Replication is a fancy name for making a copy of a Solr index, which at its heart is just a
matter of copying files. Making copies of an index is useful for two reasons. The first is
simply to create a backup. The second reason is to place the same index on multiple Solr
servers, which is necessary if you want to distribute incoming requests to improve
performance.
Prior to Solr 1.4, replication was implemented with shell scripts, and consequently worked effectively only on platforms with a shell, such as Linux. It relied on the Unix rsync utility and on the operating system providing hard links, which could require cumbersome scripting and excluded tiered deployments on Windows platforms.
In Solr 1.4, replication has been abstracted and implemented entirely at the Java platform
layer, which means it will work (and work the same) wherever the Java platform runs. This
is great news for anyone using Solr because it means that backups can be performed in the
same way on a Solr instance, regardless of hardware or operating system, and it means that
configuring replication across multiple Solr instances is similarly uniform. Replication does not require making a backup first; the index is copied from one live index to another.
Replication and backups are configured in solrconfig.xml. If you just want to make a backup, add a couple of lines; you can choose to back up on Solr startup or after every commit or optimize. In addition, you can use an HTTP command to request a backup at any time.
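For example, an on-demand backup can be requested from the replication handler with a URL like this (hostname and port are illustrative):

```
http://localhost:8983/solr/replication?command=backup
```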
If you need to replicate an index across multiple servers, the configuration is pretty simple.
Set it up on the master server’s solrconfig.xml like this:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="replicateAfter">commit</str>
<str name="confFiles">schema.xml,stopwords.txt</str>
</lst>
</requestHandler>
You can choose to replicate on startup, after commits, or after optimization. The
confFiles element specifies configuration files you want to replicate to slaves.
Once the server configuration is done, point the slaves at the master, something like this:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="slave">
<str name="masterUrl">
http://masterhostname:8983/solr/replication
</str>
<str name="pollInterval">00:00:60</str>
</lst>
</requestHandler>
The slaves periodically query the master to see if the index has changed. If so, they pull
down the changes and apply them. That’s all!
More Choices for Logging
Logging is a crucial capability in a server application. Administrators examine logs to
monitor Solr instances and figure out how to make them run optimally. Up until now, Solr
used the logging facility included with the Java Development Kit (JDK).
Solr 1.4 uses a more flexible logging framework, SLF4J. SLF4J can bind to several logging
implementations, including log4j, Jakarta Commons Logging (JCL), and JDK logging. This
binding can be changed at runtime simply by switching JAR files around.
This is the best possible kind of upgrade. The default configuration, binding SLF4J to JDK
logging, provides the same functionality as previous releases of Solr. However, you now
have the option of easily plugging in log4j or JCL if you prefer.
Multiselect Faceting
Faceting is the ability to group search results by certain fields. Solr 1.4 adds support for multiselect faceting, which lets users narrow search results by several facet values at once while facet counts are computed as though a given filter had not been applied.
Solr’s support is generic and includes the ability to tag filters and to exclude filters by tag
when faceting. A sample query string might look like this:
q=index replication&facet=true
&fq={!tag=proj}project:(lucene OR solr)
&facet.field={!ex=proj}project
&facet.field={!ex=src}source
To see this in action, check out the search facility that Lucid Imagination provides to search
technical knowledge resources on Solr along with Lucene and all its subprojects:
http://search.lucidimagination.com/.
Speedier Range Queries
Solr can process queries that include numeric ranges, which means it can answer questions
like “Which hats are between size 56 and 64?” and “Which swimming pools are less than 10
meters long?”
In Solr 1.4, range queries can now use a prefix tree, or trie. Numbers are placed into the tree based on their digits, which makes range queries faster than comparing each complete number. Thus, for example, 175 is indexed as hundreds:1 tens:17 ones:175. Trie-based range queries have been observed to run up to 40 times faster than standard range queries.
To take advantage of fast range queries, use the TrieField type in your schema. The
implementation takes care of the details, and you will notice that range queries are
significantly faster.
In a prefix tree, the leaves hold the actual term values, and all descendants of a node share a common prefix associated with that node. Retrieving a range such as 215 to 977 touches only the relevant nodes of the tree.
Let’s look at another example, this time in the schema. The type attribute in the schema’s
field type declaration tells Solr which numeric type you will represent with TrieField.
Here are a few declarations that show how to use TrieField for various numeric types:
<fieldType name="tint" class="solr.TrieField" type="integer"
           omitNorms="true" positionIncrementGap="0"
           indexed="true" stored="false"/>
<fieldType name="tlong" class="solr.TrieField" type="long"
           omitNorms="true" positionIncrementGap="0"
           indexed="true" stored="false"/>
<fieldType name="tdouble" class="solr.TrieField" type="double"
           omitNorms="true" positionIncrementGap="0"
           indexed="true" stored="false"/>
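A field declared with one of these trie types is then searched with ordinary range syntax; for instance, assuming an illustrative weight field of type tdouble:

```
q=weight:[20.5 TO 34.9]
```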
Duplicate Detection
With large sets of documents to be indexed, it is important to detect documents that are
identical or nearly identical so that the document only gets added to the index once.
Solr 1.4 offers this capability, named document duplicate detection or deduplication. The
more technical name is SignatureUpdateProcessor.
SignatureUpdateProcessor creates a message digest or hash value from some or all of
the fields of a document. The hash value acts like a fingerprint for the document and can be
quickly compared to the hash values for other documents.
Several hashing algorithms are available: MD5Signature and Lookup3Signature are
both useful for exact matching, while TextProfileSignature (from the Apache Nutch
project) is a fuzzy hashing implementation to detect documents that are nearly equivalent.
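A hedged solrconfig.xml sketch of a deduplication update chain (the field list and signature field name are illustrative; the signature field must exist in your schema):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```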
New Request Handler Components
New request handler components are now available in Solr 1.4:
• ClusteringComponent uses Carrot2 to dynamically cluster the top N search results, something like dynamically discovered facets.
• TermsComponent returns indexed terms and document frequency in a field, useful for auto-suggest, etc.
• TermVectorComponent returns term information per document (term frequency, positions).
• StatsComponent computes statistics on numeric fields: min, max, sum, sumOfSquares, count, missing, mean, stddev.
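As one hedged example, TermsComponent can back an auto-suggest box with a request like this (the field name is illustrative; this assumes the component is exposed at /terms as in the example configuration):

```
http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=so
```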
What Else Is New with Solr 1.4 Features
Solr 1.4 has many other new features. A few of them are listed here:
• Ranges over arbitrary functions: {!frange l=1 u=2}sqrt(sum(a,b))
• Nested queries, for function queries too
• solrjs: JavaScript client library
• commitWithin: doc must be committed within x milliseconds
• Binary field type
• Merge one index into another
• SolrJ client for load balancing and failover
• Field globbing for some params: hl.fl=*_text
• Doublemetaphone, Arabic stemmer, etc.
• VelocityResponseWriter: template responses using Velocity
Get Started & Resources
http://www.lucidimagination.com/blog/2009/02/05/looking-forward-to-new-features-
in-solr-14/
http://wiki.apache.org/solr/SolrReplication
http://wiki.apache.org/solr/ExtractingRequestHandler
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-
Extraction-Tika
http://www.lucidimagination.com/blog/tag/range-queries/
http://www.slf4j.org/manual.html
http://wiki.apache.org/solr/Deduplication
http://shalinsays.blogspot.com/2009/09/whats-new-in-dataimporthandler-in-solr.html
Next Steps
For more information on how Lucid Imagination can help your employees, customers, and
partners find the information they need more quickly, effectively, and at lower cost, please
visit http://www.lucidimagination.com/ to access blog posts, articles, and reviews of
dozens of successful implementations.
Certified Distributions from Lucid Imagination are complete, supported bundles of
software which include additional bug fixes, performance enhancements, along with our
free 30-day Get Started program. Coupled with one of our support subscriptions, a Certified
Distribution can provide a complete environment to develop, deploy, and maintain
commercial-grade search applications. Certified Distributions are available at
www.lucidimagination.com/Downloads.
Please e-mail specific questions to:
Support and Service: support@lucidimagination.com
Sales and Commercial: sales@lucidimagination.com
Consulting: consulting@lucidimagination.com
Or call: 1.650.353.4057
APPENDIX: Choosing Lucene or Solr
The great improvements in the capabilities of Lucene and Solr open source search
technology have created rapidly growing interest in using them as alternatives to other
search applications. As is often the case with open-source technology, online community
documentation provides rich details on features and variations, but does little to provide
explicit direction on which technologies would be the best choice. So when is Lucene
preferable to Solr and vice versa?
There is in fact no single answer, as Lucene and Solr bring very similar underlying
technology to bear on somewhat distinct problems. Solr is versatile and powerful, a full-
featured, production-ready search application server requiring little formal software
programming. Lucene presents a collection of directly callable Java libraries, with fine-
grained control of machine functions and independence from higher-level protocols.
In choosing which might be best for your search solution, the key questions to consider are
application scope, deployment environment, and software development preferences.
If you are new to developing search applications, you should start with Solr. Solr provides
scalable search power out of the box, whereas Lucene requires solid information retrieval
experience and some meaningful heavy lifting in Java to take advantage of its capabilities.
In many instances, Solr doesn’t even require any real programming.
Solr is essentially the “serverization” of Lucene, and many of its abstract functions are
highly similar, if not just the same. If you are building an app for the enterprise sector, for
instance, you will find Solr an almost 100% match to your business requirements: it comes
ready to run in a servlet container such as Tomcat or Jetty, and ready to scale in a
production Java environment. Its RESTful interfaces and XML-based configuration files can
greatly accelerate application development and maintenance. In fact, Lucene programmers
have often reported that they find Solr to contain “the same features I was going to build
myself as a framework for Lucene, but already very-well implemented.” Once you start
with Solr, and you find yourself using a lot of the features Solr provides out of the box, you
will likely be better off using Solr’s well-organized extension mechanisms instead of
starting from scratch using Apache Lucene.
If, on the other hand, you don’t want to make any calls via HTTP, and want to have all of
your resources controlled exclusively by Java API calls that you write, Lucene may be a
better choice. Lucene works best when constructing and embedding a state-of-the-art
search engine, allowing programmers to assemble and compile inside a native Java application. Some programmers set aside the convenience of Solr in order to more directly
control the large set of sophisticated features with low-level access, data, or state
manipulation, and choose Lucene instead, for example with byte-level manipulation of
segments or intervention in data I/O. Investment at the lower level enables development of
extremely sophisticated, cutting edge text search and retrieval capabilities.
As for features, the latest version of Solr generally encapsulates the latest version of
Lucene. As the two are in many ways functional siblings, gaining a solid understanding of how Lucene works internally can help you understand Apache Solr and its extension of Lucene’s workings.
No matter which you choose, the power of open source search is yours to harness. More
information on both Lucene and Solr can be found at http://www.lucidimagination.com.