Letting In the Light: Using Solr as an External Search Component
* Jay Luker, IT Specialist, ADS, jluker@cfa.harvard.edu
* Benoit Thiell, software developer, ADS, bthiell@cfa.harvard.edu
Code4Lib 2011, Tuesday 8 February, 14:30 - 14:50
It’s well-established that Solr provides an excellent foundation for building a faceted search engine. But what if your application’s foundation has already been constructed? How do you add Solr as a federated, fulltext search component to an existing system that already provides a full set of well-crafted scoring and ranking mechanisms?
This talk will describe a work-in-progress project at the Smithsonian/NASA Astrophysics Data System to migrate its aging search platform to Invenio, an open-source institutional repository and digital library system originally developed at CERN, while at the same time incorporating Solr as an external component for both faceting and fulltext search.
In this presentation we'll start with a short introduction of Invenio and then move on to the good stuff: an in-depth exploration of our use of Solr. We'll explain the challenges that we faced, what we learned about some particular Solr internals, interesting paths we chose not to follow, and the solutions we finally developed, including the creation of custom Solr request handlers and query parser classes.
This presentation will be quite technical and will show a measure of horrible Java code. Benoit will probably run away during that part.
1. Letting In The Light Using Solr as an External Search Component Jay Luker Benoit Thiell SAO/NASA Astrophysics Data System http://adsabs.harvard.edu/
20. 2000: Extension of the server to allow storing multimedia content (photos, posters, brochures, videos) and creation of the open-source CDSware project
The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant.
In 1994 the ADS moved to the web.
* Astronomy: 1.8M records; Physics: 5.8M; arXiv e-prints: 650K
* Citations: 40M (over 3.4M papers with citations)
* Curated links: 23M (fulltext, data products, citations)
* 4M scanned pages, 625K articles; 650K pages of historical material
* Advanced search allows searching by astronomical object (via SIMBAD) and by attributes like "has dataset"
* TWITA = The Website Is The API: via a data_type=<foo> parameter, and also via structured metadata within the pages
INSPIRE: Invenio for SPIRES, the Physics database at Stanford.
Obviously, performance was also an objective. The Invenio team had been skeptical of the need to incorporate an external tool/service for fulltext indexing and/or faceting, but once introduced to Solr they quickly came around. Even though at least some of the fancypants sorting, ranking, and filtering functionality could most likely be reproduced in Solr, there was a strong reluctance to rewrite that code. Writing as little Java as possible isn't just a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components. Rely as much as possible on what Solr affords out of the box. Loose integration here means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we achieved that towards the end.
When we talk about the ids being sent back and forth between Invenio and Solr, we mean the schema ids.
So what's going on here? Our first thought was maybe it was the time needed to serialize/de-serialize the response, but that turned out not to be it.
No need to specify number of rows or which fields to return
Post-processing = 2nd-order searching and filtering. We can't retrieve facets with the initial query because the final list of search results depends on Invenio post-processing. So how do you send a very large set of ids to get a set of facet results?
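Because the id set can run to millions of records, it won't fit in a query string; the natural move is to ship it in the body of a POST to Solr. A minimal sketch of that idea follows; the handler path (`/bitset_facets`) and parameter names are hypothetical stand-ins, not the actual ADS implementation.

```python
# Sketch: sending a large id set to Solr in a POST body rather than the
# query string. The handler path and parameter names are illustrative;
# the real system used a custom Solr request handler for this.
from urllib.request import Request

def build_facet_request(solr_url, ids, facet_field):
    """Pack a (possibly huge) id set into the body of a POST request."""
    body = ",".join(str(i) for i in sorted(ids)).encode("utf-8")
    url = (f"{solr_url}/bitset_facets"
           f"?facet.field={facet_field}&facet.limit=20")
    # Supplying a body makes urllib issue a POST, avoiding URL-length limits.
    return Request(url, data=body, headers={"Content-Type": "text/plain"})

req = build_facet_request("http://localhost:8983/solr", {42, 7, 9001}, "author")
# req.data holds b"7,42,9001"
```

In practice the body would be the compressed bitset itself rather than a comma-separated list, but the transport problem and its solution are the same.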
Satisfies almost all objectives. We get searching and faceting; we don't have to write a lot of Python or Java (Invenio handles the indexing piece); we're not duplicating anything Invenio already does very well; and it's loosely coupled, because communication is in a form native to Invenio, so we could easily swap in/out different services for either piece.
Seems like a lot, but in total lines of code it's not that much, especially considering it's Java. Plus, I suck at Java and was still able to do it all in 2-3 weeks of trial-and-error hacking. And it all conforms very closely to the affordances of the Solr API; only one small thing might be considered a "hack".
Defining our custom query component and telling the default Solr search handler to use it; also defining our custom response writer.
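This kind of wiring lives in `solrconfig.xml`. A sketch of what it might look like is below; the class names are illustrative placeholders, not the actual ADS code.

```xml
<!-- Hypothetical solrconfig.xml wiring: register a custom query
     component and a custom response writer, then point the default
     search handler's component chain at them. Class names are
     placeholders, not the real ADS implementation. -->
<searchComponent name="invenio-query"
                 class="org.example.InvenioQueryComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="components">
    <str>invenio-query</str>
    <str>facet</str>
    <str>debug</str>
  </arr>
</requestHandler>

<!-- response writer that streams the hit list back as a compressed bitset -->
<queryResponseWriter name="bitset"
                     class="org.example.BitSetResponseWriter"/>
```

Listing an explicit `components` array replaces the handler's default component chain, which is how a custom component can stand in for the stock query component while keeping faceting and debugging intact.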
A query component class has two opportunities to interact with the incoming request: prepare & process. We only need process.
These times include decompressing and unmarshalling the bitset into an Invenio intbitset object in Python.
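To make that step concrete, here is a pure-Python stand-in for the decompress-and-unmarshal round trip. Invenio's intbitset is a C extension with its own wire format, so this is only an assumed, simplified illustration of the idea (zlib-compressed bitmap of record ids), not Invenio's actual format.

```python
# Sketch of the wire-format idea: record ids as a zlib-compressed bitmap.
# This is a simplified stand-in for Invenio's intbitset, which is a C
# extension with its own (different) serialization.
import zlib

def pack_bitset(ids):
    """Marshal a set of non-negative ids as a zlib-compressed bitmap."""
    buf = bytearray(max(ids) // 8 + 1)
    for i in ids:
        buf[i // 8] |= 1 << (i % 8)
    return zlib.compress(bytes(buf))

def unpack_bitset(blob):
    """Inverse: decompress the bitmap and recover the id set."""
    buf = zlib.decompress(blob)
    return {byte_idx * 8 + bit
            for byte_idx, b in enumerate(buf)
            for bit in range(8) if b & (1 << bit)}

ids = {3, 17, 4242, 1_000_000}
assert unpack_bitset(pack_bitset(ids)) == ids
```

A bitmap like this stays compact even for millions of ids, which is why it beats serializing the hit list as text.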
PyLucene is a Python wrapper around Java Lucene. It embeds a Java VM with Lucene into a Python process. The extension is machine-generated with JCC, a C++ code generator that makes it possible to call into Java classes from Python via the Java Native Interface (JNI).