Letting In the Light: Using Solr as an External Search Component
* Jay Luker, IT Specialist, ADS, jluker@cfa.harvard.edu
* Benoit Thiell, software developer, ADS, bthiell@cfa.harvard.edu
Code4Lib 2011, Tuesday 8 February, 14:30 - 14:50
It’s well-established that Solr provides an excellent foundation for building a faceted search engine. But what if your application’s foundation has already been constructed? How do you add Solr as a federated, fulltext search component to an existing system that already provides a full set of well-crafted scoring and ranking mechanisms?
This talk will describe a work-in-progress project at the Smithsonian/NASA Astrophysics Data System to migrate its aging search platform to Invenio, an open-source institutional repository and digital library system originally developed at CERN, while at the same time incorporating Solr as an external component for both faceting and fulltext search.
In this presentation we'll start with a short introduction of Invenio and then move on to the good stuff: an in-depth exploration of our use of Solr. We'll explain the challenges that we faced, what we learned about some particular Solr internals, interesting paths we chose not to follow, and the solutions we finally developed, including the creation of custom Solr request handlers and query parser classes.
This presentation will be quite technical and will show a measure of horrible Java code. Benoit will probably run away during that part.
1. Letting In The Light Using Solr as an External Search Component Jay Luker Benoit Thiell SAO/NASA Astrophysics Data System http://adsabs.harvard.edu/
20. 2000: Extension of the server to allow storing multimedia content (photos, posters, brochures, videos) and creation of the open-source CDSware project
The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant.
In 1994 the ADS moved to the web.
* Astronomy: 1.8M records; Physics: 5.8M; arXiv e-prints: 650K
* Citations: 40M (over 3.4M papers with citations)
* Curated links: 23M (fulltext, data products, citations)
* 4M scanned pages, 625K articles; 650K pages of historical material
* Advanced search allows searching by astronomical object (via SIMBAD) and by attributes like "has dataset"
* TWITA = The Website Is The API: via a data_type=<foo> parameter, and also via structured metadata within the pages
INSPIRE: Invenio for SPIRES, the Physics database at Stanford.
Obviously, performance was also an objective. The Invenio team had been skeptical of the need to incorporate an external tool/service for fulltext indexing and/or faceting, but once introduced to Solr they quickly came around. Even though at least some of the fancypants sorting, ranking, and filtering functionality could most likely be reproduced in Solr, there was a strong reluctance to rewrite that code. Writing as little Java as possible isn't just a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components. Rely as much as possible on what Solr affords out of the box. Loose integration here means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we achieved that towards the end.
When we talk about the ids being sent back and forth between Invenio and Solr, we mean the schema ids.
So what's going on here? Our first thought was maybe it was the time needed to serialize/de-serialize the response, but that turned out not to be it.
No need to specify number of rows or which fields to return
Post-processing = 2nd-order searching and filtering. We can't retrieve facets with the initial query because the final list of search results depends on Invenio post-processing. So how do you send a very large set of ids to get a set of facet results?
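Because the id set can run to millions of records, it won't fit in a query string; the natural move is to ship it in the body of a POST to Solr. A minimal sketch of that idea follows; the handler path (`/bitset_facets`) and parameter names are hypothetical stand-ins, not the actual ADS implementation.

```python
# Sketch: sending a large id set to Solr in a POST body rather than the
# query string. The handler path and parameter names are illustrative;
# the real system used a custom Solr request handler for this.
from urllib.request import Request

def build_facet_request(solr_url, ids, facet_field):
    """Pack a (possibly huge) id set into the body of a POST request."""
    body = ",".join(str(i) for i in sorted(ids)).encode("utf-8")
    url = (f"{solr_url}/bitset_facets"
           f"?facet.field={facet_field}&facet.limit=20")
    # Supplying a body makes urllib issue a POST, avoiding URL-length limits.
    return Request(url, data=body, headers={"Content-Type": "text/plain"})

req = build_facet_request("http://localhost:8983/solr", {42, 7, 9001}, "author")
# req.data holds b"7,42,9001"
```

In practice the body would be the compressed bitset itself rather than a comma-separated list, but the transport problem and its solution are the same.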
Satisfies almost all objectives. We get searching and faceting; we don't have to write a lot of Python or Java (Invenio handles the indexing piece); we're not duplicating anything Invenio already does very well; and it's loosely coupled, because communication is in a form native to Invenio, so we could easily swap in/out different services for either piece.
Seems like a lot, but in total lines of code it's not that much, especially considering it's Java. Plus, I suck at Java and was still able to do it all in 2-3 weeks of trial-and-error hacking. And it all conforms very closely to the affordances of the Solr API; only one small thing might be considered a "hack".
Defining our custom query component and telling the default Solr search handler to use it; also defining our custom response writer.
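This kind of wiring lives in `solrconfig.xml`. A sketch of what it might look like is below; the class names are illustrative placeholders, not the actual ADS code.

```xml
<!-- Hypothetical solrconfig.xml wiring: register a custom query
     component and a custom response writer, then point the default
     search handler's component chain at them. Class names are
     placeholders, not the real ADS implementation. -->
<searchComponent name="invenio-query"
                 class="org.example.InvenioQueryComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="components">
    <str>invenio-query</str>
    <str>facet</str>
    <str>debug</str>
  </arr>
</requestHandler>

<!-- response writer that streams the hit list back as a compressed bitset -->
<queryResponseWriter name="bitset"
                     class="org.example.BitSetResponseWriter"/>
```

Listing an explicit `components` array replaces the handler's default component chain, which is how a custom component can stand in for the stock query component while keeping faceting and debugging intact.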
A query component class has two opportunities to interact with the incoming request: prepare & process. We only need process.
These times include decompressing and unmarshalling the bitset into an Invenio intbitset object in Python.
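To make that step concrete, here is a pure-Python stand-in for the decompress-and-unmarshal round trip. Invenio's intbitset is a C extension with its own wire format, so this is only an assumed, simplified illustration of the idea (zlib-compressed bitmap of record ids), not Invenio's actual format.

```python
# Sketch of the wire-format idea: record ids as a zlib-compressed bitmap.
# This is a simplified stand-in for Invenio's intbitset, which is a C
# extension with its own (different) serialization.
import zlib

def pack_bitset(ids):
    """Marshal a set of non-negative ids as a zlib-compressed bitmap."""
    buf = bytearray(max(ids) // 8 + 1)
    for i in ids:
        buf[i // 8] |= 1 << (i % 8)
    return zlib.compress(bytes(buf))

def unpack_bitset(blob):
    """Inverse: decompress the bitmap and recover the id set."""
    buf = zlib.decompress(blob)
    return {byte_idx * 8 + bit
            for byte_idx, b in enumerate(buf)
            for bit in range(8) if b & (1 << bit)}

ids = {3, 17, 4242, 1_000_000}
assert unpack_bitset(pack_bitset(ids)) == ids
```

A bitmap like this stays compact even for millions of ids, which is why it beats serializing the hit list as text.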
PyLucene is a Python wrapper around Java Lucene. It embeds a Java VM with Lucene into a Python process. The extension is machine-generated with JCC, a C++ code generator that makes it possible to call into Java classes from Python via the Java Native Interface (JNI).