Letting In The Light: Using Solr as an External Search Component
Jay Luker, Benoit Thiell
SAO/NASA Astrophysics Data System
http://adsabs.harvard.edu/
Overview of Invenio
Our Solr-Invenio Integration Project
A few tips on Solr hacking along the way
Was restructured in 1994 to become an A&I service for astronomers and astrophysicists, with fulltext archive
Has 100% penetration in astronomical community, with take-up in other areas of space sciences, engineering and physics
625K fulltext articles
Painstakingly curated collection of citations and links to fulltext and data products
Search, Browse, Notifications, Personalization
API access to all content (TWITA)
Network of 12 mirror sites
ADS Labs: http://labs.adsabs.harvard.edu
Never heard of Invenio?
2000: Extension of the server to allow storing multimedia content (photos, posters, brochures, videos) and creation of the open-source CDSware project
Renamed CDS Invenio and then Invenio
Both an institutional repository and a digital library
Check it out! -> http://invenio-software.org/
Why choose Invenio?
Growing penetration in the field of physics
Metadata curation tools (record editor, merger)
Support for citation graphs and citation-based searches
Support for second-order searches
Under the hood
Coupled with MySQL only (for now)
Scales to sets of 2M+ records
MARC storage of records
Modular architecture with:
Format conversion (MARCXML, DC, NLM, etc.)
References and citations handler
Plot and figure extraction
invenio.intbitset
In-house C implementation of Python sets
Stored marshalled in the database and used as such in the search engine
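
To make the idea concrete, here is a minimal pure-Python sketch of a set of non-negative integers held as a bit array and serialized in compressed form. It is only an illustration: the class name TinyBitSet and its methods are hypothetical and are not the invenio.intbitset API, which is implemented in C, and the use of zlib for the stored form is an assumption made for the example.

```python
import zlib


class TinyBitSet:
    """Hypothetical stand-in for an integer bit set (not invenio.intbitset)."""

    def __init__(self, ids=()):
        self.bits = 0  # a Python int used as an arbitrarily long bit array
        for i in ids:
            self.add(i)

    def add(self, i):
        self.bits |= 1 << i

    def __contains__(self, i):
        return (self.bits >> i) & 1 == 1

    def __and__(self, other):
        # Set intersection is a single bitwise AND.
        result = TinyBitSet()
        result.bits = self.bits & other.bits
        return result

    def __iter__(self):
        bits, i = self.bits, 0
        while bits:
            if bits & 1:
                yield i
            bits >>= 1
            i += 1

    def dump(self):
        """Serialize to zlib-compressed bytes, e.g. for a database blob."""
        raw = self.bits.to_bytes((self.bits.bit_length() + 7) // 8, "little")
        return zlib.compress(raw)

    @classmethod
    def load(cls, blob):
        result = cls()
        result.bits = int.from_bytes(zlib.decompress(blob), "little")
        return result


a = TinyBitSet([1, 5, 16, 84])
b = TinyBitSet([5, 84, 100])
print(list(a & b))
print(list(TinyBitSet.load(a.dump())))
```

Because the serialized form is just the bit array, a search engine can load it and combine result sets with bitwise operations without first expanding it into a list of ids.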
Invenio sounds great! Why use Solr then?
Invenio's indexing is slow by design (it trades indexing time for search speed), but it is too slow for a repository of this size
Solr has a wide community of users/developers and lots of extensions.
Issues with the integration
Invenio's search engine requires full sets of results

  • 45. Communicate over HTTP with very large payloads
  • 49. Take advantage of Solr faceting
  • 50. Not duplicate existing Invenio functionality
  • 51. Write as little code as possible
  • 53. Problem #1: Retrieving very large result sets of ids. Like, millions.
  • 54. The WTH Approach

        http://myhost:8983/solr/select?q={foo}&fl=id&rows={n}

        q={foo}    Query for foo
        fl=id      Only return the id field
        rows={n}   Return n rows of the result
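
As a small illustration of the request above, the following sketch builds the same kind of URL in Python. The function name naive_id_query_url and the example values are hypothetical; only the q, fl and rows parameters and the host come from the slide.

```python
from urllib.parse import urlencode


def naive_id_query_url(base_url, query, max_rows):
    """Build a select URL that asks only for the id field of every hit."""
    params = {"q": query, "fl": "id", "rows": max_rows}
    return base_url + "?" + urlencode(params)


# Ask for up to one million ids matching "foo" (illustrative values).
url = naive_id_query_url("http://myhost:8983/solr/select", "foo", 1000000)
print(url)
```

Setting rows to the expected size of the whole result set is what makes this approach expensive: every matching document has to be looked up and serialized just to read its id.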
  • 56. Can be integers, strings, etc.
  • 59. Unique within an index segment
  • 60. The WTH Approach [Timing chart, in seconds. * warmed cache, different servers, same LAN]
  • 61. So what's going on here? [Diagram labels: Query; QueryResult [1,5,16,84,...]; document cache; Lucene Doc (id: 1234, bibcode: <lazy>, Title: <lazy>, ...); Response]
  • 62. Solution: Custom Collector [Diagram labels: Query; QueryResult [1,5,16,84,...]; Response]
  • 63. Solution: Custom Collector

        // MyQueryComponent.java
        ...
        // Run the query through our own collector instead of the default one
        InvenioIdCollector collector = new InvenioIdCollector();
        searcher.search(query, collector);
        ArrayList<Integer> ids = collector.getIds();
        rsp.add("ids", ids);
        ...

        // MyCollector.java
        ...
        ArrayList<Integer> ids = new ArrayList<Integer>();
        ...
        // Called for every matching document: map the Lucene doc number
        // to the schema id and keep only that
        public void collect(int doc) {
            this.ids.add(this.idMap[doc]);
        }
        ...
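
The collector idea can be modelled in a few lines of Python. This is a toy simulation, not the Lucene API: the class IdCollector, the two segments and the hit lists are all made up for illustration. It shows the two callbacks involved, one per index segment and one per matching document, and how a per-segment array turns internal document numbers into schema ids without loading any stored documents.

```python
class IdCollector:
    """Toy model of a per-segment collector; not the Lucene API."""

    def __init__(self):
        self.ids = []
        self.id_map = None

    def set_next_reader(self, segment_id_map):
        # Called once per index segment: remember that segment's
        # docnum -> schema id array (the role played by the field cache).
        self.id_map = segment_id_map

    def collect(self, doc):
        # Called once per matching document with a segment-local number.
        self.ids.append(self.id_map[doc])


# Two hypothetical segments; position = internal docnum, value = schema id.
segments = [[1234, 1240, 1301], [2001, 2002]]
# Hypothetical matches per segment (segment-local doc numbers).
hits = [[0, 2], [1]]

collector = IdCollector()
for id_map, matching_docs in zip(segments, hits):
    collector.set_next_reader(id_map)
    for doc in matching_docs:
        collector.collect(doc)

print(collector.ids)
```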
  • 64. OK, Let's Try This Again

        http://myhost:8983/solr/select?q={foo}&qt=my_querytype

        q={foo}            Query for foo
        qt=my_querytype    Use our custom query handler
  • 66. Better. But ...
  • 68. What's Missing? [Diagram labels: Solr: Fulltext Search; Invenio: Query Processing, Post-processing, Return/Render; passed from Solr to Invenio: Record ids]
  • 69. Again, WTH? [Diagram labels: Solr: Fulltext Search, Facets; Invenio: Query Processing, Post-processing, Return/Render; passed between them: Record ids, Record ids?]
  • 70. Current Solution [Diagram labels: Solr: Fulltext Search, Facets; Invenio: Query Processing, Post-processing, Return/Render; passed between them: Invenio BitSet, Invenio BitSet]
  • 72. Custom Collector to collect doc ids
  • 75. Custom QueryComponent for accepting an Integer BitSet query and returning facets
  • 76. Invenio Query Component Config (solrconfig.xml)

        <searchComponent name="invenio_query"
                         class="org.ads.solr.InvenioQueryComponent" />

        <requestHandler name="invenio_query" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="wt">bitset_stream</str>
          </lst>
          <arr name="components">
            <str>invenio_query</str>
            <str>stats</str>
          </arr>
        </requestHandler>
        ...
        <queryResponseWriter name="bitset_stream"
                             class="org.ads.solr.InvenioBitsetStreamResponseWriter"/>

  • 77. Invenio Query Component (InvenioQueryComponent.java)

        public void process(ResponseBuilder rb) throws IOException {
            SolrQueryResponse rsp = rb.rsp;
            SolrIndexSearcher searcher = rb.req.getSearcher();
            InvenioIdCollector collector = new InvenioIdCollector();
            SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
            Query query = cmd.getQuery();
            // Collect matching ids straight into a bitset
            searcher.search(query, collector);
            InvenioBitSet bitset = collector.getBitSet();
            rsp.add("bitset", bitset);
        }

  • 78. Invenio Id Collector (InvenioIdCollector.java)

        public void setNextReader(IndexReader reader, int docBase)
                throws IOException {
            this.reader = reader;
            this.docBase = docBase;
            try {
                // Per-segment array mapping Lucene doc numbers to schema ids
                this.idMap = FieldCache.DEFAULT.getInts(this.reader, "id");
            } catch (IOException e) {
                SolrException.logOnce(
                    SolrCore.log, "Exception during idMap init", e);
            }
        }

  • 79. Response Writer (InvenioBitsetStreamResponseWriter.java)

        public void write(OutputStream out, SolrQueryRequest req,
                          SolrQueryResponse rsp) {
            InvenioBitSet bitset =
                (InvenioBitSet) rsp.getValues().get("bitset");
            // Stream the bitset as zlib-compressed bytes
            ZOutputStream zOut = new ZOutputStream(out, JZlib.Z_BEST_SPEED);
            try {
                zOut.write(bitset.toByteArray());
                zOut.flush();
            } catch (IOException e) {
                SolrException.logOnce(SolrCore.log,
                    "Exception during compression/output of bitset", e);
            }
        }

  • 81. Invenio Facet Component Config (solrconfig.xml)

        <searchComponent name="invenio_facets"
                         class="org.ads.solr.InvenioFacetComponent" />

        <requestHandler name="/invenio_facets" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="wt">json</str>
            <str name="q.op">OR</str>
            <str name="rows">0</str>
            <str name="facet">true</str>
            <str name="facet.field">author_facet</str>
            ...
          </lst>
          <arr name="components">
            <str>invenio_facets</str>
            <str>facet</str>
          </arr>
        </requestHandler>
  • 82. A bit of python

        import mimetools
        import urllib2

        import simplejson

        # POST the compressed bitset to Solr as a multipart file upload
        r = urllib2.Request(facet_query_url)
        data = bitset.fastdump()
        boundary = mimetools.choose_boundary()

        contents = '--%s\r\n' % boundary
        contents += ('Content-Disposition: form-data; '
                     'name="bitset"; filename="bitset"\r\n')
        contents += 'Content-Type: application/octet-stream\r\n'
        contents += '\r\n' + data + '\r\n'
        contents += '--%s--\r\n' % boundary

        r.add_data(contents)
        r.add_unredirected_header(
            'Content-Type', 'multipart/form-data; boundary=%s' % boundary)
        u = urllib2.urlopen(r)
        facet_data = simplejson.load(u)
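
The slide's snippet targets Python 2 (urllib2, mimetools). As a rough Python 3 sketch of the same idea, the helper below builds a single-file multipart/form-data body as bytes, with each header line terminated by CRLF and a blank line before the payload. The function name build_multipart_body and the fixed boundary used in the example are assumptions for illustration, not part of the original code.

```python
import uuid


def build_multipart_body(field_name, payload, boundary=None):
    """Return (body, content_type) for a single-file multipart/form-data POST."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="{n}"; filename="{n}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).format(b=boundary, n=field_name).encode("ascii")
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("ascii")
    content_type = "multipart/form-data; boundary=" + boundary
    return head + payload + tail, content_type


# Example with a tiny stand-in payload and a fixed, illustrative boundary.
body, content_type = build_multipart_body("bitset", b"\x00\x01\x02", "XYZ")
print(content_type)
```

The resulting pair could then be handed to urllib.request.Request(url, data=body, headers={"Content-Type": content_type}); the request itself is left out here so the sketch runs without a Solr server.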
  • 83. Facet Query Component (InvenioFacetComponent.java)

        ...
        Iterable<ContentStream> streams = req.getContentStreams();
        ...
        // Read and decompress the uploaded bitset
        InputStream is = stream.getStream();
        ByteArrayOutputStream bOut = new ByteArrayOutputStream();
        ZInputStream zIn = new ZInputStream(is);
        IOUtils.copy(zIn, bOut);
        InvenioBitSet bitset = new InvenioBitSet(bOut.toByteArray());
        ...

  • 84. Facet Query Component (cont.) (InvenioFacetComponent.java)

        ...
        // Translate each schema id in the bitset into a Lucene doc id
        BitDocSet docSetFilter = new BitDocSet();
        int i = 0;
        while (bitset.nextSetBit(i) != -1) {
            int nextBit = bitset.nextSetBit(i);
            int lucene_id = idMap.get(nextBit);
            docSetFilter.add(lucene_id);
            i = nextBit + 1;
        }
        ...
        // Restrict the search to those documents so facets are computed on them
        SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
        cmd.setFilter(docSetFilter);
        SolrIndexSearcher.QueryResult result =
            new SolrIndexSearcher.QueryResult();
        searcher.search(result, cmd);
        rb.setResult(result);
        ...
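
What the facet component does with the incoming ids can be sketched outside Solr as two steps: translate schema ids into internal document numbers to form a filter, then count field values over the filtered documents only. The function facet_counts and the toy index below are hypothetical and only illustrate the logic; real Solr faceting works on its own index structures.

```python
from collections import Counter


def facet_counts(schema_ids, id_to_docnum, author_by_docnum):
    """Count author values over the documents named by a set of schema ids."""
    # Translate schema ids into internal document numbers (the filter).
    doc_filter = {id_to_docnum[i] for i in schema_ids if i in id_to_docnum}
    # Count field values over the filtered documents only.
    counts = Counter()
    for docnum in doc_filter:
        counts.update(author_by_docnum[docnum])
    return counts


# Hypothetical toy index: schema id -> docnum, docnum -> author values.
id_to_docnum = {1234: 0, 1240: 1, 1301: 2}
author_by_docnum = [["Author A"], ["Author B"], ["Author A", "Author B"]]

counts = facet_counts({1234, 1301}, id_to_docnum, author_by_docnum)
print(counts.most_common())
```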
  • 86. Alternative Approaches: PyLucene, embedded Solr, CPython within Java, ...
  • 88. Is there a way to bypass the Collector stage completely?
  • 89. How can we return document scores?
  • 90. Alternative approaches: pylucene, pylucene + solr, cpython within Java.
  • 92. The Invenio Team, especially...

Editor's notes

  1. The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant.
  2. 1994 was the move to the web.
  3. Astronomy: 1.8M. Physics: 5.8M. arXiv e-prints: 650K. Citations: 40M (over 3.4M papers with citations). Curated links: 23M (fulltext, data products, citations). 4M scanned pages, 625K articles. 650K pages of historical material. Advanced search allows for searching by astronomical object (via SIMBAD) and attributes like "has dataset". TWITA = The Website Is The API: via the data_type=<foo> param, also structured metadata within the pages.
  4. INSPIRE: Invenio for SPIRES, the Physics database at Stanford.
  5. Obviously, performance was also an objective. The Invenio team had been skeptical of the necessity of incorporating an external tool/service to do fulltext indexing and/or faceting, but once introduced to Solr they quickly came around. In spite of the fact that at least some of the fancypants sorting, ranking and filtering functionality could most likely be reproduced using Solr, there was a strong reluctance to rewrite that code. Writing as little Java as possible doesn't just come from a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components. Rely as much as possible on what Solr affords. Loose integration in this case means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we succeeded in that towards the end.
  6. When we talk about the ids being sent back and forth between Invenio & Solr, we are talking about the schema ids.
  7. So what's going on here? Our first thought was that maybe it was the time needed to serialize/de-serialize the response, but that turned out not to be it.
  8. Relevant solrconfig.xml settings: queryResultMaxDocsCached, queryResultWindowSize, enableLazyFieldLoading.
  9. No need to specify the number of rows or which fields to return.
  10. Post-processing = second-order searching, filtering. We can't retrieve facets with the initial query because the final list of search results will depend on Invenio post-processing. So how do you send a very large set of ids to get a set of facet results?
  11. Satisfies most objectives. We get searching & faceting. We don't have to write a lot of Python or Java: Invenio needs the indexing piece. We are not duplicating anything that Invenio already does very well. It is loosely coupled: because communication is in a form that is native to Invenio, we could easily swap in/out different services for either piece.
  12. Seems like a lot, but in total lines of code it's not that much, especially considering it's in Java. Plus I suck at Java and I was able to do it all in 2-3 weeks of trial and error hacking. Plus, it all very closely conforms to the affordances of the Solr API. Only one small thing might be considered a "hack".
  13. Defining our custom query component and telling the default Solr search handler to use it. Also defining our custom response writer.
  14. A query component class has two opportunities to interact with the incoming request: prepare & process. We only need process.
  15. These times include decompressing and unmarshalling the bitset into an Invenio intbitset object in Python.
  16. Defining our custom query component and telling the default Solr search handler to use it. Also defining our custom response writer.
  17. PyLucene is a Python wrapper around Java Lucene. It embeds a Java VM with Lucene into a Python process. The extension is machine-generated with JCC, a C++ code generator that makes it possible to call into Java classes from Python via Java's Native Invocation Interface (JNI).