Letting In The Light: Using Solr as an External Search Component. Jay Luker, Benoit Thiell. SAO/NASA Astrophysics Data System, http://adsabs.harvard.edu/
Overview of Invenio
Our Solr-Invenio Integration Project
A few tips on Solr hacking along the way
Was restructured in 1994 to become an A&I service for astronomers and astrophysicists, with fulltext archive
Has 100% penetration in the astronomical community, with take-up in other areas of space sciences, engineering and physics
625K fulltext articles
Painstakingly curated collection of citations and links to fulltext and data products
Search, Browse, Notifications, Personalization
API access to all content (TWITA)
Network of 12 mirror sites
ADS Labs:  http://labs.adsabs.harvard.edu
 
 
 
Never heard of Invenio?
2000: Extension of the server to allow storing multimedia content (photos, posters, brochures, videos) and creation of the open-source CDSware project
Renamed  CDS Invenio  and then  Invenio
Both an institutional repository and a digital library
Check it out!  ->   http://invenio-software.org/
Why choose Invenio?
Growing penetration in the field of physics
Metadata curation tools (record editor, merger)
Support for citation graphs and citation-based searches
Second-order search support
Under the hood
Coupled with MySQL only (for now)
Scales to sets of 2M+ records
MARC storage of records
Modular architecture with:
Format conversion (MARCXML, DC, NLM, etc)
References and citations handler
Plot and figure extraction
invenio.intbitset
In-house C implementation of Python sets
Stored marshalled in the database and used as such in the search engine
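
To make the idea of a compact, storable set of record ids concrete, here is a minimal pure-Python sketch. It is not the invenio.intbitset API: the class name `TinyBitSet`, its methods, and the byte layout (little-endian bytes, zlib-compressed) are all assumptions made for illustration.

```python
import zlib


class TinyBitSet:
    """Hypothetical stand-in for an integer bitset; not the real invenio.intbitset."""

    def __init__(self, ids=()):
        self.bits = 0
        for i in ids:
            self.bits |= 1 << i  # bit i set <=> record id i is in the set

    def __and__(self, other):
        result = TinyBitSet()
        result.bits = self.bits & other.bits  # intersection is one bitwise AND
        return result

    def __iter__(self):
        bits, i = self.bits, 0
        while bits:
            if bits & 1:
                yield i
            bits >>= 1
            i += 1

    def dump(self):
        # Assumed layout: little-endian bytes, zlib-compressed for storage.
        raw = self.bits.to_bytes((self.bits.bit_length() + 7) // 8 or 1, "little")
        return zlib.compress(raw)

    @classmethod
    def load(cls, blob):
        result = cls()
        result.bits = int.from_bytes(zlib.decompress(blob), "little")
        return result


hits = TinyBitSet([1, 5, 16, 84])
collection = TinyBitSet([5, 16, 99])
restored = TinyBitSet.load((hits & collection).dump())
print(list(restored))
```

Because a search engine built this way combines whole sets with bitwise operations, it needs every matching id, not just the first page of hits.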
Invenio sounds great! Why use Solr then?
Invenio's indexing is slow by design (it trades indexing time for search speed), but it is too slow for a repository this large
Solr has a wide community of users/developers and lots of extensions.
Issues with the integration
Invenio's search engine requires full sets of results

  • 45. Communicate over HTTP with very large payloads
  • 49. Take advantage of Solr faceting
  • 50. Not duplicate existing Invenio functionality
  • 51. Write as little code as possible
  • 53. Problem #1: Retrieving very large result sets of ids. Like, millions.
  • 54. The WTH Approach
        http://myhost:8983/solr/select?q={foo}&fl=id&rows={n}
        q={foo}: query for foo
        fl=id: only return the id field
        rows={n}: return n rows of the result
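
The request on this slide can be sketched from the client side as follows. This is only an illustration: the helper name `build_id_query_url` and the example query `title:galaxy` are hypothetical, and nothing is actually sent to a server.

```python
from urllib.parse import urlencode


def build_id_query_url(base_url, query, rows):
    """Build a select URL that asks for only the id field of the first `rows` hits."""
    params = [("q", query), ("fl", "id"), ("rows", rows)]
    return base_url + "?" + urlencode(params)


url = build_id_query_url("http://myhost:8983/solr/select", "title:galaxy", 1000000)
print(url)
```

Note that `urlencode` percent-encodes the colon in the query, which is why the printed URL contains `title%3Agalaxy`.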
  • 56. Can be integers, strings, etc
  • 59. Unique within an index segment
  • 60. The WTH Approach [chart: response times in seconds] * warmed cache, different servers, same LAN
  • 61. So what's going on here? [diagram labels: Query; QueryResult [1,5,16,84,...]; document cache; Lucene Doc (id: 1234, bibcode: <lazy>, Title: <lazy>, ...); Response]
  • 62. Solution: Custom Collector [diagram labels: Query; QueryResult [1,5,16,84,...]; Response]
  • 63. Solution: Custom Collector
        MyQueryComponent.java:
          ...
          InvenioIdCollector collector = new InvenioIdCollector();
          searcher.search(query, collector);
          ArrayList<Integer> ids = collector.getIds();
          rsp.add("ids", ids);
          ...
        MyCollector.java:
          ...
          ArrayList<Integer> ids = new ArrayList<Integer>();
          ...
          public void collect(int doc) {
              this.ids.add(this.idMap[doc]);
          }
          ...
  • 64. OK, Let's Try This Again
        http://myhost:8983/solr/select?q={foo}&qt=my_querytype
        q={foo}: query for foo
        qt=my_querytype: use our custom query handler
  • 66. Better. But ...
  • 68. What's Missing? [diagram labels: Fulltext Search; Record Ids; Invenio; Solr; Query Processing; Post-processing; Return/Render]
  • 69. Again, WTH? [diagram labels: Fulltext Search; Record ids; Invenio; Record ids?; Facets; Solr; Query Processing; Post-processing; Return/Render]
  • 70. Current Solution [diagram labels: Fulltext Search; Invenio BitSet; Invenio; Invenio BitSet; Facets; Solr; Query Processing; Post-processing; Return/Render]
  • 72. Custom Collector to collect doc ids
  • 75. Custom QueryComponent for accepting an Integer BitSet query and returning facets
  • 76. Invenio Query Component Config (solrconfig.xml)
        <searchComponent name="invenio_query"
                         class="org.ads.solr.InvenioQueryComponent" />
        <requestHandler name="invenio_query" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="wt">bitset_stream</str>
          </lst>
          <arr name="components">
            <str>invenio_query</str>
            <str>stats</str>
          </arr>
        </requestHandler>
        ...
        <queryResponseWriter name="bitset_stream"
                             class="org.ads.solr.InvenioBitsetStreamResponseWriter"/>
  • 77. Invenio Query Component (InvenioQueryComponent.java)
        public void process(ResponseBuilder rb) throws IOException {
            SolrQueryResponse rsp = rb.rsp;
            SolrIndexSearcher searcher = rb.req.getSearcher();
            InvenioIdCollector collector = new InvenioIdCollector();
            SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
            Query query = cmd.getQuery();
            searcher.search(query, collector);
            InvenioBitSet bitset = collector.getBitSet();
            rsp.add("bitset", bitset);
        }
  • 78. Invenio Id Collector (InvenioIdCollector.java)
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
            this.reader = reader;
            this.docBase = docBase;
            try {
                this.idMap = FieldCache.DEFAULT.getInts(this.reader, "id");
            } catch (IOException e) {
                SolrException.logOnce(SolrCore.log, "Exception during idMap init", e);
            }
        }
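
The collector pattern shown in the Java slides can be modelled in a few lines of Python. This is a toy model under stated assumptions, not Lucene or Solr code: the class `IdCollector`, the segment dictionaries and all the id values are made up for illustration. It shows why the id map has to be refreshed for every segment: document numbers passed to `collect` are local to the current segment.

```python
class IdCollector:
    """Toy model of a collector: maps segment-local doc numbers to schema ids."""

    def __init__(self):
        self.ids = []
        self.id_map = []

    def set_next_reader(self, id_map):
        # Called once per index segment; id_map[doc] is the schema id
        # of the segment-local document number `doc`.
        self.id_map = id_map

    def collect(self, doc):
        # Called once per matching document; no stored fields are loaded.
        self.ids.append(self.id_map[doc])


# Two hypothetical segments; doc numbers restart at 0 in each one.
segments = [
    {"id_map": [1234, 1240, 1301], "matches": [0, 2]},
    {"id_map": [2000, 2001], "matches": [1]},
]

collector = IdCollector()
for segment in segments:
    collector.set_next_reader(segment["id_map"])
    for doc in segment["matches"]:
        collector.collect(doc)

print(collector.ids)
```

The point of the design is that each hit costs one array lookup, instead of a trip through the document cache to materialise a full document just to read its id.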
  • 79. Response Writer (InvenioBitsetStreamResponseWriter.java)
        public void write(OutputStream out, SolrQueryRequest req, SolrQueryResponse rsp) {
            InvenioBitSet bitset = (InvenioBitSet) rsp.getValues().get("bitset");
            ZOutputStream zOut = new ZOutputStream(out, JZlib.Z_BEST_SPEED);
            try {
                zOut.write(bitset.toByteArray());
                zOut.flush();
            } catch (IOException e) {
                SolrException.logOnce(SolrCore.log,
                    "Exception during compression/output of bitset", e);
            }
        }
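
On the receiving side, a client has to undo the compression and turn the bytes back into ids. The sketch below shows one way that could look. The bit layout is an assumption (lowest bit of the first byte is id 0); the actual layout produced by `InvenioBitSet.toByteArray()` is not shown in the slides, and the function name `decode_bitset` is hypothetical.

```python
import zlib


def decode_bitset(payload):
    """Decompress a zlib stream and return the positions of the set bits."""
    raw = zlib.decompress(payload)
    ids = []
    for byte_index, byte in enumerate(raw):
        for bit in range(8):
            if byte & (1 << bit):  # assumed: lowest bit of byte 0 is id 0
                ids.append(byte_index * 8 + bit)
    return ids


# Simulate a server response carrying ids 1, 5 and 16.
raw = bytearray(3)
for record_id in (1, 5, 16):
    raw[record_id // 8] |= 1 << (record_id % 8)
payload = zlib.compress(bytes(raw), 1)  # level 1 is zlib's fastest setting

print(decode_bitset(payload))
```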
  • 81. Invenio Facet Component Config (solrconfig.xml)
        <searchComponent name="invenio_facets"
                         class="org.ads.solr.InvenioFacetComponent" />
        <requestHandler name="/invenio_facets" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="wt">json</str>
            <str name="q.op">OR</str>
            <str name="rows">0</str>
            <str name="facet">true</str>
            <str name="facet.field">author_facet</str>
            ...
          </lst>
          <arr name="components">
            <str>invenio_facets</str>
            <str>facet</str>
          </arr>
        </requestHandler>
  • 82. A bit of python
        import mimetools, simplejson, urllib2   # Python 2 modules

        r = urllib2.Request(facet_query_url)
        data = bitset.fastdump()
        boundary = mimetools.choose_boundary()
        # Each multipart header line ends with CRLF; a blank line precedes the payload.
        contents = '--%s\r\n' % boundary
        contents += ('Content-Disposition: form-data; '
                     'name="bitset"; filename="bitset"\r\n')
        contents += 'Content-Type: application/octet-stream\r\n'
        contents += '\r\n' + data + '\r\n'
        contents += '--%s--\r\n' % boundary
        r.add_data(contents)
        r.add_unredirected_header('Content-Type',
                                  'multipart/form-data; boundary=%s' % boundary)
        u = urllib2.urlopen(r)
        facet_data = simplejson.load(u)
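
For readers on Python 3, where `urllib2` and `mimetools` no longer exist, the same multipart body can be assembled with bytes. This is a sketch rather than the authors' code: the function name `encode_bitset_upload` is hypothetical, and only the form field name `bitset` is taken from the slide.

```python
import uuid


def encode_bitset_upload(data, boundary=None):
    """Return (content_type, body) for a one-part multipart/form-data upload."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="bitset"; filename="bitset"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).format(b=boundary).encode("ascii")
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("ascii")
    return "multipart/form-data; boundary=" + boundary, head + data + tail


content_type, body = encode_bitset_upload(b"\x00\x01\x02", boundary="XYZ")
print(content_type)
```

The returned body and content type could then be passed to `urllib.request.Request(url, data=body, headers={"Content-Type": content_type})`.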
  • 83. Facet Query Component (InvenioFacetComponent.java)
        ...
        Iterable<ContentStream> streams = req.getContentStreams();
        ...
        InputStream is = stream.getStream();
        ByteArrayOutputStream bOut = new ByteArrayOutputStream();
        ZInputStream zIn = new ZInputStream(is);
        IOUtils.copy(zIn, bOut);
        InvenioBitSet bitset = new InvenioBitSet(bOut.toByteArray());
        ...
  • 84. Facet Query Component (cont.) (InvenioFacetComponent.java)
        ...
        BitDocSet docSetFilter = new BitDocSet();
        int i = 0;
        while (bitset.nextSetBit(i) != -1) {
            int nextBit = bitset.nextSetBit(i);
            int lucene_id = idMap.get(nextBit);
            docSetFilter.add(lucene_id);
            i = nextBit + 1;
        }
        ...
        SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
        cmd.setFilter(docSetFilter);
        SolrIndexSearcher.QueryResult result = new SolrIndexSearcher.QueryResult();
        searcher.search(result, cmd);
        rb.setResult(result);
        ...
  • 85.  
  • 86. Alternative Approaches: PyLucene; embedded Solr; CPython within Java; ...
  • 87.
  • 88. Is there a way to bypass the Collector stage completely?
  • 89. How can we return document scores?
  • 90. Alternative approaches: PyLucene, PyLucene + Solr, CPython within Java.
  • 91.
  • 92. The Invenio Team, especially...
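
The upload on slide 82 uses `urllib2` and `mimetools`, which exist only in Python 2. As a rough sketch of the same idea with the Python 3 standard library, the helper below builds a single-part `multipart/form-data` body around an opaque byte string. The function name `encode_bitset_upload` and the default field name are illustrative assumptions, not part of the original code, and the payload here is just arbitrary bytes standing in for the serialized bitset.

```python
import uuid

CRLF = b"\r\n"


def encode_bitset_upload(data, field="bitset", boundary=None):
    """Return (content_type, body) for a one-part multipart/form-data POST.

    `data` is treated as opaque bytes (hypothetically, a serialized bitset).
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    # The delimiter must never occur inside the payload itself.
    if delimiter in data:
        raise ValueError("boundary occurs in payload; choose another")
    disposition = (
        'Content-Disposition: form-data; name="%s"; filename="%s"' % (field, field)
    ).encode("ascii")
    body = CRLF.join([
        delimiter,
        disposition,
        b"Content-Type: application/octet-stream",
        b"",                 # blank line separates part headers from content
        data,
        delimiter + b"--",   # closing delimiter
        b"",                 # trailing CRLF
    ])
    content_type = "multipart/form-data; boundary=%s" % boundary
    return content_type, body
```

The returned pair could then be passed to `urllib.request.Request(url, data=body, headers={"Content-Type": content_type})`. Every header line and the payload end in CRLF, which is what the multipart format expects.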

Editor's notes

  1. The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant.
  2. 1994 was the move to the web.
  3. Astronomy: 1.8M; Physics: 5.8M; arXiv e-prints: 650K; citations: 40M (over 3.4M papers with citations); curated links: 23M (fulltext, data products, citations); 4M scanned pages, 625K articles; 650K pages of historical material. Advanced search allows searching by astronomical object (via SIMBAD) and by attributes like “has dataset”. TWITA = The Website Is The API: via the data_type=<foo> param, plus structured metadata within the pages.
  4. INSPIRE: Invenio for SPIRES, the Physics database at Stanford.
  5. Obviously, performance was also an objective. The Invenio team had been skeptical of the necessity of incorporating an external tool/service to do fulltext indexing and/or faceting, but once introduced to Solr they quickly came around. Even though at least some of the fancypants sorting, ranking, and filtering functionality could most likely be reproduced using Solr, there was a strong reluctance to rewrite that code. Writing as little Java as possible doesn't just come from a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components: rely as much as possible on what Solr affords. Loose integration in this case means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we succeeded in that towards the end.
  6. When we talk about the ids being sent back and forth between Invenio & Solr, we are talking about the schema ids.
  7. So what's going on here? Our first thought was that it might be the time needed to serialize/de-serialize the response, but that turned out not to be it.
  8. solrconfig.xml settings: queryResultMaxDocsCached, queryResultWindowSize, enableLazyFieldLoading.
  9. No need to specify the number of rows or which fields to return.
  10. Post-processing = second-order searching and filtering. We can't retrieve facets with the initial query because the final list of search results depends on Invenio post-processing. So how do you send a very large set of ids to get a set of facet results?
  11. Satisfies most objectives. We get searching & faceting. We don't have to write a lot of Python or Java: Invenio needs the indexing piece. We are not duplicating anything that Invenio already does very well. It is loosely coupled: because communication is in a form that is native to Invenio, we could easily swap in/out different services for either piece.
  12. Seems like a lot, but in total lines of code it's not that much, especially considering it's in Java. Plus I suck at Java and I was able to do it all in 2-3 weeks of trial-and-error hacking. Plus, it all conforms very closely to the affordances of the Solr API. Only one small thing might be considered a “hack”.
  13. Defining our custom query component and telling the default Solr search handler to use it. Also defining our custom response writer.
  14. A query component class has two opportunities to interact with the incoming request: prepare & process. We only need process.
  15. These times include decompressing and unmarshalling the bitset into an Invenio intbitset object in Python.
  17. PyLucene is a Python wrapper around Java Lucene. It embeds a Java VM with Lucene into a Python process. The extension is machine-generated with JCC, a C++ code generator that makes it possible to call into Java classes from Python via the Java Native Interface (JNI).
  19. The Invenio team had been skeptical of the necessity of incorporating an external tool/service to do fulltext indexing and/or faceting, but once introduced to Solr they quickly came around. Even though at least some of the fancypants sorting, ranking, and filtering functionality could most likely be reproduced using Solr, there was a strong reluctance to rewrite that code. Writing as little Java as possible doesn't just come from a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components: rely as much as possible on what Solr affords. Loose integration in this case means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we succeeded in that towards the end.
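
Note 15 mentions decompressing and unmarshalling the bitset on the Python side, and slide 84 translates each set bit to a Lucene id through `idMap`. The sketch below illustrates that round trip under a loud assumption: the byte layout (bit `i` stored in byte `i // 8`, least significant bit first, then zlib-compressed) is invented for illustration and is not the actual serialization used by intbitset or `InvenioBitSet`. The names `pack_ids`, `unpack_ids`, and `to_lucene_filter` are likewise hypothetical.

```python
import zlib


def pack_ids(ids):
    """Pack non-negative ints into a bitmap and zlib-compress it.

    Assumed layout (not the real intbitset format): bit i lives in
    byte i // 8, least significant bit first.
    """
    if not ids:
        return zlib.compress(b"", 1)
    bitmap = bytearray(max(ids) // 8 + 1)
    for i in ids:
        bitmap[i // 8] |= 1 << (i % 8)
    # Level 1 favours speed over ratio, like Z_BEST_SPEED on slide 79.
    return zlib.compress(bytes(bitmap), 1)


def unpack_ids(payload):
    """Decompress a payload from pack_ids() and list the set bits in order."""
    bitmap = zlib.decompress(payload)
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bitmap)
        for bit in range(8)
        if byte & (1 << bit)
    ]


def to_lucene_filter(ids, id_map):
    """Translate schema ids to internal doc ids, as the idMap lookup does."""
    return {id_map[i] for i in ids}
```

A sparse set with a large maximum id still compresses well because the bitmap is mostly zero bytes, which is one reason a compressed bitset is a reasonable wire format for very large id lists.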