SlideShare une entreprise Scribd logo
1  sur  4
Télécharger pour lire hors ligne
Personalised
access to
cultural heritage
spaces

Recommendations for
the automatic enrichment of DL content
using open source software
Authors:

Eneko Agirre and Arantxa Otegi

www.paths-project.eu
PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
Recommendations for the automatic enrichment
of DL content using open source software

1 Introduction
2 Producing intra-collections links
2.1 Similarity links
2.2 Typed-similarity links
3 Producing background links
4 Ontology extension
References

1 Introduction
PATHS uses text processing software to enable the following functionality in the prototypes
(see deliverables D2.1 and D2.2 http://paths-project.eu/eng/Resources):
●
●
●

intra-collection links
background links
ontology extension

The aim of PATHS is to investigate the use of those functionalities to better serve exploration
by users. Most of the software is in-house and relatively complex. The release of the software
as open source was not in the DOW, and the exploitations plan goes in the software-as-aservice direction, where the processing is done in servers prepared by the partners. For
instance, in a closely related Best Practice Network (LoCloud http://www.locloud.eu/), servers
for automatic content enrichment of digital library items are to be produced.
Alternatively, projects like OpeNER (http://www.opener-project.org/) are devoted to the
release of open source tools which are similar to the ones used for text processing in PATHS.
The release of all PATHS software as open source out-of-the-box packages would be a
major undertaking, on a par to those European projects.
The goal of this document is to present an overall set of recommendations for the automatic
enrichment of Digital Libraries content using open source software. We think this would be
useful for third-parties who would like to offer similar services. Note that this is not a step-bystep guide for reimplementation, but an overall view of the required software and
programming effort involved.
This document is structured according to each of the enrichment tasks described in PATHS
deliverables D2.1 and D2.2.

PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
2

Producing intra-collections links

PATHS produced both generic similarity links and more specific typed-similarity links.

2.1 Similarity links
The target items would need to be parsed by the following Natural Language Processing
(NLP) tools: Pos tagging and lemmatization. Several open source products exist, including
the two used in PATHS:
CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml)
and
Freeling (http://nlp.lsi.upc.edu/freeling/).
Regarding multilinguality PATHS worked on Spanish and English texts. Freeling covers both,
and Stanford works out-of-the-box for English. Recent projects like OpenNER also offer a
suite of open source NLP tools including, in addition to English and Spanish, four other
European languages.
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the similarity links requires additional software, which in this case
has been developed in-house. No out-of-the box alternatives exist. The interested parties
would need to replicate the software described in (Aletras et al. 2012).
As an alternative, the PATHS server demonstrating similarity links described in (Agirre et al.
2013b) uses the functionality provided by a search engine (Solr http://lucene.apache.org/solr/)
to provide similar items. Note that this alternative does not require additional NLP tools, as it
uses their own stop words and stemming algorithm.

2.2 Typed-similarity links
The target items would need to be parsed by the following NLP tools: Pos tagging and
lemmatization (see previous section).
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the typed-similarity links requires additional software, including open
source machine learning software (Weka http://www.cs.waikato.ac.nz/ml/weka/) and the inhouse scripts to extract features from items, train the machine learning models on the
publicly available typed-similarity datasets produced by PATHS (http://ixa2.si.ehu.es/sts/),
and use the machine learning models on the target items. No out-of-the-box alternatives
exist. The interested parties would need to replicate the software as described in (Agirre et al.
2013a).
3

Producing background links

In order to produce background links we used Wikipedia Miner, an open source software
available at http://wikipedia-miner.cms.waikato.ac.nz. We used wikipedia miner out-of-thebox. In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.

4

Ontology extension

The target items would need to be parsed by the following NLP tools: Pos tagging and
lemmatisation (see previous section).
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the vocabulary requires additional software. We first extract the
background links (see previous section), and then find the most relevant Wikipedia articles
per item. This is done globally, analysing the statistics of the whole collection. Those articles
are used to categorize the items according to a Wikipedia-based category system, which is
trimmed-down to only cover the categories which are relevant to the collection at hand. No
out-of-the-box alternatives exist. The interested parties would need to replicate the software
as described in (Fernando et al. 2012).

References
Agirre E., Aletras N., Gonzalez-Agirre A., Rigau G., Stevenson M. (2013a). UBC UOSTYPED: Regression for typed-similarity. The Second Joint Conference on Lexical and
Computational Semantics (*SEM 2013)
Agirre E., Barrena A., Fernandez K., Miranda E., Otegi A., Soroa A. (2013b). PATHSenrich:
a Web Service Prototype for Automatic Cultural Heritage Item Enrichment in Research and
Advanced Technology for Digital Libraries, International Conference on Theory and Practice
of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Lecture Notes in
Computer Science, vol. 8092, pp 462-465.
Aletras N., Stevenson M., Clough P. (2012). Computing Similarity between Items in a Digital
Library of Cultural Heritage. In ACM JOCCH.
Fernando S., Hall M., Agirre E., Soroa A., Clough P., Stevenson M. (2012). Comparing
taxonomies for organising collections of documents. In Proceedings of the 24th International
Conference on Computational Linguistics (Coling 2012), pages 879-894, Mumbai, India,
December 2012.

Contenu connexe

En vedette

PATHS system architecture
PATHS system architecturePATHS system architecture
PATHS system architecturepathsproject
 
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...mediamera
 
Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介Akinori Tateyama
 
My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17Eyal Doron
 
Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...pathsproject
 
Cross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interfaceCross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interfacepathsproject
 
What are the possible damages of phishing and spoofing mail attacks part 2#...
What are the possible damages of phishing and spoofing mail attacks   part 2#...What are the possible damages of phishing and spoofing mail attacks   part 2#...
What are the possible damages of phishing and spoofing mail attacks part 2#...Eyal Doron
 
PATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypePATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypepathsproject
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring reportpathsproject
 
A pilot on Semantic Textual Similarity
A pilot on Semantic Textual SimilarityA pilot on Semantic Textual Similarity
A pilot on Semantic Textual Similaritypathsproject
 
PATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, ViennaPATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, Viennapathsproject
 
Semantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHSSemantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHSpathsproject
 
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...Eyal Doron
 
My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17Eyal Doron
 

En vedette (18)

Boletim informativo - janeiro
Boletim informativo - janeiroBoletim informativo - janeiro
Boletim informativo - janeiro
 
Fichas s
Fichas sFichas s
Fichas s
 
PATHS system architecture
PATHS system architecturePATHS system architecture
PATHS system architecture
 
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
 
Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介
 
My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17
 
Think before you speak
Think before you speakThink before you speak
Think before you speak
 
Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...
 
Zp primary school, ganegaon khalsa
Zp primary school, ganegaon khalsaZp primary school, ganegaon khalsa
Zp primary school, ganegaon khalsa
 
Cross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interfaceCross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interface
 
What are the possible damages of phishing and spoofing mail attacks part 2#...
What are the possible damages of phishing and spoofing mail attacks   part 2#...What are the possible damages of phishing and spoofing mail attacks   part 2#...
What are the possible damages of phishing and spoofing mail attacks part 2#...
 
PATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypePATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototype
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring report
 
A pilot on Semantic Textual Similarity
A pilot on Semantic Textual SimilarityA pilot on Semantic Textual Similarity
A pilot on Semantic Textual Similarity
 
PATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, ViennaPATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, Vienna
 
Semantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHSSemantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHS
 
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
 
My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17
 

Similaire à Recommendations for the automatic enrichment of digital library content using open source software, PATHS report

PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...pathsproject
 
Real time text stream processing - a dynamic and distributed nlp pipeline
Real time text stream  processing - a dynamic and distributed nlp pipelineReal time text stream  processing - a dynamic and distributed nlp pipeline
Real time text stream processing - a dynamic and distributed nlp pipelineConference Papers
 
Sentiment analyzer and opinion mining
Sentiment analyzer and opinion miningSentiment analyzer and opinion mining
Sentiment analyzer and opinion miningAnkush Mehta
 
2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartacho2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartachoeMadrid network
 
07419051111154767
0741905111115476707419051111154767
07419051111154767gayatri 24
 
Knowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-developmentKnowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-developmentDimitris Panagiotou
 
ABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositoriesABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositoriessangeetadhamdhere
 
A Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And RepositoryA Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And RepositoryJill Brown
 
Research Tool - End Note
Research Tool - End NoteResearch Tool - End Note
Research Tool - End Noteador
 
A Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidA Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidUniversity of Piraeus
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
A knowledge-workbench-for-software-development
A knowledge-workbench-for-software-developmentA knowledge-workbench-for-software-development
A knowledge-workbench-for-software-developmentDimitris Panagiotou
 
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET-  	  Hosting NLP based Chatbot on AWS Cloud using DockerIRJET-  	  Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET- Hosting NLP based Chatbot on AWS Cloud using DockerIRJET Journal
 
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...FIAT/IFTA
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012François Belleau
 

Similaire à Recommendations for the automatic enrichment of digital library content using open source software, PATHS report (20)

PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
 
C04 07 1519
C04 07 1519C04 07 1519
C04 07 1519
 
Real time text stream processing - a dynamic and distributed nlp pipeline
Real time text stream  processing - a dynamic and distributed nlp pipelineReal time text stream  processing - a dynamic and distributed nlp pipeline
Real time text stream processing - a dynamic and distributed nlp pipeline
 
Sentiment analyzer and opinion mining
Sentiment analyzer and opinion miningSentiment analyzer and opinion mining
Sentiment analyzer and opinion mining
 
2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartacho2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartacho
 
07419051111154767
0741905111115476707419051111154767
07419051111154767
 
Knowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-developmentKnowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-development
 
ABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositoriesABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositories
 
New Goals of PARES: Spanish Archives Web Portal
New Goals of PARES: Spanish Archives Web PortalNew Goals of PARES: Spanish Archives Web Portal
New Goals of PARES: Spanish Archives Web Portal
 
A Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And RepositoryA Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And Repository
 
Research Tool - End Note
Research Tool - End NoteResearch Tool - End Note
Research Tool - End Note
 
Dspace
DspaceDspace
Dspace
 
Dspace
DspaceDspace
Dspace
 
A Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidA Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over Android
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
A knowledge-workbench-for-software-development
A knowledge-workbench-for-software-developmentA knowledge-workbench-for-software-development
A knowledge-workbench-for-software-development
 
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET-  	  Hosting NLP based Chatbot on AWS Cloud using DockerIRJET-  	  Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
 
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012
 

Plus de pathsproject

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...pathsproject
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...pathsproject
 
Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013pathsproject
 
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 User-Centred Design to Support Exploration and Path Creation in Cultural Her... User-Centred Design to Support Exploration and Path Creation in Cultural Her...
User-Centred Design to Support Exploration and Path Creation in Cultural Her...pathsproject
 
Generating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperGenerating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperpathsproject
 
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...pathsproject
 
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperGenerating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperpathsproject
 
PATHS @ LATECH 2013
PATHS @ LATECH 2013PATHS @ LATECH 2013
PATHS @ LATECH 2013pathsproject
 
PATHS at the eChallenges conference
PATHS at the eChallenges conferencePATHS at the eChallenges conference
PATHS at the eChallenges conferencepathsproject
 
PATHS at the EAA conference 2013
PATHS at the EAA conference 2013PATHS at the EAA conference 2013
PATHS at the EAA conference 2013pathsproject
 
PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013pathsproject
 
Comparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationComparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationpathsproject
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual SimilaritySemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual Similaritypathsproject
 
Comparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentsComparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentspathsproject
 
PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0pathsproject
 
PATHS Second prototype-functional-spec
PATHS Second prototype-functional-specPATHS Second prototype-functional-spec
PATHS Second prototype-functional-specpathsproject
 
PATHS Final state of art monitoring report v0_4
PATHS  Final state of art monitoring report v0_4PATHS  Final state of art monitoring report v0_4
PATHS Final state of art monitoring report v0_4pathsproject
 
PATHS first paths prototype
PATHS first paths prototypePATHS first paths prototype
PATHS first paths prototypepathsproject
 
PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2pathsproject
 
PATHS Content processing 1st prototype
PATHS  Content processing 1st prototypePATHS  Content processing 1st prototype
PATHS Content processing 1st prototypepathsproject
 

Plus de pathsproject (20)

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013
 
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 User-Centred Design to Support Exploration and Path Creation in Cultural Her... User-Centred Design to Support Exploration and Path Creation in Cultural Her...
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 
Generating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperGenerating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paper
 
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
 
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperGenerating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
 
PATHS @ LATECH 2013
PATHS @ LATECH 2013PATHS @ LATECH 2013
PATHS @ LATECH 2013
 
PATHS at the eChallenges conference
PATHS at the eChallenges conferencePATHS at the eChallenges conference
PATHS at the eChallenges conference
 
PATHS at the EAA conference 2013
PATHS at the EAA conference 2013PATHS at the EAA conference 2013
PATHS at the EAA conference 2013
 
PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013
 
Comparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationComparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentation
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual SimilaritySemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
 
Comparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentsComparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documents
 
PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0
 
PATHS Second prototype-functional-spec
PATHS Second prototype-functional-specPATHS Second prototype-functional-spec
PATHS Second prototype-functional-spec
 
PATHS Final state of art monitoring report v0_4
PATHS  Final state of art monitoring report v0_4PATHS  Final state of art monitoring report v0_4
PATHS Final state of art monitoring report v0_4
 
PATHS first paths prototype
PATHS first paths prototypePATHS first paths prototype
PATHS first paths prototype
 
PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2
 
PATHS Content processing 1st prototype
PATHS  Content processing 1st prototypePATHS  Content processing 1st prototype
PATHS Content processing 1st prototype
 

Dernier

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Dernier (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Recommendations for the automatic enrichment of digital library content using open source software, PATHS report

  • 1. Personalised access to cultural heritage spaces Recommendations for the automatic enrichment of DL content using open source software Authors: Eneko Agirre and Arantxa Otegi www.paths-project.eu PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
  • 2. Recommendations for the automatic enrichment of DL content using open source software 1 Introduction 2 Producing intra-collections links 2.1 Similarity links 2.2 Typed-similarity links 3 Producing background links 4 Ontology extension References 1 Introduction PATHS uses text processing software to enable the following functionality in the prototypes (see deliverables D2.1 and D2.2 http://paths-project.eu/eng/Resources): ● ● ● intra-collection links background links ontology extension The aim of PATHS is to investigate the use of those functionalities to better serve exploration by users. Most of the software is in-house and relatively complex. The release of the software as open source was not in the DOW, and the exploitations plan goes in the software-as-aservice direction, where the processing is done in servers prepared by the partners. For instance, in a closely related Best Practice Network (LoCloud http://www.locloud.eu/), servers for automatic content enrichment of digital library items are to be produced. Alternatively, projects like OpeNER (http://www.opener-project.org/) are devoted to the release of open source tools which are similar to the ones used for text processing in PATHS. The release of all PATHS software as open source out-of-the-box packages would be a major undertaking, on a par to those European projects. The goal of this document is to present an overall set of recommendations for the automatic enrichment of Digital Libraries content using open source software. We think this would be useful for third-parties who would like to offer similar services. Note that this is not a step-bystep guide for reimplementation, but an overall view of the required software and programming effort involved. This document is structured according to each of the enrichment tasks described in PATHS deliverables D2.1 and D2.2. PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
  • 3. 2 Producing intra-collections links PATHS produced both generic similarity links and more specific typed-similarity links. 2.1 Similarity links The target items would need to be parsed by the following Natural Language Processing (NLP) tools: Pos tagging and lemmatization. Several open source products exist, including the two used in PATHS: CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) and Freeling (http://nlp.lsi.upc.edu/freeling/). Regarding multilinguality PATHS worked on Spanish and English texts. Freeling covers both, and Stanford works out-of-the-box for English. Recent projects like OpenNER also offer a suite of open source NLP tools including, in addition to English and Spanish, four other European languages. In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the similarity links requires additional software, which in this case has been developed in-house. No out-of-the box alternatives exist. The interested parties would need to replicate the software described in (Aletras et al. 2012). As an alternative, the PATHS server demonstrating similarity links described in (Agirre et al. 2013b) uses the functionality provided by a search engine (Solr http://lucene.apache.org/solr/) to provide similar items. Note that this alternative does not require additional NLP tools, as it uses their own stop words and stemming algorithm. 2.2 Typed-similarity links The target items would need to be parsed by the following NLP tools: Pos tagging and lemmatization (see previous section). In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the typed-similarity links requires additional software, including open source machine learning software (Weka http://www.cs.waikato.ac.nz/ml/weka/) and the inhouse scripts to extract features from items, train the machine learning models on the publicly available typed-similarity datasets produced by PATHS (http://ixa2.si.ehu.es/sts/), and use the machine learning models on the target items. No out-of-the-box alternatives exist. The interested parties would need to replicate the software as described in (Agirre et al. 2013a).
  • 4. 3 Producing background links In order to produce background links we used Wikipedia Miner, an open source software available at http://wikipedia-miner.cms.waikato.ac.nz. We used wikipedia miner out-of-thebox. In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. 4 Ontology extension The target items would need to be parsed by the following NLP tools: Pos tagging and lemmatisation (see previous section). In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the vocabulary requires additional software. We first extract the background links (see previous section), and then find the most relevant Wikipedia articles per item. This is done globally, analysing the statistics of the whole collection. Those articles are used to categorize the items according to a Wikipedia-based category system, which is trimmed-down to only cover the categories which are relevant to the collection at hand. No out-of-the-box alternatives exist. The interested parties would need to replicate the software as described in (Fernando et al. 2012). References Agirre E., Aletras N., Gonzalez-Agirre A., Rigau G., Stevenson M. (2013a). UBC UOSTYPED: Regression for typed-similarity. The Second Joint Conference on Lexical and Computational Semantics (*SEM 2013) Agirre E., Barrena A., Fernandez K., Miranda E., Otegi A., Soroa A. (2013b). PATHSenrich: a Web Service Prototype for Automatic Cultural Heritage Item Enrichment in Research and Advanced Technology for Digital Libraries, International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Lecture Notes in Computer Science, vol. 8092, pp 462-465. Aletras N., Stevenson M., Clough P. (2012). Computing Similarity between Items in a Digital Library of Cultural Heritage. In ACM JOCCH. Fernando S., Hall M., Agirre E., Soroa A., Clough P., Stevenson M. (2012). Comparing taxonomies for organising collections of documents. In Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pages 879-894, Mumbai, India, December 2012.