2/14/2014
Biotea, RDF4PMC

RDF4PMC, RDFizing PubMed Central
Alexander Garcia (1), Leyla Jael García Castro (2), Casey McLaughlin (1)
(1) Florida State University
(2) Universitat Jaume I

Outline

• The Biotea project
• Why Semantic Web Technologies?
• RDF4PMC in a nutshell
• Architecture
• RDFization process
  • PMC RDFization
  • Content enrichment
  • Some numbers for RDF4PMC
  • Architecture
• Using the data
  • SPARQL
  • Bio2RDF integration
  • Web services
  • A first prototype
• Challenges and Lessons
• Currently working on…
• Future Work
• Conclusions
• Acknowledgments

Biotea

"Scholarly data and documents are of most value when they are interconnected rather than independent" (Christine L. Borgman)

• Methodologies, methods and techniques supporting semantic enrichment of scholarly communication
• Once enriched, then how is this changing our user experience?

Biotea

• How are publications connected to each other?
• Putting together explicit assertions from different papers to form new implicit assertions
• Semantic Web Technology supporting scholarly communication, Literature Based Discovery and the Search-Retrieval-and-Interacting-with-the-Document (SRID) processes

Why SWT (Semantic Web Technologies) for research documents

• Generates an adaptable open approach; the data becomes the platform
• The SW delivers an integrative platform
• Makes it easier for the community to build over the platform
• Simplifies programmatic access to information
  • Retrieve all papers that have a component X (CHEBI) and the cellular location in GO terms
  • As simple as relating terminologies
• Delivers Social Network ready content

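The retrieval example on this slide (papers annotated with a CHEBI component and a GO cellular location) could be written as a SPARQL query over annotation triples. The sketch below only builds the query text. It reuses the `bibo:pmid`, `ao:annotatesResource` and `ao:hasTopic` properties that appear in the SPARQL slides of this deck; the namespace URIs, the helper names and the GO identifier GO:0005739 are illustrative assumptions, not values taken from the slides.

```python
# Assumed namespace URIs; the slides use these prefixes without declaring them.
PREFIXES = {
    "bibo": "http://purl.org/ontology/bibo/",
    "ao": "http://purl.org/ao/",
}

OBO_BASE = "http://purl.obolibrary.org/obo/"


def obo_uri(curie: str) -> str:
    """Turn a CURIE such as 'CHEBI:60004' into an OBO PURL."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_BASE}{prefix}_{local_id}"


def papers_with_topics(chebi_curie: str, go_curie: str, limit: int = 50) -> str:
    """Build a query for articles annotated with both topics (hypothetical helper)."""
    header = "\n".join(f"PREFIX {p}: <{u}>" for p, u in PREFIXES.items())
    return f"""{header}
SELECT DISTINCT ?pmid
WHERE {{
  ?article bibo:pmid ?pmid .
  ?chemAnnotation ao:annotatesResource ?article ;
                  ao:hasTopic <{obo_uri(chebi_curie)}> .
  ?locAnnotation ao:annotatesResource ?article ;
                 ao:hasTopic <{obo_uri(go_curie)}> .
}} LIMIT {limit}"""


if __name__ == "__main__":
    print(papers_with_topics("CHEBI:60004", "GO:0005739"))
```

Using two separate annotation variables that point at the same `?article` is what expresses "relating terminologies": each vocabulary contributes its own annotation, and the article is the join point.
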
RDF4PMC in a nutshell

• Delivers an interoperable, interlinked, and self-describing document model in the biomedical domain
• A network of interconnected documents
• Semantic infrastructure for PMC
• An interface to the Web of Data
• A knowledge model for biomedical literature, easily extensible

RDF4PMC in a nutshell

• RDFizing biomedical literature by orchestrating ontologies such as
  • DoCO, BIBO, DC, FOAF, W3C PROV, and others
• Datasets are available
  • RDF for metadata and content
  • RDF for annotations from text-mining
• RDFizator will be available
  • Adding other ontologies and annotators is possible
  • Working with XML from other sources is possible

PMC RDFization

[Diagram. Labels: PMC NXML; RDFReactor; RDF Generation; Metadata + Content + References; References Enrichment]

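To make the step from PMC NXML to RDF more concrete, here is a minimal sketch that reads a tiny NXML-like document and emits N-Triples for the article, its sections and its paragraphs. It is not the project's implementation: the slide names RDFReactor, a Java library, whereas this sketch uses only the Python standard library. The sample XML, the `http://example.org/pmc/` URI base, the namespace URIs and the function names are all assumptions; the classes and properties (`bibo:Document`, `doco:Section`, `doco:Paragraph`, `dcterms:isPartOf`, `cnt:chars`) are the ones used in the SPARQL slides of this deck.

```python
import xml.etree.ElementTree as ET

# Minimal, hypothetical NXML-like input; real PMC files are far richer.
SAMPLE_NXML = """<article>
  <front><article-meta>
    <article-id pub-id-type="pmid">12345</article-id>
    <title-group><article-title>A sample article</article-title></title-group>
  </article-meta></front>
  <body>
    <sec><title>Introduction</title><p>Cancer is studied here.</p></sec>
  </body>
</article>"""

BASE = "http://example.org/pmc/"  # hypothetical URI base, not Biotea's scheme
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
DOCO = "http://purl.org/spar/doco/"
CNT = "http://www.w3.org/2011/content#"


def uri(value):
    """Wrap an absolute URI in N-Triples angle brackets."""
    return f"<{value}>"


def literal(text):
    """Serialize a plain string literal with basic N-Triples escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def rdfize(nxml):
    """Return N-Triples lines for the article, its sections and paragraphs."""
    root = ET.fromstring(nxml)
    pmid = root.findtext(".//article-id[@pub-id-type='pmid']")
    title = root.findtext(".//article-title")
    article = BASE + pmid
    triples = [
        (uri(article), uri(RDF_TYPE), uri(BIBO + "Document")),
        (uri(article), uri(BIBO + "pmid"), literal(pmid)),
        (uri(article), uri(DCTERMS + "title"), literal(title)),
    ]
    for s_idx, sec in enumerate(root.iterfind("./body/sec"), start=1):
        section = f"{article}/section/{s_idx}"
        triples += [
            (uri(section), uri(RDF_TYPE), uri(DOCO + "Section")),
            (uri(section), uri(DCTERMS + "isPartOf"), uri(article)),
            (uri(section), uri(DCTERMS + "title"), literal(sec.findtext("title", default=""))),
        ]
        for p_idx, para in enumerate(sec.iterfind("p"), start=1):
            paragraph = f"{section}/paragraph/{p_idx}"
            text = "".join(para.itertext()).strip()
            triples += [
                (uri(paragraph), uri(RDF_TYPE), uri(DOCO + "Paragraph")),
                (uri(paragraph), uri(DCTERMS + "isPartOf"), uri(section)),
                (uri(paragraph), uri(CNT + "chars"), literal(text)),
            ]
    return [" ".join(triple) + " ." for triple in triples]


if __name__ == "__main__":
    print("\n".join(rdfize(SAMPLE_NXML)))
```
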
Annotations: Content Enrichment

[Diagram. Labels: RDF Generation; Metadata + Content + References; Automatic Annotation; Web services; Enriched RDF]

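As a rough illustration of what content enrichment produces, the sketch below turns term matches in paragraph text into annotation triples. The slide indicates that annotation is done through web services; here a small in-memory dictionary stands in for them, which is a deliberate simplification. The dictionary, the annotation URI pattern, the function name and the `ao`/`aot` namespace URIs are assumptions; the `aot:ExactQualifier` class, the `ao:annotatesResource` and `ao:hasTopic` properties, and the "mixture" to CHEBI:60004 mapping come from the SPARQL slides of this deck.

```python
import re

# Hypothetical term dictionary; a stand-in for the external annotation services.
TERM_TO_TOPIC = {
    "mixture": "http://purl.obolibrary.org/obo/CHEBI_60004",
}

AO = "http://purl.org/ao/"          # assumed namespace URI
AOT = "http://purl.org/ao/types/"   # assumed namespace URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def annotate(article_uri, paragraphs):
    """Return N-Triples lines, one annotation per (paragraph, term) match."""
    lines = []
    count = 0
    for text in paragraphs:
        for term, topic in TERM_TO_TOPIC.items():
            # Whole-word, case-insensitive match of the dictionary term.
            if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
                count += 1
                annotation = f"<{article_uri}/annotation/{count}>"
                lines += [
                    f"{annotation} <{RDF_TYPE}> <{AOT}ExactQualifier> .",
                    f"{annotation} <{AO}annotatesResource> <{article_uri}> .",
                    f"{annotation} <{AO}hasTopic> <{topic}> .",
                ]
    return lines


if __name__ == "__main__":
    sample = ["The mixture was heated.", "No match here."]
    print("\n".join(annotate("http://example.org/pmc/12345", sample)))
```
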
RDF4PMC, some numbers

[Figure not preserved in this transcript]

RDF4PMC Server Architecture

[Diagram. Labels: PMC RDFization; Import scripts + RDF files; Master Server with RDF DB Master; Web & SPARQL Server (development) with RDF DB Slave; Web & SPARQL Server (production) with RDF DB Slave]

Consuming the data: SPARQL

Query expressed in natural language:

Retrieving PubMed identifier, article title, section title, and paragraphs for those articles containing the term “cancer” in any section whose title includes “introduction”

SPARQL query:

    SELECT ?pmid ?title ?secTitle ?text
    WHERE {
      ?article a bibo:Document ;
               bibo:pmid ?pmid ;
               dcterms:title ?title .
      ?section a doco:Section ;
               dcterms:isPartOf ?article ;
               dcterms:title ?secTitle .
      FILTER (regex(str(?secTitle), "introduction", "i")).
      ?para a doco:Paragraph ;
            dcterms:isPartOf ?section ;
            cnt:chars ?text .
      FILTER (regex(str(?text), "cancer", "i")).
    } LIMIT 50

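The query on this slide omits its PREFIX declarations, so it will only run as-is on an endpoint that predefines them. A minimal sketch of a self-contained version follows: it prepends declarations, parametrizes the two search strings, and encodes the result as a SPARQL protocol GET request. The namespace URIs are the commonly published ones for these vocabularies and are assumed here rather than taken from the slide; the endpoint URL and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed namespace URIs; the slide uses the prefixes without declaring them.
PREFIXES = {
    "bibo": "http://purl.org/ontology/bibo/",
    "dcterms": "http://purl.org/dc/terms/",
    "doco": "http://purl.org/spar/doco/",
    "cnt": "http://www.w3.org/2011/content#",
}

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL


def paragraphs_query(term: str, section: str, limit: int = 50) -> str:
    """Build the slide's query for a given term and section title."""
    for value in (term, section):
        # Keep the sketch safe: only plain words go into the regex literals.
        if not value.replace(" ", "").isalnum():
            raise ValueError(f"unsupported search text: {value!r}")
    header = "\n".join(f"PREFIX {p}: <{u}>" for p, u in PREFIXES.items())
    return f"""{header}
SELECT ?pmid ?title ?secTitle ?text
WHERE {{
  ?article a bibo:Document ;
           bibo:pmid ?pmid ;
           dcterms:title ?title .
  ?section a doco:Section ;
           dcterms:isPartOf ?article ;
           dcterms:title ?secTitle .
  FILTER (regex(str(?secTitle), "{section}", "i"))
  ?para a doco:Paragraph ;
        dcterms:isPartOf ?section ;
        cnt:chars ?text .
  FILTER (regex(str(?text), "{term}", "i"))
}} LIMIT {limit}"""


def request_url(endpoint: str, query: str) -> str:
    """Encode the query as the `query` parameter of a SPARQL protocol GET request."""
    return f"{endpoint}?{urlencode({'query': query})}"


if __name__ == "__main__":
    print(request_url(ENDPOINT, paragraphs_query("cancer", "introduction")))
```

Rejecting anything but plain words is a blunt choice made to avoid quoting and regex-escaping issues in a short example; a fuller version would escape the strings instead.
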
Consuming the data: SPARQL

Query expressed in natural language:

Retrieving PubMed identifier for those articles that have been semantically annotated with the biological entity CHEBI:60004. The semantic annotation comes from the occurrence of the term “mixture” in any paragraph of the retrieved articles.

SPARQL query:

    SELECT distinct ?pmid
    WHERE {
      ?article a bibo:AcademicArticle ;
               bibo:pmid ?pmid .
      ?annotation a aot:ExactQualifier ;
                  ao:annotatesResource ?article ;
                  ao:hasTopic <http://purl.obolibrary.org/obo/CHEBI_60004> .
    }

CHEBI:60004: A mixture is a chemical substance composed of multiple molecules, at least two of which are of a different kind

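Once a query like this one has been sent to an endpoint, the PubMed identifiers still have to be pulled out of the response. The sketch below assumes the endpoint returns the W3C SPARQL 1.1 Query Results JSON Format; the sample response, its identifiers and the function name are made up for illustration.

```python
import json

# Hypothetical response in the SPARQL 1.1 Query Results JSON Format.
SAMPLE_RESPONSE = """{
  "head": {"vars": ["pmid"]},
  "results": {"bindings": [
    {"pmid": {"type": "literal", "value": "11111111"}},
    {"pmid": {"type": "literal", "value": "22222222"}}
  ]}
}"""


def column(response_text, variable):
    """Collect the values bound to one variable, skipping rows where it is unbound."""
    data = json.loads(response_text)
    return [
        row[variable]["value"]
        for row in data["results"]["bindings"]
        if variable in row
    ]


if __name__ == "__main__":
    print(column(SAMPLE_RESPONSE, "pmid"))
```
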
Bio2RDF Integration

[Diagram. Labels: Metadata & References; Content; Annotations]

Consuming the data: Web services

Retrieval: Service

• A list of topics and their related vocabularies: http://biotea.idiginfo.org/api/topics
• A list of terms and their related topics: http://biotea.idiginfo.org/api/terms
• A list of vocabularies and their prefixes: http://biotea.idiginfo.org/vocabularies
• All topics related to a term: e.g., http://biotea.idiginfo.org/api/topics?term=cancer
• All vocabularies related to a term: e.g., http://biotea.idiginfo.org/api/vocabularies?term=cancer
• All terms that start with a specific string (for autocompletion): e.g., http://biotea.idiginfo.org/api/terms?prefix=canc
• All topics related to a vocabulary: e.g., http://biotea.idiginfo.org/api/topics?vocabulary=po
• RDF of articles that include a term: e.g., http://biotea.idiginfo.org/api/articles?term=cancer
• Count of RDF of articles that include a term: e.g., http://biotea.idiginfo.org/api/articles?term=cancer&count=true
• RDF of articles that include a vocabulary: e.g., http://biotea.idiginfo.org/api/articles?vocabulary=po

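The service URLs listed on this slide follow a regular pattern (a resource name plus optional query parameters), so a client can assemble them rather than hard-code them. The sketch below only builds URLs and does not contact the service, which may no longer be reachable at this address. The function name and the set of allowed resources are assumptions derived from the examples on the slide.

```python
from urllib.parse import urlencode

API_BASE = "http://biotea.idiginfo.org/api"

# Resource names that appear in the slide's examples.
RESOURCES = {"topics", "terms", "vocabularies", "articles"}


def service_url(resource, **params):
    """Build a service URL such as .../api/topics?term=cancer (hypothetical helper)."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    url = f"{API_BASE}/{resource}"
    # Parameters keep the order in which they were passed.
    return f"{url}?{urlencode(params)}" if params else url


if __name__ == "__main__":
    print(service_url("topics"))
    print(service_url("terms", prefix="canc"))
    print(service_url("articles", term="cancer", count="true"))
```
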
Consuming the data: a dashboard for semantic bio-publications

[Diagram. Labels: Semantically enriched publication; Metadata + Content + References; Automatically Annotated RDF; SPARQL]

Consuming the data: first prototype

[Screenshot. Labels: Catalase; Cloud of Bio-annotations (term + # of bio-entities); Title & authors; Links; Abstract; Paragraphs containing the annotation selected by the user; Graphical tools]

Challenges and Lessons

• Content
  • Tables and images → Links
  • Inline tables → Format is lost
  • Supplementary material
  • Most of them follow one DTD but …
• References
  • At least 4 different styles
  • Sometimes they are just plain text
• Annotators
  • Not always available
  • Stop words are tricky

Challenges and Lessons

• Annotation is context dependent
• Where are the facts? How to validate the facts?
• Delivering the expressivity of the data set to the end user is a complex issue
• Maintaining the triple store has a learning curve of its own
• Building SW infrastructure is H A R D

Currently working on: Literature Discovery Process

• Search
  • Usually string-based search mechanisms
  • Little cognitive support
• Retrieval
  • Simple list of DB entries
  • Little cognitive support
  • How, why and where are a set of documents similar?
• Interacting with the document
  • Straight into the PDF
  • Zero cognitive support
  • Data availability

Future Work

• RDF
  • URI standardization following similar patterns to identifiers.org and Bio2RDF
  • Integration into Bio2RDF
  • Dataset identification and summary (VoID)
  • Improve data for references
• User Experience
  • Web services for data analysis
  • RDF browser
  • More visualization tools
  • Supporting and taking advantage of the structure of the document
• Collaborative element

Future Work

• Application in Clinical Psychology, the MSRC case
  • From PDF to XML to RDF to Enriched Metadata for the PDF
  • The PDF is gently introduced in the WoD
  • Once the metadata has been enriched then
    • Rich interaction supporting: SEARCH-RETRIEVAL-INTERACTION WITH THE DOCUMENT (PDF)

Conclusions

• We provide
  • the transformation into RDF from the original PMC files
  • the annotation of the RDF
  • an API which makes that data available
• New vocabularies as well as annotators can easily be plugged in
• Our approach is useful for both open and non-open access datasets
  • Publishers may decide what to expose via RDF and what content to make available
• Our approach is also applicable for PDF-only environments

Acknowledgments

• The MSRC consortium
• Greg Riccardi, FSU
• Oscar Corcho, UPM
• Olga Giraldo, UPM
• Bob Morris, Harvard University
• Michel Dumontier, Carleton University
• Dietrich Rebholz-Schuhmann, University of Zurich
• Diane Leiva, FSU
• US DoD Grant MOMRP Grant w81xwh-10-2-0181
• All of those who gave us feedback about the RDFization and the quality of our RDF datasets

Contacts

• Alexander García: agarciac@gmail.com
• L. Jael García Castro: leylajael@gmail.com

Thanks for your attention

More Related Content

What's hot

Linked Data for Libraries: Experiments between Cornell, Harvard and Stanford
Linked Data for Libraries: Experiments between Cornell, Harvard and StanfordLinked Data for Libraries: Experiments between Cornell, Harvard and Stanford
Linked Data for Libraries: Experiments between Cornell, Harvard and StanfordSimeon Warner
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
MR^3: Meta-Model Management based on RDFs Revision Reflection
MR^3: Meta-Model Management based on RDFs Revision ReflectionMR^3: Meta-Model Management based on RDFs Revision Reflection
MR^3: Meta-Model Management based on RDFs Revision ReflectionTakeshi Morita
 
Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data ApplicationsEUCLID project
 
The OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectThe OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectAlexandro Colorado
 
Linked Data Modeling for Beginner
Linked Data Modeling for BeginnerLinked Data Modeling for Beginner
Linked Data Modeling for BeginnerMyungjin Lee
 
WWW2014 Overview of W3C Linked Data Platform 20140410
WWW2014 Overview of W3C Linked Data Platform 20140410WWW2014 Overview of W3C Linked Data Platform 20140410
WWW2014 Overview of W3C Linked Data Platform 20140410Arnaud Le Hors
 
Querying Linked Data on Android
Querying Linked Data on AndroidQuerying Linked Data on Android
Querying Linked Data on AndroidEUCLID project
 
Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013François Belleau
 
Documents, services, and data on the web
Documents, services, and data on the webDocuments, services, and data on the web
Documents, services, and data on the webChiara Del Vescovo
 
DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." Avalon Media System
 
Re-using Media on the Web: Media fragment re-mixing and playout
Re-using Media on the Web: Media fragment re-mixing and playoutRe-using Media on the Web: Media fragment re-mixing and playout
Re-using Media on the Web: Media fragment re-mixing and playoutMediaMixerCommunity
 

What's hot (20)

semanticweb
semanticwebsemanticweb
semanticweb
 
Linked Data for Libraries: Experiments between Cornell, Harvard and Stanford
Linked Data for Libraries: Experiments between Cornell, Harvard and StanfordLinked Data for Libraries: Experiments between Cornell, Harvard and Stanford
Linked Data for Libraries: Experiments between Cornell, Harvard and Stanford
 
Xiaoli Li: MARC to BIBFRAME (Linked Data)
Xiaoli Li: MARC to BIBFRAME (Linked Data)Xiaoli Li: MARC to BIBFRAME (Linked Data)
Xiaoli Li: MARC to BIBFRAME (Linked Data)
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
MR^3: Meta-Model Management based on RDFs Revision Reflection
MR^3: Meta-Model Management based on RDFs Revision ReflectionMR^3: Meta-Model Management based on RDFs Revision Reflection
MR^3: Meta-Model Management based on RDFs Revision Reflection
 
Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data Applications
 
The OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectThe OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit Project
 
Linked Data Modeling for Beginner
Linked Data Modeling for BeginnerLinked Data Modeling for Beginner
Linked Data Modeling for Beginner
 
Timbuctoo 2 EASY
Timbuctoo 2 EASYTimbuctoo 2 EASY
Timbuctoo 2 EASY
 
WWW2014 Overview of W3C Linked Data Platform 20140410
WWW2014 Overview of W3C Linked Data Platform 20140410WWW2014 Overview of W3C Linked Data Platform 20140410
WWW2014 Overview of W3C Linked Data Platform 20140410
 
Querying Linked Data on Android
Querying Linked Data on AndroidQuerying Linked Data on Android
Querying Linked Data on Android
 
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early AdoptersApril 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
 
April 24, 2013 NISO/DCMI Webinar: Deployment of RDA (Resource Description and...
April 24, 2013 NISO/DCMI Webinar: Deployment of RDA (Resource Description and...April 24, 2013 NISO/DCMI Webinar: Deployment of RDA (Resource Description and...
April 24, 2013 NISO/DCMI Webinar: Deployment of RDA (Resource Description and...
 
Querying Linked Data
Querying Linked DataQuerying Linked Data
Querying Linked Data
 
Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013Producing, publishing and consuming linked data - CSHALS 2013
Producing, publishing and consuming linked data - CSHALS 2013
 
Thompson 6-jun15-final
Thompson 6-jun15-finalThompson 6-jun15-final
Thompson 6-jun15-final
 
Documents, services, and data on the web
Documents, services, and data on the webDocuments, services, and data on the web
Documents, services, and data on the web
 
Fedora Migration Considerations
Fedora Migration ConsiderationsFedora Migration Considerations
Fedora Migration Considerations
 
DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World." DLF 2015 Presentation, "RDF in the Real World."
DLF 2015 Presentation, "RDF in the Real World."
 
Re-using Media on the Web: Media fragment re-mixing and playout
Re-using Media on the Web: Media fragment re-mixing and playoutRe-using Media on the Web: Media fragment re-mixing and playout
Re-using Media on the Web: Media fragment re-mixing and playout
 

Viewers also liked

Monday presentation 1336-may23
Monday presentation 1336-may23Monday presentation 1336-may23
Monday presentation 1336-may23alexander garcia
 
Scientific background: chemistry student
Scientific background: chemistry studentScientific background: chemistry student
Scientific background: chemistry studentFederico Floris
 
Biotea poster biolinks at ISMB 2013
Biotea poster biolinks at ISMB 2013Biotea poster biolinks at ISMB 2013
Biotea poster biolinks at ISMB 2013alexander garcia
 
Paper as a Research Object
Paper as a Research ObjectPaper as a Research Object
Paper as a Research Objectalexander garcia
 
LDQL: A Query Language for the Web of Linked Data
LDQL: A Query Language for the Web of Linked DataLDQL: A Query Language for the Web of Linked Data
LDQL: A Query Language for the Web of Linked DataOlaf Hartig
 
OWLGrEd Ontology Visualizer
OWLGrEd Ontology VisualizerOWLGrEd Ontology Visualizer
OWLGrEd Ontology VisualizerUldis Bojars
 
The Semantics of SPARQL
The Semantics of SPARQLThe Semantics of SPARQL
The Semantics of SPARQLOlaf Hartig
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome EconomyHelge Tennø
 

Viewers also liked (9)

Monday presentation 1336-may23
Monday presentation 1336-may23Monday presentation 1336-may23
Monday presentation 1336-may23
 
Scientific background: chemistry student
Scientific background: chemistry studentScientific background: chemistry student
Scientific background: chemistry student
 
Biotea poster biolinks at ISMB 2013
Biotea poster biolinks at ISMB 2013Biotea poster biolinks at ISMB 2013
Biotea poster biolinks at ISMB 2013
 
Paper as a Research Object
Paper as a Research ObjectPaper as a Research Object
Paper as a Research Object
 
LDQL: A Query Language for the Web of Linked Data
LDQL: A Query Language for the Web of Linked DataLDQL: A Query Language for the Web of Linked Data
LDQL: A Query Language for the Web of Linked Data
 
OWLGrEd Ontology Visualizer
OWLGrEd Ontology VisualizerOWLGrEd Ontology Visualizer
OWLGrEd Ontology Visualizer
 
Nanotweets
NanotweetsNanotweets
Nanotweets
 
The Semantics of SPARQL
The Semantics of SPARQLThe Semantics of SPARQL
The Semantics of SPARQL
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome Economy
 

Similar to RDF for PubMedCentral

11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”DuraSpace
 
Modern PHP RDF toolkits: a comparative study
Modern PHP RDF toolkits: a comparative studyModern PHP RDF toolkits: a comparative study
Modern PHP RDF toolkits: a comparative studyMarius Butuc
 
W4 4 marc-alexandre-nolin-v2
W4 4 marc-alexandre-nolin-v2W4 4 marc-alexandre-nolin-v2
W4 4 marc-alexandre-nolin-v2nolmar01
 
Wed batsakis tut_challenges of preservations
Wed batsakis tut_challenges of preservationsWed batsakis tut_challenges of preservations
Wed batsakis tut_challenges of preservationseswcsummerschool
 
Wed batsakis tut_chalasdlenges of preservations
Wed batsakis tut_chalasdlenges of preservationsWed batsakis tut_chalasdlenges of preservations
Wed batsakis tut_chalasdlenges of preservationseswcsummerschool
 
RDA: Alive and Well and Still Speaking MARC
RDA: Alive and Well and Still Speaking MARCRDA: Alive and Well and Still Speaking MARC
RDA: Alive and Well and Still Speaking MARCDiane Hillmann
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012François Belleau
 
RDFa Introductory Course Session 3/4 Why RDFa
RDFa Introductory Course Session 3/4 Why RDFaRDFa Introductory Course Session 3/4 Why RDFa
RDFa Introductory Course Session 3/4 Why RDFaPlatypus
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareKerstin Forsberg
 
Linked data for Libraries
Linked data for LibrariesLinked data for Libraries
Linked data for Librariesrobin fay
 
Library LicExamination Review Class 2016
Library LicExamination Review Class 2016Library LicExamination Review Class 2016
Library LicExamination Review Class 2016SuJunMEjDaEnchu
 
Ifla swsig meeting - Puerto Rico - 20110817
Ifla swsig meeting - Puerto Rico - 20110817Ifla swsig meeting - Puerto Rico - 20110817
Ifla swsig meeting - Puerto Rico - 20110817Figoblog
 
RDF Linked Data - Automatic Exchange of BIM Containers
RDF Linked Data - Automatic Exchange of BIM ContainersRDF Linked Data - Automatic Exchange of BIM Containers
RDF Linked Data - Automatic Exchange of BIM ContainersSafe Software
 

Similar to RDF for PubMedCentral (20)

11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
11.5.14 Presentation Slides, “Fedora 4.0 in Action at Penn State and Stanford”
 
Freire model api
Freire model apiFreire model api
Freire model api
 
Library Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic ControlLibrary Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic Control
 
Modern PHP RDF toolkits: a comparative study
Modern PHP RDF toolkits: a comparative studyModern PHP RDF toolkits: a comparative study
Modern PHP RDF toolkits: a comparative study
 
W4 4 marc-alexandre-nolin-v2
W4 4 marc-alexandre-nolin-v2W4 4 marc-alexandre-nolin-v2
W4 4 marc-alexandre-nolin-v2
 
Converting GHO to RDF
Converting GHO to RDFConverting GHO to RDF
Converting GHO to RDF
 
Wed batsakis tut_challenges of preservations
Wed batsakis tut_challenges of preservationsWed batsakis tut_challenges of preservations
Wed batsakis tut_challenges of preservations
 
Wed batsakis tut_chalasdlenges of preservations
Wed batsakis tut_chalasdlenges of preservationsWed batsakis tut_chalasdlenges of preservations
Wed batsakis tut_chalasdlenges of preservations
 
RDA: Alive and Well and Still Speaking MARC
RDA: Alive and Well and Still Speaking MARCRDA: Alive and Well and Still Speaking MARC
RDA: Alive and Well and Still Speaking MARC
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012
 
Why rdfa
Why rdfaWhy rdfa
Why rdfa
 
RDFa Introductory Course Session 3/4 Why RDFa
RDFa Introductory Course Session 3/4 Why RDFaRDFa Introductory Course Session 3/4 Why RDFa
RDFa Introductory Course Session 3/4 Why RDFa
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcare
 
Lawless-3-jun15
Lawless-3-jun15Lawless-3-jun15
Lawless-3-jun15
 
Linked Data Competency Index : Mapping the field for teachers and learners
 Linked Data Competency Index : Mapping the field for teachers and learners Linked Data Competency Index : Mapping the field for teachers and learners
Linked Data Competency Index : Mapping the field for teachers and learners
 
Scholze liber 2015-06-25_final
Scholze liber 2015-06-25_finalScholze liber 2015-06-25_final
Scholze liber 2015-06-25_final
 
Linked data for Libraries
Linked data for LibrariesLinked data for Libraries
Linked data for Libraries
 
Library LicExamination Review Class 2016
Library LicExamination Review Class 2016Library LicExamination Review Class 2016
Library LicExamination Review Class 2016
 
Ifla swsig meeting - Puerto Rico - 20110817
Ifla swsig meeting - Puerto Rico - 20110817Ifla swsig meeting - Puerto Rico - 20110817
Ifla swsig meeting - Puerto Rico - 20110817
 
RDF Linked Data - Automatic Exchange of BIM Containers
RDF Linked Data - Automatic Exchange of BIM ContainersRDF Linked Data - Automatic Exchange of BIM Containers
RDF Linked Data - Automatic Exchange of BIM Containers
 

Recently uploaded

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Recently uploaded (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition

RDF for PubMedCentral

  • 1. RDF4PMC, RDFizing PubMed Central. Alexander Garcia (1), Leyla Jael García Castro (2), Casey McLaughlin (1). (1) Florida State University; (2) Universitat Jaume I. Biotea, RDF4PMC, 2/14/2014
  • 2. Outline • The Biotea project • Why Semantic Web Technologies? • RDF4PMC in a nutshell • Architecture • RDFization process: PMC RDFization; Content enrichment; Some numbers for RDF4PMC; Architecture • Using the data: SPARQL; Bio2RDF integration; Web services; A first prototype • Challenges and Lessons • Currently working on… • Future Work • Conclusions • Acknowledgments
  • 3. Biotea. “Scholarly data and documents are of most value when they are interconnected rather than independent” (Christine L. Borgman) • Methodologies, methods and techniques supporting semantic enrichment of scholarly communication • Once enriched, how is this changing our user experience?
  • 4. Biotea. “Scholarly data and documents are of most value when they are interconnected rather than independent” (Christine L. Borgman) • How are publications connected to each other? • Putting together explicit assertions from different papers to form new implicit assertions • Semantic Web Technology supporting scholarly communication, Literature Based Discovery and the Search-Retrieval-and-Interacting-with-the-Document (SRID) processes
  • 5. Why SWT for research documents • Generates an adaptable open approach, the data becomes the platform • The SW delivers an integrative platform • Makes it easier for the community to build over the platform • Simplifies programmatic access to information • Retrieve all papers that have a component X (CHEBI) and the cellular location in GO terms • As simple as relating terminologies • Delivers Social Network ready content
  • 6. RDF4PMC in a nutshell • Delivers an interoperable, interlinked, and self-describing document model in the biomedical domain • A network of interconnected documents • Semantic infrastructure for PMC • An interface to the Web of Data • A knowledge model for biomedical literature, easily extendible
  • 7. RDF4PMC in a nutshell • RDFizing biomedical literature by orchestrating ontologies such as DoCO, BIBO, DC, FOAF, W3C PROV, and others • Datasets are available: RDF for metadata and content; RDF for annotations from text-mining • RDFizator will be available: adding other ontologies and annotators is possible; working with XML from other sources is possible
  • 8. PMC RDFization [diagram labels: PMC NXML; RDFReactor; RDF Generation; References Enrichment; Metadata + Content + References]
  • 10. Annotations: Content Enrichment [diagram labels: Metadata + Content + References; RDF Generation; Automatic Annotation; Web service (two instances); Enriched RDF]
  • 13. RDF4PMC Server Architecture [diagram labels: PMC RDFization; Import scripts + RDF files; Master Server; RDF DB Master; RDF DB Slave (two instances); Web & SPARQL Server (development); Web & SPARQL Server (production)]
  • 14. Consuming the data: SPARQL
      Query expressed in natural language: Retrieving PubMed identifier, article title, section title, and paragraphs for those articles containing the term “cancer” in any section whose title includes “introduction”.
      SPARQL query:
        SELECT ?pmid ?title ?secTitle ?text
        WHERE {
          ?article a bibo:Document ;
                   bibo:pmid ?pmid ;
                   dcterms:title ?title .
          ?section a doco:Section ;
                   dcterms:isPartOf ?article ;
                   dcterms:title ?secTitle .
          FILTER (regex(str(?secTitle), "introduction", "i")).
          ?para a doco:Paragraph ;
                dcterms:isPartOf ?section ;
                cnt:chars ?text .
          FILTER (regex(str(?text), "cancer", "i")).
        }
        LIMIT 50
  • 15. Consuming the data: SPARQL
      Query expressed in natural language: Retrieving PubMed identifier for those articles that have been semantically annotated with the biological entity CHEBI:60004. The semantic annotation comes from the occurrence of the term “mixture” in any paragraph of the retrieved articles.
      CHEBI:60004: A mixture is a chemical substance composed of multiple molecules, at least two of which are of a different kind.
      SPARQL query:
        SELECT distinct ?pmid
        WHERE {
          ?article a bibo:AcademicArticle ;
                   bibo:pmid ?pmid .
          ?annotation a aot:ExactQualifier ;
                      ao:annotatesResource ?article ;
                      ao:hasTopic <http://purl.obolibrary.org/obo/CHEBI_60004> .
        }
  • 16. Bio2RDF Integration [diagram labels: Content; Metadata & References; Annotations]
  • 17. Consuming the data: Web services. Retrieval Service
      • A list of topics and their related vocabularies: http://biotea.idiginfo.org/api/topics
      • A list of terms and their related topics: http://biotea.idiginfo.org/api/terms
      • A list of vocabularies and their prefixes: http://biotea.idiginfo.org/vocabularies
      • All topics related to a term, e.g., http://biotea.idiginfo.org/api/topics?term=cancer
      • All vocabularies related to a term, e.g., http://biotea.idiginfo.org/api/vocabularies?term=cancer
      • All terms that start with a specific string (for autocompletion), e.g., http://biotea.idiginfo.org/api/terms?prefix=canc
      • All topics related to a vocabulary, e.g., http://biotea.idiginfo.org/api/topics?vocabulary=po
      • RDF of articles that include a term, e.g., http://biotea.idiginfo.org/api/articles?term=cancer
      • Count of RDF of articles that include a term, e.g., http://biotea.idiginfo.org/api/articles?term=cancer&count=true
      • RDF of articles that include a vocabulary, e.g., http://biotea.idiginfo.org/api/articles?vocabulary=po
  • 18. Consuming the data: a dashboard for semantic bio-publications [diagram labels: Semantically enriched publication; Metadata + Content + References; Automatically Annotated RDF; SPARQL; example term: Catalase]
  • 19. Consuming the data: first prototype [annotated screenshot labels: Title & authors; Links; Abstract; Cloud of Bio-annotations (term + # of bio-entities); Paragraphs containing the annotation selected by the user; Graphical tools]
  • 20. Consuming the data: A first prototype [screenshot]
  • 21. Challenges and Lessons • Content: Tables and images → Links; Inline tables → Format is lost; Supplementary material; Most of them follow one DTD but … • References: At least 4 different styles; Sometimes they are just plain text • Annotators: Not always available; Stop words are tricky
  • 22. Challenges and Lessons • Annotation is context dependent • Where are the facts? How to validate the facts? • Delivering the expressivity of the data set to the end user is a complex issue • Maintaining the triple store has a learning curve of its own • Building SW infrastructure is HARD
  • 23. Currently working on: Literature Discovery Process • Search • Usually string-based search mechanisms • Little cognitive support • Retrieval • Simple list of DB entries • Little cognitive support • Interacting with the document • Straight into the PDF • Zero cognitive support • Data availability
  • 25. Currently working on: Literature Discovery Process • Search • Usually string-based search mechanisms • Little cognitive support • Retrieval • Simple list of DB entries • Little cognitive support • How, why and where are a set of documents similar? • Interacting with the document • Straight into the PDF • Zero cognitive support
  • 27. Currently working on: Literature Discovery Process • Search • Usually string-based search mechanisms • Little cognitive support • Retrieval • Simple list of DB entries • Little cognitive support • Interacting with the document • Straight into the PDF • Zero cognitive support
  • 29. Future Work • User Experience: Web services for data analysis; RDF browser; More visualization tools; Supporting and taking advantage of the structure of the document • Collaborative element • RDF: URI standardization following similar patterns to identifiers.org and Bio2RDF; Integration into Bio2RDF; Dataset identification and summary (VoID); Improve data for references
  • 30. Future Work • Application in Clinical Psychology, the MSRC case • From PDF to XML to RDF to Enriched Metadata for the PDF • The PDF is gently introduced in the WoD • Once the metadata has been enriched, then: rich interaction supporting SEARCH-RETRIEVAL-INTERACTION WITH THE DOCUMENT (PDF)
  • 31. Conclusions • We provide: the transformation into RDF from the original PMC files; the annotation of the RDF; an API which makes that data available • New vocabularies as well as annotators can easily be plugged in • Our approach is useful for both open and non-open access datasets • Publishers may decide what to expose via RDF and what content to make available • Our approach is also applicable for PDF-only environments
  • 32. Acknowledgments • The MSRC consortium • Greg Riccardi, FSU • Oscar Corcho, UPM • Olga Giraldo, UPM • Bob Morris, Harvard University • Michel Dumontier, Carleton University • Dietrich Rebholz-Schuhmann, University of Zurich • Diane Leiva, FSU • US DoD Grant MOMRP Grant w81xwh-10-2-0181 • All of those who gave us feedback about the RDFization and the quality of our RDF datasets
  • 33. Contacts • Alexander García: agarciac@gmail.com • L. Jael García Castro: leylajael@gmail.com • Thanks for your attention
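
The two SPARQL queries on slides 14 and 15 can be parameterized along the following lines. This is only a sketch: the slides show no PREFIX declarations, so the namespace IRIs below are the published ones for BIBO, DC Terms, DoCO and Content-in-RDF and are assumed to match what the dataset uses; the `ao:` and `aot:` namespaces are not guessed and must be supplied by the caller. The function names and the endpoint passed to `query_url` are hypothetical, and the search terms are assumed to be plain words rather than regular expressions.

```python
import re
from urllib.parse import urlencode

# Assumed namespace IRIs; the slides do not show PREFIX declarations.
PREFIXES = """\
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX doco: <http://purl.org/spar/doco/>
PREFIX cnt: <http://www.w3.org/2011/content#>
"""

# Base of the IRI that slide 15 uses for CHEBI:60004.
OBO_BASE = "http://purl.obolibrary.org/obo/"


def sparql_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_section_query(term, section_title, limit=50):
    """Slide 14: paragraphs containing `term` in sections titled like `section_title`."""
    return PREFIXES + f"""\
SELECT ?pmid ?title ?secTitle ?text
WHERE {{
  ?article a bibo:Document ;
           bibo:pmid ?pmid ;
           dcterms:title ?title .
  ?section a doco:Section ;
           dcterms:isPartOf ?article ;
           dcterms:title ?secTitle .
  FILTER (regex(str(?secTitle), "{sparql_literal(section_title)}", "i"))
  ?para a doco:Paragraph ;
        dcterms:isPartOf ?section ;
        cnt:chars ?text .
  FILTER (regex(str(?text), "{sparql_literal(term)}", "i"))
}}
LIMIT {int(limit)}
"""


def curie_to_obo_iri(curie):
    """Turn an identifier such as CHEBI:60004 into its OBO PURL form."""
    match = re.fullmatch(r"([A-Za-z]+):(\d+)", curie)
    if match is None:
        raise ValueError(f"not an OBO-style identifier: {curie!r}")
    prefix, local_id = match.groups()
    return f"{OBO_BASE}{prefix}_{local_id}"


def build_annotation_query(curie):
    """Slide 15: PMIDs of articles annotated with the given entity.

    PREFIX lines for bibo, ao and aot must be prepended by the caller.
    """
    return f"""\
SELECT DISTINCT ?pmid
WHERE {{
  ?article a bibo:AcademicArticle ;
           bibo:pmid ?pmid .
  ?annotation a aot:ExactQualifier ;
              ao:annotatesResource ?article ;
              ao:hasTopic <{curie_to_obo_iri(curie)}> .
}}
"""


def query_url(endpoint, query):
    """Build a SPARQL-protocol GET URL; `endpoint` is supplied by the caller."""
    params = urlencode({"query": query})
    return f"{endpoint}?{params}"
```

Calling `build_section_query("cancer", "introduction")` reproduces the slide 14 query with prefixes added, and `build_annotation_query("CHEBI:60004")` reproduces the body of the slide 15 query.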

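The retrieval service URLs listed on slide 17 follow a regular pattern of resource plus query parameters, which a small helper can assemble. This is a hedged sketch: the helper name `api_url` is hypothetical, only the resources and parameters that appear on the slide are covered, and nothing here checks whether the service is still online.

```python
from urllib.parse import urlencode

# Base URL and resource names as they appear on slide 17.
BASE_URL = "http://biotea.idiginfo.org/api"
RESOURCES = {"topics", "terms", "vocabularies", "articles"}


def api_url(resource, **params):
    """Build a retrieval-service URL such as .../api/articles?term=cancer."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    query = urlencode(params)  # keeps the keyword order given by the caller
    url = f"{BASE_URL}/{resource}"
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    print(api_url("topics"))
    print(api_url("terms", prefix="canc"))
    print(api_url("articles", term="cancer", count="true"))
```

The resulting string can be handed to any HTTP client; the response format is not described on the slide beyond "RDF of articles", so parsing is left out.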
Editor's Notes

  1. Not limited to open access models. Not limited to closed business models.
  2. In spite of the advances, scientific publications remain poorly connected to each other as well as to external resources. Furthermore, most of the information remains locked up in discrete documents without machine-processable content.
  3. 3rd point : As easy as building mash-ups
  4. 4th point then The paper becomes an
  5. Distribution of the first 20 journals, corresponding to about 40% of the total of 270,834. Number of biological entities identified across the papers.
  6. SPARQL queries can be used to retrieve metadata and content. It is possible to specify words and sentences that should appear in the text or in the section title.
  7. As content has been semantically enriched, it is possible to retrieve articles based on either the annotated terms, e.g., “mixture,” or their corresponding biological entities, e.g., CHEBI:60004.
  8. a) Users search for a human gene name; the term is initially resolved against GeneWiki, and the associated UniProt accession is then used in the SPARQL query. The resulting set includes publication metadata, abstract, and a cloud of annotations. b) Enriched content based on annotations is displayed in the interactive zone; this may be the annotated paragraph, a chemical entity, protein-related information, and so on.
  9. Graph-based retrieval for the term “catalase”; only shared terms with more than 30 associated biological terms are included in the results. We want to visualize the network of interconnected documents: how is a document related to another document, and what do they share?
  10. What challenges have we faced so far in the PMC RDFization case? A first challenge relates to content: tables and images are not part of the core RDF but are referenced as links, so we still have access to them. However, there are some inline tables whose rows and columns are described in XML; so far we recover the content but not the format, so we are missing information there. The supplementary material is still difficult to integrate: it could be a section, another paper, or a technical report following a totally different schema; it can be outside PMC; it can be a footnote. Also, most of the documents follow the same schema, but we have found a few with a different one, and those cannot currently be processed. Those cases account for less than 5%. As for the references, we have at least 4 different styles, so the process is different in each case, and sometimes we get references that are just plain text, so we have to do some extra processing to get the associated metadata. As for the annotators, they are web services, so they can be down or too busy. Also, the stop words are tricky: we have tried to avoid them as much as possible, but it is always possible to get some noisy terms in the annotations.
      Note: stop words are words such as “they” and “can” that the annotators skip because they are very common or very short. It is difficult to cover them all, however; we have the most common ones. Another issue is that we may lose information because some acronyms can be identical to a stop word. For example, CAN is an acronym used in CHEBI, but since it is a common word we avoid it; if a paper uses CAN in the CHEBI sense rather than as the auxiliary verb, we would be losing that information.
  11. Methods & materials: what Olga is working on. Collaborative element… still do not know, but something should be done there.
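
The stop-word problem described in note 10 (the acronym CAN being dropped because it matches the common word "can") can be illustrated with a toy filter. This is not the annotators' actual logic: the stop-word list is a made-up sample, and the case-sensitive variant is only one hypothetical mitigation, which would still misfire on text written in capitals.

```python
# Tiny illustrative stop-word list; the real annotators use their own lists.
STOP_WORDS = {"they", "can", "the", "is", "a", "in"}


def filter_case_insensitive(tokens):
    """Drop every token that matches a stop word, ignoring case."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def filter_keep_acronyms(tokens):
    """Hypothetical variant: keep all-uppercase tokens of two or more letters."""
    kept = []
    for token in tokens:
        looks_like_acronym = token.isupper() and len(token) > 1
        if looks_like_acronym or token.lower() not in STOP_WORDS:
            kept.append(token)
    return kept


if __name__ == "__main__":
    tokens = ["CAN", "can", "oxidize", "the", "catalase"]
    print(filter_case_insensitive(tokens))  # the acronym CAN is lost
    print(filter_keep_acronyms(tokens))     # the acronym CAN survives
```

The trade-off is the one the note points at: a stricter stop-word filter reduces noisy annotations but can silently discard legitimate entity mentions.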