Repurposing VIVO Data to Analyze Publications and Infer Expertise

Repurposing authoritative data
about faculty to analyze publication
output, infer expertise, and
recommend grant opportunities
Paul Albert, Don Carpenter, and Jie Lin
paa2013@med.cornell.edu
Weill Cornell Medical College

Email
cyz2123@med.cornell.edu
Phone
646-962-2551
Address
1300 York Avenue
New York, NY 10065
Other sites
Clinical profile

Where is the ongoing
motivation to keep these
profiles current?

“Researchers use profile
systems to find collaborators.”
A widely invoked “fact” about VIVO
(also an old Russian proverb)

How can VIVO data address
pressing needs in order to
strengthen its viability?

Pressing needs
1. Administrators want reports.
2. Both administrators and researchers
want to know about funding
opportunities.

“Invention is 1% inspiration and
(due to rounding error) 98%
perspiration.”
Thomas A. Edison
Source: Yahoo Answers

Administrators are avid
consumers of institutional data.

Proposed question #1
Publications appearing in journals of
a given impact factor

In any given year, which paper has
the most incoming citations?

Which papers that have received
federal funding are not deposited in
PubMed Central?

Which clinical departments tend to
publish the most?

What articles have faculty published
in the last month in which they were
first or last author?

Institutional publication
reporting: choose two*
• High quality disambiguation (>90% accuracy)
• Minimal delay between review and inclusion
in the reporting system
• Tool is simple enough to allow anyone to use
* Or one

Sample SPARQL query
SELECT distinct ?Person1_firstName ?Person1_lastName ?Person1_primaryEmail
       ?AcademicArticle1_label ?Journal1_label ?AcademicArticle1_pmid
       ?DateTimeValue1_dateTime
WHERE {
  ?AcademicArticle1 rdf:type bibo:Document .
  ?AcademicArticle1 bibo:pmid ?AcademicArticle1_pmid .
  ?AcademicArticle1 vivo:dateTimeValue ?DateTimeValue1 .
  ?AcademicArticle1 vivo:informationResourceSupportedBy ?FundingOrganization1 .
  ?DateTimeValue1 rdf:type vivo:DateTimeValue .
  ?DateTimeValue1 vivo:dateTime ?DateTimeValue1_dateTime .
  ?FundingOrganization1 rdf:type vivo:FundingOrganization .
  ?FundingOrganization1 rdfs:label ?FundingOrganization1_label .
  ?AcademicArticle1 rdfs:label ?AcademicArticle1_label .
  ?AcademicArticle1 vivo:hasPublicationVenue ?Journal1 .
  ?Journal1 rdf:type bibo:Journal .
  ?Journal1 rdfs:label ?Journal1_label .
  ?AcademicArticle1 vivo:informationResourceInAuthorship ?Authorship1 .
  ?Authorship1 rdf:type vivo:Authorship .
  ?Authorship1 vivo:linkedAuthor ?Person1 .
  ?Person1 rdf:type foaf:Person .
  ?Person1 vivo:primaryEmail ?Person1_primaryEmail .
  ?Person1 wcmc:cwid ?Person1_cwid .
  ?Person1 foaf:firstName ?Person1_firstName .
  ?Person1 foaf:lastName ?Person1_lastName .
  FILTER REGEX (str(?FundingOrganization1_label), 'N.I.H.', 'i')
  FILTER NOT EXISTS { ?AcademicArticle1 vivo:pmcid ?AcademicArticle1_pmcid . }
  FILTER (xsd:dateTime(?DateTimeValue1_dateTime) > "2008-04-01T00:00:00"^^xsd:dateTime)
  FILTER (xsd:dateTime(?DateTimeValue1_dateTime) < "2012-12-01T00:00:00"^^xsd:dateTime)
}
ORDER BY ?Person1_lastName

SELECT distinct ?Person1_firstName ?Person1_lastName ?Person1_primaryEmail
       ?AcademicArticle1_label ?Journal1_label ?AcademicArticle1_pmid
       ?DateTimeValue1_dateTime
WHERE {
  ?AcademicArticle1 rdf:type bibo:Document .
  ?AcademicArticle1 vivo:dateTimeValue ?DateTimeValue1 .
  ?AcademicArticle1 vivo:informationResourceSupportedBy ?FundingOrganization1 .
  ?DateTimeValue1 rdf:type vivo:DateTimeValue .
  ?DateTimeValue1 vivo:dateTime ?DateTimeValue1_dateTime .
  ?FundingOrganization1 rdf:type vivo:FundingOrganization .
  ?FundingOrganization1 rdfs:label ?FundingOrganization1_label .
  ?AcademicArticle1 rdfs:label ?AcademicArticle1_label .
  ?AcademicArticle1 vivo:hasPublicationVenue ?Journal1 .
  ?Journal1 rdf:type bibo:Journal .
  ?Journal1 rdfs:label ?Journal1_label .
  ?AcademicArticle1 vivo:informationResourceInAuthorship ?Authorship1 .
  ?Person1 vivo:primaryEmail ?Person1_primaryEmail .
  ?Person1 foaf:firstName ?Person1_firstName .
  ?Person1 foaf:lastName ?Person1_lastName .
  FILTER REGEX (str(?FundingOrganization1_label), 'N.I.H.', 'i')
  FILTER NOT EXISTS { ?AcademicArticle1 vivo:pmcid ?AcademicArticle1_pmcid . }
  FILTER (xsd:dateTime(?DateTimeValue1_dateTime) > "2008-04-01T00:00:00"^^xsd:dateTime)
  FILTER (xsd:dateTime(?DateTimeValue1_dateTime) < "2012-12-01T00:00:00"^^xsd:dateTime)
}
ORDER BY ?Person1_lastName
+
VIVO
Dashboard

VIVO Dashboard: a tool for easily
running sophisticated reports
Don Carpenter
dwc92@cornell.edu
Cornell University

Prime directive of VIVO
Dashboard
Empower untrained users to run
sophisticated semantic queries on Weill
Cornell faculty publications
* Secondary directive: kill Sarah Connor

Sample SPARQL query
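SELECT distinct ?Article1_pmid ?Person1_cwid ?Authorship1_authorRank
WHERE {
  ?Article1 rdf:type bibo:Document .
  ?Article1 vivo:informationResourceInAuthorship ?Authorship1 .
  ?Article1 bibo:pmid ?Article1_pmid .
  ?Authorship1 vivo:authorRank ?Authorship1_authorRank .
  # Bind the author's CWID, which the SELECT clause expects
  ?Authorship1 vivo:linkedAuthor ?Person1 .
  ?Person1 wcmc:cwid ?Person1_cwid .
}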
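
As an illustration of how rows from a query like this could answer the earlier "first or last author" question, here is a minimal Python sketch. It is not taken from the VIVO Dashboard code. The function name is hypothetical, and it assumes that author ranks start at 1 and that every authorship of an article (not only those of faculty) is present in the rows; if only some authorships are loaded, the highest rank seen may not be the true last author.

```python
from collections import defaultdict


def first_or_last_authorships(rows):
    """Return the (pmid, cwid) pairs where the person is first or last author.

    rows: iterable of (pmid, cwid, author_rank) tuples, shaped like the
    output of the query above (hypothetical helper, not part of VIVO).
    """
    rows = list(rows)
    # Highest rank seen per article; assumed to mark the last author.
    max_rank = defaultdict(int)
    for pmid, _cwid, rank in rows:
        max_rank[pmid] = max(max_rank[pmid], int(rank))
    return {
        (pmid, cwid)
        for pmid, cwid, rank in rows
        if int(rank) == 1 or int(rank) == max_rank[pmid]
    }
```

Middle authors are dropped, so a three-author paper contributes only its rank 1 and rank 3 authorships.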

Demo

Workflow
• On a one-time basis, set up the fields in the Drupal
admin.
• On a weekly basis, execute a set of SPARQL
queries against VIVO’s semantic endpoint.
• Import the resulting .csv files into Drupal.
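
The weekly export step could look roughly like the Python sketch below. It is an assumption-laden illustration, not the project's actual script: the endpoint URL is whatever your VIVO instance exposes, the function names are hypothetical, and some endpoints need authentication parameters that are not shown. It uses only the standard SPARQL protocol (a form-encoded `query` parameter) and the standard SPARQL JSON results format.

```python
import csv
import io
import json
import urllib.parse
import urllib.request


def run_select(endpoint_url, query):
    """POST a SELECT query and return the parsed SPARQL JSON results.

    endpoint_url is an assumption: supply your own instance's endpoint.
    """
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def results_to_csv(results):
    """Flatten SPARQL JSON results into CSV text, one column per variable."""
    columns = results["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for binding in results["results"]["bindings"]:
        # Unbound variables are simply absent from a binding; emit "".
        writer.writerow([binding.get(c, {}).get("value", "") for c in columns])
    return buffer.getvalue()
```

Writing the output of `results_to_csv(run_select(url, query))` to a file once a week would produce the .csv that the Drupal import step consumes.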

Technology Stack
• Drupal 7.x
• Stores content using the robust indexing application,
Apache Solr
• AJAX
• Key modules
- Apache Solr
- Facet API
- Facet API graphs
- D3.js (visualization library)
- Charts and graphs
- VIVO Dashboard (custom module)

Performance
• A previous version using MySQL queries took >10
seconds to load
• Completely rewriting the application in Solr
allows us to store X publications
• Performance is now < 5 seconds

Future Work
• Enlist the talents of other Drupal developers
• Release this project as open source code
• Create a visualization for global health expertise

[Screenshot: VIVO Dashboard “Publications” view, listing publications by active Weill Cornell Medical College faculty as represented in VIVO. Views: Graph, List, Export. Facets: Publication Type (Research Article 657, In Process 55, Review 45, Clinical Guideline 32, more...), Author Name, Journal ranking (15.4 - 68.3), Date (2009 - Present), Journal Name.]

Repurposing authoritative
semantic data to infer expertise
and recommend grant
opportunities
Jie Lin
jie265@gmail.com
Cornell University

Pressing needs
1. Researchers, development officers, and funding
agencies frequently complain that the process of
learning about grant opportunities is inefficient.
2. As a project manager for VIVO, I want to
accurately include researchers' fields and
expertise.

Maybe the needs of grant
recommendations and expertise
can be addressed... together.

Our intended workflow
1. Gather information about people and grant notices.
2. Algorithmically make personalized recommendations
of grant opportunities. (Hard.)
3. In exchange for the promise of higher quality
recommendations, we get busy researchers to
provide us feedback on our initial inferences about
expertise.
4. Use expertise data in VIVO.

Sources for people (source: example)
• Clinical expertise and board certifications at WeillCornell.org: clinical pathology
• Medical Subject Headings (MeSH) in published papers: anti-bacterial agents
• Personal statement: “... I’ve always enjoyed medical education...”
• Keywords for NIH grants: information system analysis
• CFDA labels for NIH grants: 93.821 – Lung diseases research
• Spending categories for NIH grants: neurosciences
• ClinicalTrials.gov keywords and system-inferred MeSH: violence research
• Global health expertise in Researcher Profile System: Egypt
• NCCR category as asserted by CTSC staff: Developmental and Child Psychology

Sources for grant opportunities
• ScanGrants
• Grants.gov
After global pre-filtration n = ~1,200

Concept ranking
• Term Frequency-Inverse Document Frequency (TF-IDF):
reward terms for showing up often in a person’s list of
terms and penalize terms that also appear in many
other people’s lists.
• Result: no one is an expert on “humans”
• No algorithm is perfect, so we allow faculty to
provide feedback on the controlled terms we have
inferred for them.
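
The ranking idea above can be sketched in Python as follows. This is a textbook TF-IDF variant, not necessarily the exact weighting used in the project: each person's list of terms is treated as one "document", and the function name and input shape are assumptions made for illustration.

```python
import math
from collections import Counter


def rank_concepts(terms_by_person):
    """Rank each person's terms by TF-IDF.

    terms_by_person: dict mapping a person id to a list of terms
    (repeats allowed). Returns a dict mapping each person id to a list
    of (term, score) pairs, highest score first.
    """
    n_people = len(terms_by_person)
    # Document frequency: in how many people's lists does a term appear?
    doc_freq = Counter()
    for terms in terms_by_person.values():
        doc_freq.update(set(terms))

    ranked = {}
    for person, terms in terms_by_person.items():
        counts = Counter(terms)
        total = sum(counts.values())
        scores = {
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_people / doc_freq[term])
            for term, count in counts.items()
        }
        ranked[person] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked
```

A term that appears in everyone's list gets an inverse document frequency of log(1) = 0, which is why a ubiquitous term such as “humans” never ranks as anyone's expertise.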

Mapping concepts to fields
• Objective of using a limited number of fields is to
increase overlap between people and grants
• 149 (somewhat arbitrarily) defined fields
• Fields represent eight different lists of fields (Map of
Science, ScanGrants, ABMS specialties...)
• Take concepts and fields and do a co-occurrence search
in MEDLINE.
• For example, after weighting by size of field, how often
does “Natural Language Processing” occur in
conjunction with immunology; medical informatics;
urology...?
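
One plausible reading of "weighting by size of field" is sketched below in Python. The slides do not give the actual formula, so this is an assumption: the co-occurrence count for each field is divided by the total number of MEDLINE records for that field, and the resulting rates are normalized to sum to 1. The function name and input shapes are hypothetical.

```python
def field_weights(cooccurrence_counts, field_sizes):
    """Map one concept onto fields using size-weighted co-occurrence.

    cooccurrence_counts: dict of field -> number of records that mention
        both the concept and the field.
    field_sizes: dict of field -> total number of records for the field.
    Returns a dict of field -> weight, normalized to sum to 1. Fields
    with no co-occurrence are omitted; the result is empty if none remain.
    """
    rates = {}
    for field, count in cooccurrence_counts.items():
        size = field_sizes.get(field, 0)
        if size > 0 and count > 0:
            # Dividing by field size keeps very large fields from
            # dominating purely because they contain more records.
            rates[field] = count / size
    total = sum(rates.values())
    return {field: rate / total for field, rate in rates.items()}
```

Under this reading, a concept that co-occurs 50 times with a small field outweighs one that co-occurs 50 times with a field ten times larger.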

The math for mapping people
and grants to fields

Promise of co-occurrence
searching
Suppose a researcher is working almost exclusively on
autoimmune disease and is highly ranked for the
concept, “apoptosis.”
Apoptosis also frequently co-occurs in MEDLINE with
oncology. Therefore, we can predict her interest in an
oncology grant.

The downside of co-occurrence
searching

Match people to grants
• Not yet done, but early testing is promising.
• The idea is to use cosine similarity to define how
similar any person-grant combination is to any
other person-grant combination
• Then you can rank those connections by people
or by grant.
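
A minimal Python sketch of the matching idea follows. The slides describe this step as not yet done, so everything here is an assumption: each person and each grant is represented as a vector of field weights (a dict of field -> weight), each person-grant pair is scored with cosine similarity, and the scores are ranked. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as field -> weight dicts."""
    dot = sum(weight * b.get(field, 0.0) for field, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty vector matches nothing.
    return dot / (norm_a * norm_b)


def rank_grants_for_person(person_vector, grant_vectors):
    """Return (grant_id, score) pairs for one person, best match first.

    grant_vectors: dict of grant id -> field-weight dict.
    """
    scored = [
        (grant_id, cosine_similarity(person_vector, vector))
        for grant_id, vector in grant_vectors.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Swapping the roles of the arguments (one grant vector against a dict of person vectors) gives the ranking "by grant" rather than "by people".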

Utility for Development Office
• Suppose Dr. Lamon and the Development Office
want to identify candidates to apply for a
particular grant.
• He can get a ranked list of the candidates
best suited to this
opportunity.

Demonstration
