Opportunities and challenges
presented by Wikidata in the
context of biocuration
Benjamin Good
BioCreative, Corvallis, Oregon, 2016
@bgood
bgood@scripps.edu
http://www.slideshare.net/goodb
Road Map
• Introduction to Wikidata
• Wikidata and biocuration
• Wikidata and BioCreative
Wikidata is to data
as Wikipedia is to text
“Giving more people more access to more knowledge”
A free and open repository of knowledge
• Initiated by Wikimedia Germany
• In transition to the Wikimedia Foundation
• Not a grant funded ‘project’… as stable as Wikipedia
It’s a knowledge
base!
• Anyone can edit
(human or robot)
• Anyone can use
(CC0)
Elements of the kb are called ‘items’
https://www.wikidata.org/wiki/Q146
Items are unique concepts,
used to link different language
Wikipedias together
Q146
Af:Kat
En:cat
Als:Hauskatze
Ang:Catte
Av:Keto
Items are described by “statements” that link
together to form the language-independent
Wikidata knowledge graph
Cat → subclass of → Domesticated animal
Domesticated animal → subclass of → Animal
Animal → taxon name → Animalia
Animal → taxon rank → Kingdom
Item: Q84
Item: Q414043
RELN
Genomic start: 103471784
GenLoc assembly:
GRCh38
Stated in:
Ensembl Release 83
Retrieved:
19 January 2016
Value (numeric)
Property
Claim Qualifiers
References
https://www.wikidata.org/wiki/Q414043
Statement
Genomic position for Reelin gene
Item: Q414043
RELN
Encodes: Reelin (protein) Stated in:
NCBI homo sapiens
annotation release 107
Retrieved:
19 January 2016
Value (item)
Property
Claim Qualifiers
References
https://www.wikidata.org/wiki/Q414043
Statement
Linking the Reelin gene to a protein it encodes
Item: Q13561329
Reelin
Cell component: dendrite
Determination method:
• ISS (Sequence or structural
Similarity)
• IEA (Electronic annotation)
Stated in:
UniProt
Retrieved:
21 March 2016
Value (item)
Property
Claim Qualifiers
References
https://www.wikidata.org/wiki/Q13561329
Statement
Gene ontology annotation for Reelin protein
with evidence codes modeled as qualifiers
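The claim, qualifier and reference layers shown on these slides can be sketched as a small data structure. This is only an illustration of the shape of a statement, not Wikidata's actual storage format; the class and field names (`Statement`, `Reference`, `has_evidence`) are assumptions made for the example, while the values come from the Reelin slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Reference:
    stated_in: str   # source database or release
    retrieved: str   # retrieval date as shown on the slide


@dataclass
class Statement:
    item: str                      # subject item ID, e.g. "Q13561329"
    prop: str                      # property label
    value: str                     # an item label or a literal
    qualifiers: Dict[str, List[str]] = field(default_factory=dict)
    references: List[Reference] = field(default_factory=list)


# The Gene Ontology annotation for the Reelin protein, as shown on the slide.
reelin_location = Statement(
    item="Q13561329",
    prop="cell component",
    value="dendrite",
    qualifiers={"determination method": ["ISS", "IEA"]},
    references=[Reference(stated_in="UniProt", retrieved="21 March 2016")],
)


def has_evidence(statement: Statement, code: str) -> bool:
    """Return True if the statement carries the given evidence code."""
    return code in statement.qualifiers.get("determination method", [])


print(has_evidence(reelin_location, "ISS"))
```

Keeping evidence codes as qualifiers means one claim can carry several of them without being duplicated.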
graphical view
RELN
Reelin
encodes
dendrite
cellular component
claim
ISS IEA
Determination method:
qualifiers
UniProt
stated in
retrieved
21 March
2016
References
Statement
Inter-item links form a giant knowledge graph
Everything is connected
Reelin, Heart disease,
Barack Obama,
everything..
https://query.wikidata.org
SPARQL endpoint for Wikidata
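A minimal sketch of talking to the endpoint from a program: build a GET URL for a query, then read values out of the standard SPARQL JSON results layout. The `/sparql` path and the `format=json` parameter are assumptions not shown on the slide, the helper names are hypothetical, and the sample response is hand-written so the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Assumed query path on the host named on the slide.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(sparql: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


def extract_values(response_text: str, variable: str) -> list:
    """Pull one variable's values out of a SPARQL JSON results document."""
    bindings = json.loads(response_text)["results"]["bindings"]
    return [row[variable]["value"] for row in bindings if variable in row]


# A tiny hand-written response in the SPARQL JSON results layout.
sample_response = json.dumps({
    "head": {"vars": ["item"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q414043"}}
    ]},
})

url = build_query_url("SELECT ?item WHERE { ?item ?p ?o } LIMIT 1")
print(url)
print(extract_values(sample_response, "item"))
```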
Sample of current biomedical content
• All human and mouse genes and proteins (Swiss-Prot)
• All Gene Ontology terms
• All Human Disease Ontology terms
• All FDA approved drugs
• 109 reference microbial genomes
Burgstaller-Muehlbacher et al. (2016) Database
Mitraka et al. (2015) Semantic Web Applications for the Life Sciences
Putman et al. (2016) Database
http://tinyurl.com/biowiki-sparql
Sample queries that are currently possible:
• “GO cellular localization annotations for Reelin with
evidence code ISS”
• “Diseases treated by Metformin”
• “Diseases that might be treated by Metformin”
http://query.wikidata.org
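As a rough illustration, the “Diseases treated by Metformin” question might be written as the query text below. Two things are assumptions rather than content from the slides: the property ID `P2175` (believed to be “medical condition treated”) and the exact English label match on "metformin". The function name is hypothetical, and the label is inserted without escaping, so this is a sketch only.

```python
def diseases_treated_by(drug_label: str, treats_property: str = "P2175") -> str:
    """Build a SPARQL query listing conditions a drug is recorded as treating.

    The default property ID is an assumption; check it on Wikidata before use.
    """
    return f"""
SELECT ?disease ?diseaseLabel WHERE {{
  ?drug rdfs:label "{drug_label}"@en .
  ?drug wdt:{treats_property} ?disease .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


query = diseases_treated_by("metformin")
print(query)
```

The printed text can be pasted into the query service's web form.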
Example question: repurposing Metformin
http://tinyurl.com/zem3oxz
Metformin → interacts with → protein: Solute carrier family 22 member 3
Solute carrier family 22 member 3 → encoded by → gene: SLC22A3
SLC22A3 → genetic association → ?disease: prostate cancer
Metformin → might treat? → ?disease
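The reasoning behind the repurposing question is a three-hop walk through the graph: drug to protein, protein to gene, gene to disease. A minimal in-memory sketch of that walk follows, using only the three edges drawn on the slide; the function names and the triple layout are assumptions for illustration, and a real query would run against the SPARQL endpoint instead.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# The three edges drawn on the slide.
TRIPLES: List[Triple] = [
    ("Metformin", "interacts with", "Solute carrier family 22 member 3"),
    ("Solute carrier family 22 member 3", "encoded by", "SLC22A3"),
    ("SLC22A3", "genetic association", "prostate cancer"),
]


def objects(subject: str, predicate: str, triples: List[Triple]) -> List[str]:
    """Return every object linked to the subject by the predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def might_treat(drug: str, triples: List[Triple]) -> List[str]:
    """Follow drug -> protein -> gene -> disease and collect the diseases."""
    candidates = []
    for protein in objects(drug, "interacts with", triples):
        for gene in objects(protein, "encoded by", triples):
            candidates.extend(objects(gene, "genetic association", triples))
    return candidates


print(might_treat("Metformin", TRIPLES))  # ['prostate cancer']
```

A path like this only suggests a hypothesis; it does not show that the drug treats the disease.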
Road Map
• Introduction to Wikidata
• Wikidata and biocuration
• Wikidata and BioCreative
The dominant paradigm for open biocuration
[Diagram: several separate “Your Database” boxes, each exposing its own API and flat files, linked to one another by xrefs]
Pain points
• API or flatfile parsing
• Ambiguous or non-existent xrefs
• Persistence of funding
• Too much information to curate
My Web
Application
My Database
My Database Curators
My Research Grants
$
Biomedical
knowledge
A new paradigm for open biocuration?
My
Application
Our Database?
Our Database Curators
And our community
Biomedical
knowledge
My
Application
My
Application
My Research Grants
$
Reducing the pain
• Reduces API/parser proliferation
• Forces up-front integration
• Facilitates coordination
• Ensures that if funding is lost,
data is not
• Invites community input
A new platform for open biocuration?
My
Application
Our Database Curators
And our community
Biomedical
knowledge
My
Application
My
Application
My Research Grants
$
• SPARQL = a common
API for accessing
content
• 1 endpoint to
maintain…
• It's working
The first application built on Wikidata is Wikipedia
Our Database Curators
And our community
Biomedical
knowledge
Our
Applications
Our
Applications
Su, Schriml, Pavlidis R01 Grant…
$
Deeply integrated,
(incredible SEO)
Application #1
Burgstaller et al. (2016)
Impact of Wikidata on Wikipedia
Gene Wiki
Version 1.
{{GNF_Protein_box | Name = Reelin| image = |
image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 |
MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 |
IUPHAR = | ChEMBL = | OMIM = None | ECnumber = |
Homologene = 9349 | GeneAtlas_image1 = |
GeneAtlas_image2 = | GeneAtlas_image3 = |
Protein_domain_image = | Function =
{{GNF_GO|id=GO:0005515 |text = protein binding}}
{{GNF_GO|id=GO:0016787 |text = hydrolase activity}}
{{GNF_GO|id=GO:0046872 |text = metal ion binding}} |
Component = {{GNF_GO|id=GO:0005739 |text =
mitochondrion}} | Process = {{GNF_GO|id=GO:0008152
|text = metabolic process}} | Hs_EntrezGene = 51110 |
Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA =
NM_016027 | Hs_RefseqProtein = NP_057111 |
Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 |
Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174
| Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 |
Mm_Ensembl = ENSMUSG00000025937 |
Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein =
NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr =
1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end =
13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}}
=
Gene Wiki
Version 2.
{{Infobox gene}}
• All data in
Wikidata
• 1 Lua script works
for all genes
=
(1 of these for every gene)
Wikidata use increasing on Wikipedia
• https://en.wikipedia.org/wiki/Category:Templates_using_data_from_Wikidata
• 81 templates indicate that they use it
Application #2. Centralized Model Organism Database (CMOD)
http://sulab.scripps.edu/CMOD/
The next application built on Wikidata: yours?
Our community
Biomedical
knowledge
CMOD
????
Road Map
• Introduction to Wikidata
• Wikidata and biocuration
• Wikidata and BioCreative
Challenges
• Community ontology building
• Establishing computable trust
• Expanding the knowledge base
“Dogs and cats living together!
Mass hysteria!”
(leave that for ICBO)
BioCreative
Challenges?
‘Statements’ on Wikidata
[Chart: number of statements on Wikidata from 2013 to 2016, with axis marks at 20M, 60M and 100M statements; labels: Good, Bad, Ugly]
https://tools.wmflabs.org/wikidata-todo/stats.php
Computable trust
RELN
Genomic start: 103471784
GenLoc assembly:
GRCh38
Claim
Add References
1. Add references
2. Check whether the references concur
with the claim
3. Estimate ‘truthiness’ of claim
4. Provide humans with
sources to follow up.
• References can come from databases,
articles in PubMed, etc.
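One naive way to picture steps 2 and 3 is to compare the claimed value with the value each reference reports and take the fraction that agree. The slides do not define how ‘truthiness’ should be estimated, so this scoring rule is purely an assumption, and the source values below are hypothetical.

```python
from typing import List, Optional


def agreement_score(claim_value: str, reference_values: List[str]) -> Optional[float]:
    """Fraction of references whose value matches the claim.

    Returns None when there are no references, so "unreferenced" is not
    confused with "contradicted".
    """
    if not reference_values:
        return None
    matches = sum(1 for value in reference_values if value == claim_value)
    return matches / len(reference_values)


# Claim from the slide: RELN genomic start on GRCh38.
claim = "103471784"
# Hypothetical values reported by three sources; the last one disagrees.
sources = ["103471784", "103471784", "103471000"]

score = agreement_score(claim, sources)
print(score)
```

A real system would also need to weigh how reliable each source is and keep the sources attached so a human can follow up.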
Bad
Ugly
Good
Expanding the knowledge base
RELN
? ?
?
New Claims
• Given external knowledge
source (text or database)
• Create claims and
references automatically
with very high precision
• Allow for human verification
PMID: 77901
PMID: 523070
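The “very high precision, with human verification” idea could be sketched as a simple routing step: candidates above a confidence threshold are accepted automatically and the rest are queued for review. The threshold, the `CandidateClaim` fields, the claim text and the scores are all hypothetical placeholders; only the gene name and the PMIDs appear on the slide.

```python
from typing import List, NamedTuple, Tuple


class CandidateClaim(NamedTuple):
    subject: str
    text: str          # placeholder description of the extracted claim
    source: str        # where the claim was extracted from
    confidence: float  # extractor's own score between 0 and 1


def route(candidates: List[CandidateClaim], threshold: float = 0.95
          ) -> Tuple[List[CandidateClaim], List[CandidateClaim]]:
    """Split candidates into (auto-accept, needs human review)."""
    accepted = [c for c in candidates if c.confidence >= threshold]
    review = [c for c in candidates if c.confidence < threshold]
    return accepted, review


# Hypothetical extractor output; the claim text and scores are placeholders.
candidates = [
    CandidateClaim("RELN", "placeholder claim A", "PMID: 77901", 0.99),
    CandidateClaim("RELN", "placeholder claim B", "PMID: 523070", 0.60),
]

accepted, review = route(candidates)
print(len(accepted), len(review))
```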
Unique characteristics of Wikidata with regard to
information extraction (IE) tasks
• 16,000+ ‘active’ editors and growing
• Could be a powerful crowdsourcing resource
• Must be kept involved or will block progress
• Constrained data model and some limits on content type
• CC0 requirement
One known attempt: “StrepHit”
• Individual Engagement Grant (IEG) from the Wikimedia Foundation
(30k, Start Jan. 2016)
• Goal to:
• “Generate trust and reliability over Wikidata content”
• “Alleviate the burden of manual curation” (sounds familiar, right?)
• Ended up working on Biographical and Soccer data…
StrepHit NLP pipeline:
https://github.com/Wikidata/StrepHit
Text corpus → Claim extractor → “Primary sources” tool on Wikidata
‘Primary Sources’: an optional userscript that
Wikidata users can install.
Approving a suggested reference for the claim
that came from German Wikipedia
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
The Wikidata Game(s) (= microtasks…)
https://tools.wmflabs.org/wikidata-game/
https://tools.wmflabs.org/wikidata-game/distributed/
Code for making your own!
StrepHit, Primary Sources, Wikidata games…
All works in progress
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal
BioCreative Challenges with Wikidata ?
Our Database Curators
And our community
Biomedical
knowledge
????
?
Acknowledgements
Gene Wikidata Team
Andra Waagmeester (Micelio)
Sebastian Burgstaller (Scripps)
Tim Putman (Scripps)
Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Andrew Su (Scripps)
Ginger Tsueng (Scripps)
Contact
bgood@scripps.edu
@bgood on twitter
Adapted logo
Su Laboratory at TSRI
The 16,950 other active editors of
Wikidata, and especially the 693 that
joined last month, the 809 that
joined the month before that and
the 721 that joined the month
before that.
This work was supported by the US National Institutes of Health
(grants GM089820 and U54GM114833) and by the Scripps
Translational Science Institute with an NIH-NCATS Clinical and
Translational Science Award (CTSA; 5 UL1 TR001114).
Social controls
• Anyone can
• Add or edit labels, descriptions, statements, references etc. on existing items
• Create new items
• Link items to Wikipedia articles
• Query using https://query.wikidata.org
• Read and write small numbers of edits with
https://www.wikidata.org/w/api.php
• Propose a new property
• Request a bot account for high-volume automated editing
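For the small-scale read access mentioned above, a request to the API can be sketched as a URL built from a few parameters. The `wbgetentities` action and the parameter choices are assumptions about typical usage rather than something shown on the slides, and the helper name is hypothetical; the item ID is the RELN gene from earlier in the deck.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def entity_url(item_id: str, languages: str = "en") -> str:
    """Build a wbgetentities request URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "labels|claims",
        "languages": languages,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


print(entity_url("Q414043"))
```

Writing edits needs authentication and an edit token, and high-volume editing goes through the bot approval process described on the later slides.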
Here be dragons..
Properties (as of April 10, 2016)
• 2196 active properties
• 114 new properties that have been proposed but not yet approved
Proposal
https://www.wikidata.org/wiki/Wikidata:Property_proposal
After proposal, community discussion
• Each property is left open
for discussion by anyone
until an administrator or other
person blessed with the
power either creates it or
decides not to create it,
based on the discussion
• People that enjoy ontology
arguments needed here!
Lengthy (cut-off) discussion of proposal for ‘extinct’ property
https://www.wikidata.org/wiki/Wikidata:Property_proposal/
Property proposal on Wikidata
Proposal
Community discussion
Bot accounts
• https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot
• Same basic process.
Proposal discussions
• Cannot be avoided
• The discussions are long and tiring but important
• Many of the people involved are quite experienced
• All are trying to make something great
• Persistence and patience required
SPARQL graph..
http://tinyurl.com/biowiki-sparql

Contenu connexe

Tendances

Semantic approaches for biomedical knowledge discovery - Discovery Science 20...
Semantic approaches for biomedical knowledge discovery - Discovery Science 20...Semantic approaches for biomedical knowledge discovery - Discovery Science 20...
Semantic approaches for biomedical knowledge discovery - Discovery Science 20...
Michel Dumontier
 
Model Organism Linked Data
Model Organism Linked DataModel Organism Linked Data
Model Organism Linked Data
Michel Dumontier
 

Tendances (20)

Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental MetadataMaking it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
 
Link Analysis of Life Sciences Linked Data
Link Analysis of Life Sciences Linked DataLink Analysis of Life Sciences Linked Data
Link Analysis of Life Sciences Linked Data
 
Semantic approaches for biomedical knowledge discovery - Discovery Science 20...
Semantic approaches for biomedical knowledge discovery - Discovery Science 20...Semantic approaches for biomedical knowledge discovery - Discovery Science 20...
Semantic approaches for biomedical knowledge discovery - Discovery Science 20...
 
W3C HCLS Dataset Description Guidelines
W3C HCLS Dataset Description GuidelinesW3C HCLS Dataset Description Guidelines
W3C HCLS Dataset Description Guidelines
 
2016 bmdid-mappings
2016 bmdid-mappings2016 bmdid-mappings
2016 bmdid-mappings
 
Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration ...
Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration ...Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration ...
Nicole Nogoy's talk at eResearchNZ 2014: Improving data sharing, integration ...
 
Model Organism Linked Data
Model Organism Linked DataModel Organism Linked Data
Model Organism Linked Data
 
The expansive reach of ChemSpider as a resource for the chemistry community
The expansive reach of ChemSpider as a resource for the chemistry communityThe expansive reach of ChemSpider as a resource for the chemistry community
The expansive reach of ChemSpider as a resource for the chemistry community
 
Data Citation: A Critical Role for Publishers
Data Citation: A Critical Role for PublishersData Citation: A Critical Role for Publishers
Data Citation: A Critical Role for Publishers
 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literature
 
Better Data for a Better World
Better Data for a Better WorldBetter Data for a Better World
Better Data for a Better World
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
GigaScience: a new resource for the big-data community.
GigaScience: a new resource for the big-data community.GigaScience: a new resource for the big-data community.
GigaScience: a new resource for the big-data community.
 
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
SciDataCon 2014 Data Papers and their applications workshop - NPG Scientific ...
 
Can machines understand the scientific literature?
Can machines understand the scientific literature?Can machines understand the scientific literature?
Can machines understand the scientific literature?
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
High throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHHigh throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIH
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literatureAutomatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 

En vedette

Integrating NLP using Linked Data
Integrating NLP using Linked DataIntegrating NLP using Linked Data
Integrating NLP using Linked Data
Sebastian Hellmann
 
Light steel villa catalogue log
Light steel villa catalogue logLight steel villa catalogue log
Light steel villa catalogue log
eishimachinery
 
Gene Wiki at Phenotype RCN annual meeting
Gene Wiki at Phenotype RCN annual meetingGene Wiki at Phenotype RCN annual meeting
Gene Wiki at Phenotype RCN annual meeting
Benjamin Good
 
Short update on The Cure game first week
Short update on The Cure game first weekShort update on The Cure game first week
Short update on The Cure game first week
Benjamin Good
 
Microtask crowdsourcing for disease mention annotation in PubMed abstracts
Microtask crowdsourcing for disease mention annotation in PubMed abstractsMicrotask crowdsourcing for disease mention annotation in PubMed abstracts
Microtask crowdsourcing for disease mention annotation in PubMed abstracts
Benjamin Good
 
Resume 2009 Compatible V2 1
Resume 2009 Compatible V2 1 Resume 2009 Compatible V2 1
Resume 2009 Compatible V2 1
schelby
 

En vedette (20)

Integrating NLP using Linked Data
Integrating NLP using Linked DataIntegrating NLP using Linked Data
Integrating NLP using Linked Data
 
Light steel villa catalogue log
Light steel villa catalogue logLight steel villa catalogue log
Light steel villa catalogue log
 
The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Surviva...
The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Surviva...The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Surviva...
The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Surviva...
 
Gene Wiki at Phenotype RCN annual meeting
Gene Wiki at Phenotype RCN annual meetingGene Wiki at Phenotype RCN annual meeting
Gene Wiki at Phenotype RCN annual meeting
 
Buyer Remorse
Buyer RemorseBuyer Remorse
Buyer Remorse
 
Oslo Solr MeetUp March 2012 - Solr4 alpha
Oslo Solr MeetUp March 2012 - Solr4 alphaOslo Solr MeetUp March 2012 - Solr4 alpha
Oslo Solr MeetUp March 2012 - Solr4 alpha
 
2016 mem good
2016 mem good2016 mem good
2016 mem good
 
IMSafer Angel Round
IMSafer Angel RoundIMSafer Angel Round
IMSafer Angel Round
 
2015 6 bd2k_biobranch_knowbio
2015 6 bd2k_biobranch_knowbio2015 6 bd2k_biobranch_knowbio
2015 6 bd2k_biobranch_knowbio
 
The National Society For The Protection Of Hmmm
The National Society For The Protection Of HmmmThe National Society For The Protection Of Hmmm
The National Society For The Protection Of Hmmm
 
Gene wiki jamboree
Gene wiki jamboreeGene wiki jamboree
Gene wiki jamboree
 
Bio Logical Mass Collaboration3
Bio Logical Mass Collaboration3Bio Logical Mass Collaboration3
Bio Logical Mass Collaboration3
 
Short update on The Cure game first week
Short update on The Cure game first weekShort update on The Cure game first week
Short update on The Cure game first week
 
Welcome to Ukraine - SunCity Travel LLC
Welcome to Ukraine - SunCity Travel LLCWelcome to Ukraine - SunCity Travel LLC
Welcome to Ukraine - SunCity Travel LLC
 
Microtask crowdsourcing for disease mention annotation in PubMed abstracts
Microtask crowdsourcing for disease mention annotation in PubMed abstractsMicrotask crowdsourcing for disease mention annotation in PubMed abstracts
Microtask crowdsourcing for disease mention annotation in PubMed abstracts
 
Resume 2009 Compatible V2 1
Resume 2009 Compatible V2 1 Resume 2009 Compatible V2 1
Resume 2009 Compatible V2 1
 
Channeling Collaborative Spirit
Channeling Collaborative SpiritChanneling Collaborative Spirit
Channeling Collaborative Spirit
 
2to3
2to32to3
2to3
 
Dagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søkDagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søk
 
Gene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KGene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2K
 

Similaire à Opportunities and challenges presented by Wikidata in the context of biocuration

RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
RDMkit, a Research Data Management Toolkit.  Built by the Community for the ...RDMkit, a Research Data Management Toolkit.  Built by the Community for the ...
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
Carole Goble
 
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECAProject
 

Similaire à Opportunities and challenges presented by Wikidata in the context of biocuration (20)

Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
 
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeBioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
 
BioWikis BSB10
BioWikis BSB10BioWikis BSB10
BioWikis BSB10
 
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challengeScott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
Scott Edmunds at OASP Asia: Open (and Big) Data – the next challenge
 
Scott Edmunds flashtalk slides from Beyond the PDF2
Scott Edmunds flashtalk slides from Beyond the PDF2Scott Edmunds flashtalk slides from Beyond the PDF2
Scott Edmunds flashtalk slides from Beyond the PDF2
 
Making Data FAIR on WikiData - Andra Waagmeester
Making Data FAIR on WikiData - Andra WaagmeesterMaking Data FAIR on WikiData - Andra Waagmeester
Making Data FAIR on WikiData - Andra Waagmeester
 
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
RDMkit, a Research Data Management Toolkit.  Built by the Community for the ...RDMkit, a Research Data Management Toolkit.  Built by the Community for the ...
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
 
dkNET Annual Meeting - June 2017
dkNET Annual Meeting - June 2017dkNET Annual Meeting - June 2017
dkNET Annual Meeting - June 2017
 
Ilik - Beyond the Manuscript: Using IRs for Non Traditional Content Types
Ilik - Beyond the Manuscript: Using IRs for Non Traditional Content TypesIlik - Beyond the Manuscript: Using IRs for Non Traditional Content Types
Ilik - Beyond the Manuscript: Using IRs for Non Traditional Content Types
 
HKU Data Curation MLIM7350 Class 10
HKU Data Curation MLIM7350 Class 10HKU Data Curation MLIM7350 Class 10
HKU Data Curation MLIM7350 Class 10
 
HathiTrust Research Center Secure Commons
HathiTrust Research Center Secure CommonsHathiTrust Research Center Secure Commons
HathiTrust Research Center Secure Commons
 
Ouellette elixir 2017
Ouellette elixir 2017Ouellette elixir 2017
Ouellette elixir 2017
 
Mphil Computational Biology Seminar Series Presentation (20201111)
Mphil Computational Biology Seminar Series Presentation (20201111)Mphil Computational Biology Seminar Series Presentation (20201111)
Mphil Computational Biology Seminar Series Presentation (20201111)
 
20200901 ECCB M. Kutmon
20200901 ECCB M. Kutmon20200901 ECCB M. Kutmon
20200901 ECCB M. Kutmon
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
 
NCBO Web Services: Powering Semantically Aware Applications
NCBO Web Services: Powering Semantically Aware ApplicationsNCBO Web Services: Powering Semantically Aware Applications
NCBO Web Services: Powering Semantically Aware Applications
 
Nicole Nogoy at the Auckland BMC RoadShow
Nicole Nogoy at the Auckland BMC RoadShowNicole Nogoy at the Auckland BMC RoadShow
Nicole Nogoy at the Auckland BMC RoadShow
 
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
 
(Bio)Hackathons
(Bio)Hackathons(Bio)Hackathons
(Bio)Hackathons
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 

Plus de Benjamin Good

Integrating Pathway Databases with Gene Ontology Causal Activity Models
Integrating Pathway Databases with Gene Ontology Causal Activity ModelsIntegrating Pathway Databases with Gene Ontology Causal Activity Models
Integrating Pathway Databases with Gene Ontology Causal Activity Models
Benjamin Good
 
2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata
Benjamin Good
 
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery (Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
Benjamin Good
 
Building a massive biomedical knowledge graph with citizen science
Building a massive biomedical knowledge graph with citizen scienceBuilding a massive biomedical knowledge graph with citizen science
Building a massive biomedical knowledge graph with citizen science
Benjamin Good
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
Benjamin Good
 
The Cure: Making a game of gene selection for breast cancer survival prediction
The Cure: Making a game of gene selection for breast cancer survival predictionThe Cure: Making a game of gene selection for breast cancer survival prediction
The Cure: Making a game of gene selection for breast cancer survival prediction
Benjamin Good
 
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
Benjamin Good
 
Mark2Cure: a crowdsourcing platform for biomedical literature annotation
Mark2Cure: a crowdsourcing platform for biomedical literature annotationMark2Cure: a crowdsourcing platform for biomedical literature annotation
Mark2Cure: a crowdsourcing platform for biomedical literature annotation
Benjamin Good
 
An online game for human phenotype prediction
An online game for human phenotype predictionAn online game for human phenotype prediction
An online game for human phenotype prediction
Benjamin Good
 

Plus de Benjamin Good (18)

Representing and reasoning with biological knowledge
Representing and reasoning with biological knowledgeRepresenting and reasoning with biological knowledge
Representing and reasoning with biological knowledge
 
Integrating Pathway Databases with Gene Ontology Causal Activity Models
Integrating Pathway Databases with Gene Ontology Causal Activity ModelsIntegrating Pathway Databases with Gene Ontology Causal Activity Models
Integrating Pathway Databases with Gene Ontology Causal Activity Models
 
Pathways2GO: Converting BioPax pathways to GO-CAMs
Pathways2GO: Converting BioPax pathways to GO-CAMsPathways2GO: Converting BioPax pathways to GO-CAMs
Pathways2GO: Converting BioPax pathways to GO-CAMs
 
Knowledge Beacons
Knowledge BeaconsKnowledge Beacons
Knowledge Beacons
 
Science Game Lab
Science Game LabScience Game Lab
Science Game Lab
 
Gene Wiki and Wikimedia Foundation SPARQL workshop
Gene Wiki and Wikimedia Foundation SPARQL workshopGene Wiki and Wikimedia Foundation SPARQL workshop
Gene Wiki and Wikimedia Foundation SPARQL workshop
 
Scripps bioinformatics seminar_day_2
Scripps bioinformatics seminar_day_2Scripps bioinformatics seminar_day_2
Scripps bioinformatics seminar_day_2
 
Computing on the shoulders of giants
Computing on the shoulders of giantsComputing on the shoulders of giants
Computing on the shoulders of giants
 
2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata2016 bd2k bgood_wikidata
2016 bd2k bgood_wikidata
 
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery (Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery
 
Citizen sciencepanel2015 pdf
Citizen sciencepanel2015 pdfCitizen sciencepanel2015 pdf
Citizen sciencepanel2015 pdf
 
Building a massive biomedical knowledge graph with citizen science
Building a massive biomedical knowledge graph with citizen scienceBuilding a massive biomedical knowledge graph with citizen science
Building a massive biomedical knowledge graph with citizen science
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
 
Serious games for bioinformatics education. ISMB 2014 education workshop
Serious games for bioinformatics education.  ISMB 2014 education workshopSerious games for bioinformatics education.  ISMB 2014 education workshop
Serious games for bioinformatics education. ISMB 2014 education workshop
 
The Cure: Making a game of gene selection for breast cancer survival prediction
The Cure: Making a game of gene selection for breast cancer survival predictionThe Cure: Making a game of gene selection for breast cancer survival prediction
The Cure: Making a game of gene selection for breast cancer survival prediction
 
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst...
 
Mark2Cure: a crowdsourcing platform for biomedical literature annotation
Mark2Cure: a crowdsourcing platform for biomedical literature annotationMark2Cure: a crowdsourcing platform for biomedical literature annotation
Mark2Cure: a crowdsourcing platform for biomedical literature annotation
 
An online game for human phenotype prediction
An online game for human phenotype predictionAn online game for human phenotype prediction
An online game for human phenotype prediction
 



Opportunities and challenges presented by Wikidata in the context of biocuration

  • 1. Opportunities and challenges presented by Wikidata in the context of biocuration Benjamin Good BioCreative, Corvallis Oregon 2016 @bgood bgood@scripps.edu http://www.slideshare.net/goodb
  • 2. Road Map • Introduction to Wikidata • Wikidata and biocuration • Wikidata and BioCreative
  • 3. Is to data as Wikipedia is to text “Giving more people more access to more knowledge” A free and open repository of knowledge • Initiated by WikiMedia Germany • In transition to the WikiMedia Foundation • Not a grant funded ‘project’… as stable as Wikipedia
  • 4. It’s a knowledge base! • Anyone can edit (human or robot) • Anyone can use (CC0)
  • 5. Elements of the kb are called ‘items’ https://www.wikidata.org/wiki/Q146
  • 6. Items are unique concepts, used to link different language Wikipedias together Q146 Af:Kat En:cat Als:Hauskatze Ang:Catte Av:Keto
  • 7. Items are described by “statements” that link together to form the language-independent wikidata knowledge graph Cat Domesticated Animal Animal Subclass Of Subclass Of Animalia Taxon name Kingdom Taxon rank
  • 9. Item: Q414043 RELN Genomic start: 103471784 GenLoc assembly: GRCh38 Stated in: Ensembl Release 83 Retrieved: 19 January 2016 Value (numeric) Property Claim Qualifiers References https://www.wikidata.org/wiki/Q414043 Statement Genomic position for Reelin gene
  • 10. Item: Q414043 RELN Encodes: Reelin (protein) Stated in: NCBI homo sapiens annotation release 107 Retrieved: 19 January 2016 Value (item) Property Claim Qualifiers References https://www.wikidata.org/wiki/Q414043 Statement Linking the Reelin gene to a protein it encodes
  • 11. Item: Q13561329 Reelin Cell component: dendrite Determination method: • ISS (Sequence or structural Similarity) • IEA (Electronic annotation) Stated in: Uniprot Retrieved: 21 March 2016 Value (item) Property Claim Qualifiers References https://www.wikidata.org/wiki/Q13561329 Statement Gene ontology annotation for Reelin protein with evidence codes modeled as qualifiers
  • 12. graphical view RELN Reelin encodes dendrite cellular component claim ISS IEA Determination method: qualifiers UniProt stated in retrieved 21 March 2016 References Statement
  • 13. Inter-item links form a giant knowledge graph Everything is connected Reelin, Heart disease, Barack Obama, everything.. https://query.wikidata.org SPARQL endpoint for Wikidata
  • 14. Sample of current biomedical content • All human, mouse genes and proteins (Swiss-Prot) • All Gene Ontology terms • All Human Disease Ontology terms • All FDA approved drugs • 109 reference microbial genomes Burgstaller-Muehlbacher et al (2016) Database Mitraka et al (2015) Semantic Web Applications for the Life Sciences Putman et al (2016) Database
  • 15. http://tinyurl.com/biowiki-sparql Sample queries that are currently possible: • “GO cellular localization annotations for Reelin with evidence code ISS” • “Diseases treated by Metformin” • “Diseases that might be treated by Metformin” http://query.wikidata.org
  • 16. Example question: repurposing Metformin http://tinyurl.com/zem3oxz Metformin ?disease interacts with protein SLC22A3 encoded by genetic association Might treat ? Solute carrier family 22 member 3 SLC22A3 prostate cancer
  • 17. Road Map • Introduction to Wikidata • Wikidata and biocuration • Wikidata and BioCreative
  • 18. API Flatfiles The dominant paradigm for open biocuration API Flatfiles Your Database Your Database Your Database xrefs Your Database Pain points • API or flatfile parsing • Ambiguous or non-existent xrefs • Persistence of funding • Too much information to curate My Web Application My Database My Database Curators My Research Grants $ Biomedical knowledge
  • 19. A new paradigm for open biocuration? My Application Our Database? Our Database Curators And our community Biomedical knowledge My Application My Application My Research Grants $ Reducing the pain • Reduces API/parser proliferation • Forces up-front integration • Facilitates coordination • Ensures that if funding is lost, data is not • Invites community input
  • 20. A new platform for open biocuration? My Application Our Database Curators And our community Biomedical knowledge My Application My Application My Research Grants $ • SPARQL = a common API for accessing content • 1 endpoint to maintain… • It's working
  • 21. The first application built on wikidata is Wikipedia Our Database Curators And our community Biomedical knowledge Our Applications Our Applications Su, Schriml, Pavlidis R01 Grant… $
  • 24. Impact of wikidata on Wikipedia Gene Wiki Version 1. {{GNF_Protein_box | Name = Reelin| image = | image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 | MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 | IUPHAR = | ChEMBL = | OMIM = None | ECnumber = | Homologene = 9349 | GeneAtlas_image1 = | GeneAtlas_image2 = | GeneAtlas_image3 = | Protein_domain_image = | Function = {{GNF_GO|id=GO:0005515 |text = protein binding}} {{GNF_GO|id=GO:0016787 |text = hydrolase activity}} {{GNF_GO|id=GO:0046872 |text = metal ion binding}} | Component = {{GNF_GO|id=GO:0005739 |text = mitochondrion}} | Process = {{GNF_GO|id=GO:0008152 |text = metabolic process}} | Hs_EntrezGene = 51110 | Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA = NM_016027 | Hs_RefseqProtein = NP_057111 | Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 | Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174 | Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 | Mm_Ensembl = ENSMUSG00000025937 | Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein = NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr = 1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end = 13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}} = Gene Wiki Version 2. {{Infobox gene}} • All data in Wikidata • 1 Lua script works for all genes = (1 of these for every gene)
  • 25. Wikidata use increasing on Wikipedia • https://en.wikipedia.org/wiki/Category:Templates_using_data_from_Wikidata • 81 templates indicate that they use it
  • 26. Application #2. Centralized Model Organism Database (CMOD) http://sulab.scripps.edu/CMOD/
  • 27. The next application built on wikidata, yours? Our community Biomedical knowledge CMOD ????
  • 28. Road Map • Introduction to Wikidata • Wikidata and biocuration • Wikidata and BioCreative
  • 29. Challenges • Community ontology building • Establishing computable trust • Expanding the knowledge base “Dogs and cats living together! Mass hysteria!” (leave that for ICBO) BioCreative Challenges?
  • 30. ‘Statements’ on Wikidata 2013 2016 100M Statements Bad Good Ugly 60M 20M https://tools.wmflabs.org/wikidata-todo/stats.php
  • 31. Computable trust RELN Genomic start: 103471784 GenLoc assembly: GRCh38 Claim Add References 1. Add references 2. Check that references concur with the claim or not 3. Estimate ‘truthiness’ of claim 4. Provide humans with sources to follow up. • References can come from databases, articles in PubMed, etc. Bad Ugly Good
  • 32. Expanding the knowledge base RELN ? ? ? New Claims • Given external knowledge source (text or database) • Create claims and references automatically with very high precision • Allow for human verification PMID: 77901 PMID: 523070
  • 33. Unique characteristics of Wikidata with regard to IE tasks • 16,000+ ‘active’ editors and growing • Could be a powerful crowdsourcing resource • Must be kept involved or will block progress • Constrained data model and some limits on content type • CC0 requirement
  • 34. One known attempt: “StrepHit” • Individual Engagement Grant (IEG) from the Wikimedia Foundation (30k, Start Jan. 2016) • Goal to: • “Generate trust and reliability over Wikidata content” • “Alleviate the burden of manual curation” (sounds familiar, right?) • Ended up working on Biographical and Soccer data…
  • 35. StrepHit NLP pipeline: https://github.com/Wikidata/StrepHit Text corpus Claim Extractor “Primary sources” tool on wikidata
  • 36. ‘Primary Sources’ optional userscript that wikidata users can install. Approving a suggested reference for the claim that came from German Wikipedia https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
  • 37. The Wikidata Game(s) (= microtasks…) https://tools.wmflabs.org/wikidata-game/ https://tools.wmflabs.org/wikidata-game/distributed/ Code for making your own!
  • 38. StrepHit, Primary Sources, Wikidata games… All works in progress https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal
  • 39. BioCreative Challenges with Wikidata ? Our Database Curators And our community Biomedical knowledge ???? ?
  • 40. Acknowledgements Gene Wikidata Team Andra Waagmeester (Micelio) Sebastian Burgstaller (Scripps) Tim Putman (Scripps) Elvira Mitraka (U Maryland) Julia Turner (Scripps) Justin Leong (UBC) Lynn Schriml (U Maryland) Paul Pavlidis (UBC) Andrew Su (Scripps) Ginger Tsueng (Scripps) Contact bgood@scripps.edu @bgood on twitter Adapted logo Su Laboratory at TSRI The 16,950 other active editors of Wikidata and especially the 693 that joined last month and the 809 that joined the month before that and the 721 that joined the month before that.. This work was supported by the US National Institutes of Health (grants GM089820 and U54GM114833) and by the Scripps Translational Science Institute with an NIH-NCATS Clinical and Translational Science Award (CTSA; 5 UL1 TR001114).
  • 41.
  • 42. Social controls • Anyone can • Add or edit labels, descriptions, statements, references etc. on existing items • Create new items • Link items to Wikipedia articles • Query using https://query.wikidata.org • Read and write small numbers of edits with https://www.wikidata.org/w/api.php • Propose a new property • Request a bot account for high-volume automated editing Here be dragons..
  • 43. Properties (as of April 10, 2016) • 2196 active properties • 114 new properties that have been proposed but not yet approved Proposal https://www.wikidata.org/wiki/Wikidata:Property_proposal
  • 44. After proposal, community discussion • Each property is left open for discussion by anyone until • An administrator or other person blessed with the power either creates it or decides not to create it based on the discussion • People that enjoy ontology arguments needed here! Lengthy (cut-off) discussion of proposal for ‘extinct’ property
  • 47. Proposal discussions • Cannot be avoided • The discussions are long and tiring but important • Many of the people involved are quite experienced • All are trying to make something great • Persistence and patience required
  • 48. RELN Reelin encodes dendrite cellular component claim ISS IEA Determination method: qualifiers UniProt stated in retrieved 21 March 2016 References Statement SPARQL graph.. http://tinyurl.com/biowiki-sparql
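
The statement structure shown on slides 9 to 12 and 48 (a claim, plus qualifiers, plus references) can be sketched as a small data structure. This is only an illustration: the `Statement` class, its field names, and the `has_reference` helper are hypothetical and are not part of Wikidata or of any client library. The example values are the ones printed on the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Statement:
    """Hypothetical model of one Wikidata statement (illustration only)."""
    item_id: str                      # e.g. "Q13561329"
    label: str                        # human-readable item label
    prop: str                         # property of the claim
    value: Union[str, int]            # another item's label or a literal
    qualifiers: Dict[str, List[str]] = field(default_factory=dict)
    references: Dict[str, str] = field(default_factory=dict)

    def edges(self) -> List[Tuple[str, str, str]]:
        """Flatten the statement into (source, edge label, target) tuples,
        roughly mirroring the 'graphical view' slide."""
        out = [(self.label, self.prop, str(self.value))]
        for name, values in self.qualifiers.items():
            out.extend(("claim", name, v) for v in values)
        for name, value in self.references.items():
            out.append(("statement", name, value))
        return out


def has_reference(statement: Statement) -> bool:
    """True if the statement carries at least one reference field."""
    return bool(statement.references)


# Values copied from slide 11 (GO annotation for the Reelin protein).
reelin_go = Statement(
    item_id="Q13561329",
    label="Reelin",
    prop="cell component",
    value="dendrite",
    qualifiers={"determination method": ["ISS", "IEA"]},
    references={"stated in": "UniProt", "retrieved": "21 March 2016"},
)

# The bare claim from slide 31, before any reference has been added.
reln_start = Statement("Q414043", "RELN", "genomic start", 103471784)

for edge in reelin_go.edges():
    print(edge)
print(has_reference(reelin_go), has_reference(reln_start))
```

Keeping qualifiers and references separate from the claim is what makes the "computable trust" task on slide 31 expressible: a checker can look for statements where `has_reference` is false and try to attach a source.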

Editor's notes

  1. I will spend a good portion of the talk explaining what wikidata is. With that, all of you smart people can start thinking on your own about how it might influence your work. Of course I will provide some possible ideas.
  2. Labels and descriptions in many languages
  3. About 25 million items and 100 million statements, resulting in about 1 billion triples in the SPARQL endpoint.
  4. From what it is to what we can do with it.
  5. This is the central point I want to make. Wikidata can be used to build knowledge-based applications, lowering the barrier to entry for building apps and reducing challenges of downstream data integration. Before coming back to this, I will explain why.
  6. This is the central point I want to make. Wikidata can be used to build knowledge-based applications, lowering the barrier to entry for building apps and reducing challenges of downstream data integration. May
  7. By mixing the data into Wikidata, we reduce API proliferation, easing application formation. Over 1 billion triples. Fast: 20-30 queries per second, averaging about 6 seconds to answer a query. Stable since around September 2015.
  8. By mixing the data into Wikidata, we reduce API proliferation, easing application formation. Over 1 billion triples. Fast. Stable since around September 2015.
  9. This is the first application of the work that we have done
  10. By mixing the data into Wikidata, we reduce API proliferation, easing application formation. Over 1 billion triples. Fast. Stable since around September 2015.
  11. Now that we have some ideas about how it can be used, consider its problems and the ways that the NLP community might help solve them.
  12. Ontology: define properties and patterns for their use. Trust: given a claim recorded on Wikidata, verify that it matches (or conflicts with) statements made in other sources and provide references to those sources. Data: add more.
  13. Given a claim, validate or invalidate it and provide a reference. One could easily come up with many thousands of claims of the ugly or bad kind in the biomedical domain.
  14. Given a reference, generate claims
  15. http://www.slideshare.net/MarcoFossati/strephit-ieg-kickoff-seminar
  16. Lexicographical analysis, relation extraction, frame semantics, machine learning. Very ambitious goal of producing both the “A box” and the “T box” (i.e., both identifying new properties and extracting relations using them).
  17. Tool provided by a Google project for loading data from Freebase; it requires a human ‘thumbs up’. Currently the StrepHit team is proposing to shift their work to improving this tool. https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal https://meta.wikimedia.org/wiki/Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Support_from_ContentMine
  18. By mixing the data into Wikidata, we reduce API proliferation, easing application formation. Over 1 billion triples. Fast. Stable since around September 2015.
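
To make the "one common API" point concrete, here is a minimal sketch of how an application might address the SPARQL endpoint behind https://query.wikidata.org. The helper names (`subclass_query`, `request_url`, `fetch_bindings`) and the User-Agent string are made up for this example. It assumes the machine endpoint at `https://query.wikidata.org/sparql`, the `wd:`/`wdt:` prefixes that the query service predefines, and `P279` as the "subclass of" property; `Q146` is the cat item from slide 5. Whether that item carries such statements today is not checked here, and the network call is defined but never invoked.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"


def subclass_query(item_id: str, limit: int = 10) -> str:
    """Build a SPARQL query for the 'subclass of' (P279) values of an item."""
    return (
        "SELECT ?parent ?parentLabel WHERE {\n"
        f"  wd:{item_id} wdt:P279 ?parent .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def request_url(sparql: str) -> str:
    """Encode the query as a GET URL asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


def fetch_bindings(sparql: str) -> list:
    """Run the query over the network and return the result rows.

    Not called in this sketch; the User-Agent value is a placeholder.
    """
    req = Request(request_url(sparql),
                  headers={"User-Agent": "biocuration-example/0.1"})
    with urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]


query = subclass_query("Q146", limit=5)
print(query)
print(request_url(query))
```

Every application that follows this pattern shares the same endpoint and query language, so swapping the cat item for a gene or disease item changes only the query text, not the client code.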