NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Andrew Su, Ph.D.
The Scripps Research Institute

NCBO Webinar

February 20, 2013

2
Human genetics underlies human health
Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …
Structured gene annotations

~3 billion bases, ~20,000 genes

Molecular diagnostics & therapeutics

3
Structured gene annotations enable computation

Structured gene annotations

4
Few genes are well annotated

[Figure: GO annotation counts per gene (y-axis) across 20,473 protein-coding genes, sorted by decreasing counts (x-axis). Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Marked on the plot: 65%, 41%.]

Data: NCBI, February 2013

5
GO annotation counts + electronic annotation (IEA)



6
GO annotation counts + electronic annotation (IEA), Biological Process only



7

311,696 articles (1.5% of PubMed)
have been cited by GO annotations

8

Sooner or later, the
research community will
need to be involved in the
annotation effort to scale
up to the rate of data
generation.

9

Crowdsourcing
empowers the entire
scientific community to
directly participate in the
gene annotation process.

10
From crowdsourcing to structured data

The Gene Wiki

GeneGames.org

11
10,000 gene “stubs” within Wikipedia

• Protein structure
• Gene summary
• Symbols and identifiers
• Gene Ontology annotations
• Protein interactions
• Tissue expression pattern
• Linked references
• Links to structured databases

Huss, PLoS Biol, 2008

12
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

13
Gene Wiki has a critical mass of editors

[Figure: two panels, editor count (Editors) and edit count (Edits)]

Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011

14
A review article for every gene is powerful

Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
References to the literature
Hyperlinks to related concepts

Filtering, extracting, and summarizing PubMed

Documents

Concepts

16
Document- and concept-centric text mining
Subject → Predicate → Object

17
Simple text mining for gene annotations

NCBI Entrez Gene: 334
Gene Wiki mapping
Wikilink
GO exact match
GO:0006897
Candidate assertion

• 6319 novel Gene Ontology annotations
• 2147 novel Disease Ontology annotations

Good, BMC Genomics, 2011.
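
The pipeline on this slide can be sketched roughly as follows. This is a minimal illustration, not the published method: the function name, the regular expression, the one-entry GO lookup table, and the sample article text are all assumptions made for the example. Only the gene ID (334) and the GO ID (GO:0006897, endocytosis) come from the slide. Matching here is exact on the term name after lowercasing.

```python
import re

# Toy GO lookup (term name -> GO ID). A real run would load the whole ontology;
# only this one ID appears on the slide.
GO_TERMS = {"endocytosis": "GO:0006897"}

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_annotations(entrez_gene_id, wikitext, go_terms=GO_TERMS):
    """Return (gene ID, GO ID, link target) for wikilinks that exactly match a GO term name."""
    found = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        go_id = go_terms.get(target.lower())
        if go_id is not None:
            found.append((entrez_gene_id, go_id, target))
    return found


# Hypothetical wikitext, written for this example only.
sample = "The protein takes part in [[Endocytosis]] and binds [[DNA|nucleic acid]]."
print(candidate_annotations(334, sample))
```

Each returned tuple is a candidate assertion only; it still needs filtering or review before it counts as an annotation.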

18
Gene Wiki content improves enrichment analysis

[Figure: enrichment p-values, p-value (PubMed + GW) vs. p-value (PubMed only). Regions labeled "More significant (PubMed + GW)" and "More significant (PubMed only)". Highlighted term: Muscle contraction.]
Good, BMC Genomics, 2011.

19
Gene Wiki+ for integrative queries

mwsync

Good, J Biomed Semantics, 2012.
http://genewikiplus.org

20
Dynamic queries across genes, diseases, SNPs


21

mwsync

OMIM
PharmGKB

{{#ask:
[[Category:Human_proteins]]
[[is_associated_with::<q>[[Category:Breast_cancer]]</q>]]
[[HasSNP::
…
<q>[[is_associated_with::
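
A query of this shape could also be sent programmatically, assuming the wiki exposes Semantic MediaWiki's `ask` API module. The sketch below only builds the request URL and makes no network call. The `api.php` path on genewikiplus.org, the helper name, and the use of `HasSNP` as a printout are assumptions for illustration. The truncated parts of the slide's query are left out rather than guessed.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_ask_url(api_endpoint, conditions, printouts=()):
    """Build a Semantic MediaWiki `action=ask` URL from conditions and printouts."""
    query = "".join(conditions) + "".join("|?" + p for p in printouts)
    params = {"action": "ask", "query": query, "format": "json"}
    return api_endpoint + "?" + urlencode(params)


# The two complete conditions from the slide.
conditions = [
    "[[Category:Human_proteins]]",
    "[[is_associated_with::<q>[[Category:Breast_cancer]]</q>]]",
]

# Assumed endpoint path; adjust to wherever the wiki serves api.php.
url = build_ask_url("http://genewikiplus.org/api.php", conditions, printouts=["HasSNP"])
print(url)
```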

22

mwsync

OMIM
PharmGKB


23
Wikidata

Provide a database of the
world's knowledge that
anyone can edit
- Denny Vrandečić

24
Wikidata
Reelin (Q414043)

Property:P31 (is a): Protein (Q8054); Glycoprotein (Q187126)
Property:P128 (regulates): Neural development (Q1345738)
Property:P129 (interacts with): VLDL receptor (Q1979313); Amyloid precursor protein (Q423510)
http://www.wikidata.org/wiki/Q414043

25
Wikidata
Q414043

Property:P31: Q8054; Q187126
Property:P128: Q1345738
Property:P129: Q1979313; Q423510

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
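
As a rough sketch of how a client might read these statements from the `wbgetentities` call above: the code builds the request URL and flattens a response into property-to-item lists. The sample response is hand-written and trimmed to the shape the API is understood to return (item values carrying a `numeric-id`). It mirrors the IDs on the slide and is not a capture of live data. `format=json` and the helper names are additions for the example.

```python
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"  # endpoint as shown on the slide


def entity_url(item_id, language="en"):
    """Build the wbgetentities request URL for one item."""
    params = {"action": "wbgetentities", "ids": item_id,
              "languages": language, "format": "json"}
    return API + "?" + urlencode(params)


def item_claims(response, item_id):
    """Map each property ID to the list of item IDs its statements point to."""
    claims = response["entities"][item_id].get("claims", {})
    result = {}
    for prop, statements in claims.items():
        targets = []
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, dict) and "numeric-id" in value:
                targets.append("Q%d" % value["numeric-id"])
        if targets:
            result[prop] = targets
    return result


def _stmt(numeric_id):
    # Helper that builds one trimmed, hand-written statement for the sample below.
    return {"mainsnak": {"datavalue": {"value": {"entity-type": "item",
                                                 "numeric-id": numeric_id}}}}


# Hand-written sample mirroring the slide; the live item may look different.
sample = {"entities": {"Q414043": {"claims": {
    "P31": [_stmt(8054), _stmt(187126)],
    "P128": [_stmt(1345738)],
    "P129": [_stmt(1979313), _stmt(423510)],
}}}}

print(entity_url("Q414043"))
print(item_claims(sample, "Q414043"))
```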

26
Wikidata

http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force

27
Wikidata

http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force

28

The Gene Wiki

GeneGames.org

29
Not just the biomedical literature…

30
BioGPS aggregates gene-centric information

http://biogps.org
Wu, NAR, 2013; Wu, Genome Biology, 2009.

31
The plugin interface is simple and universal

Pubmed
http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}

STRING
http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}

KEGG
http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}

Gene entity + URL template → Rendered URL
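
The rendering step can be sketched as a simple placeholder substitution. This is an illustration under assumptions, not BioGPS's actual code: the function name and the dictionary layout of the gene entity are invented for the example. The KEGG template is the one shown above, and 7157 is the Entrez Gene ID for TP53.

```python
import re

# Matches {{FieldName}} placeholders such as {{Symbol}} or {{EntrezGene}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Replace each {{Field}} in the template with the matching gene field."""
    def fill(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError("gene entity has no field %r" % key)
        return str(gene[key])
    return PLACEHOLDER.sub(fill, template)


# Hypothetical gene entity; the field names follow the placeholders on the slide.
gene = {"Symbol": "TP53", "EntrezGene": 7157}
kegg = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
print(render_plugin_url(kegg, gene))
```

Because a plugin is just a URL with named slots, any site that accepts a gene identifier in its URL can be registered without changes on the remote side.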

32

33

34

35

36

Total of 389 gene-centric online
databases registered as BioGPS plugins

37
BioGPS has a critical mass of users
Daily pageviews

• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week

Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

38
All resources should provide RDF…

39
Mining structured content from HTML

40
Defining a data extraction template
TP53 TNF APOE IL6 VEGF EGFR TGFB1 …
…

41
The BioGPS Semantic Annotator

http://54.244.135.254:8000/

42

The Gene Wiki

GeneGames.org

43

Seven million human hours

http://www.flickr.com/photos/archana3k1/4124330493/

44

Twenty million human hours

http://www.flickr.com/photos/ableman/2171326385/

45
150 billion human hours
per year

http://www.flickr.com/photos/rvp-cw/6243289302/

46
Using games to fold proteins

Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)

47
Using games to fold RNAs

http://eterna.cmu.edu/

48
Using games to align sequences

http://phylo.cs.mcgill.ca
Kawrykow, PLOS ONE, 2012.

49
Using games to annotate genes?

http://genegames.org

50
No good gene-disease annotation database
Query: Apolipoprotein E

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease

51

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility

52

? Alzheimer's disease (AD)
? Lipoprotein glomerulopathy
? Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
? Macular degeneration, age-related
? Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases

53

Alzheimer's disease (AD)
Memory
Coronary Artery Disease
Neuropsychological Tests
Hypertension
Cognition Disorders
Mental Status Schedule
Psychiatric Status Rating Scales
Dementia
Cognition
Hyperlipidemias
Atrophy
Disease Progression
Dementia, Vascular
Cardiovascular Diseases
Parkinson Disease
Brain Injuries
Coronary Disease
Myocardial Infarction
Diabetes Mellitus, Type 2
Memory Disorders
…

477 diseases!

54
Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is "right")
3. If it's "right", you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!

55
Dizeez players seem pretty smart…

In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences  Gene    Disease             Gene Wiki  OMIM  PharmGKB  PubMed

11             NBPF3   neuroblastoma
11             SOX8    mental retardation
9              ABL1    leukemia
9              SSX1    synovial sarcoma
8              APC     colorectal cancer
8              FES     sarcoma
8              RBP3    retinoblastoma
8              GAST    gastrinoma
8              DCC     colorectal cancer
8              MAP3K5  cancer
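
A table like this can be produced by counting how often players pick the same gene-disease pair. The sketch below assumes each guess is logged as a (gene, disease) pair. That log format, the function name, the threshold, and the toy guesses are all invented for illustration and are not Dizeez data.

```python
from collections import Counter


def tally_guesses(guesses, min_count=2):
    """Count repeated (gene, disease) guesses and keep pairs seen at least min_count times."""
    counts = Counter((gene, disease.lower()) for gene, disease in guesses)
    return [(n, gene, disease)
            for (gene, disease), n in counts.most_common()
            if n >= min_count]


# Invented guesses, for illustration only.
guesses = [
    ("ABL1", "Leukemia"),
    ("APC", "colorectal cancer"),
    ("ABL1", "leukemia"),
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("TNF", "psoriasis"),
]
for n, gene, disease in tally_guesses(guesses):
    print(n, gene, disease)
```

Pairs that many independent players agree on are the natural candidates to check against Gene Wiki, OMIM, PharmGKB, and PubMed.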

56
Using games to predict phenotype from genotype?

http://genegames.org

57
Classification problems in genome biology

Data: 100s samples (cancer, normal) × 100,000s features
Find patterns: SVM, Neural networks, Naïve Bayes, KNN, …
Classify new samples: cancer or normal

58
Random forests
Data: 100s samples (cancer, normal) × 100,000s features
Sample subset of cases and features → Train decision tree
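
The loop on this slide (sample a subset of cases and features, train a tree, repeat) can be sketched in plain Python. To stay short, the sketch swaps full decision trees for one-split "stumps", so it is a simplified stand-in for a real random forest. The toy data, function names, and parameter values are invented for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """Pick the (feature, threshold) among `features` with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train each stump on a bootstrap sample of cases and a random subset of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]         # sample cases
        feats = rng.sample(range(len(rows[0])), n_features)    # sample features
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Invented toy data: 8 samples x 4 features, higher values in "cancer".
rows = [[1, 2, 1, 3], [2, 1, 3, 2], [3, 3, 2, 1], [4, 4, 4, 4],
        [6, 7, 6, 8], [7, 6, 8, 7], [8, 8, 7, 6], [9, 9, 9, 9]]
labels = ["normal"] * 4 + ["cancer"] * 4
forest = train_forest(rows, labels)
print(predict(forest, [0, 0, 0, 0]), predict(forest, [10, 10, 10, 10]))
```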

59
Random forests

Data: 100s samples (cancer, normal) × 100,000s features

60
Random forests

Data: 100s samples (cancer, normal) × 100,000s features
Classify new samples: cancer or normal

How to interject biological knowledge?

61
Network-guided forests

Dutkowski & Ideker (2011). PLoS Computational Biology

62
Network-guided forests
Data: 100s samples (cancer, normal) × 100,000s features
Sample features by PPI network → Train decision tree

63
Human-guided forests
Data: 100s samples (cancer, normal) × 100,000s features
Sample features by human intelligence → Train decision tree
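
Compared with a plain random forest, the only step that changes here is how features are drawn. One simple way to picture it is weighted sampling, where the weights come from an outside source such as network information or player choices. The sketch below is that simplification only: it is not the Dutkowski & Ideker algorithm, and how The Cure turns player input into trees is not described on these slides. The weights and function name are invented.

```python
import random


def sample_features(weights, k, rng):
    """Draw up to k distinct features, each draw proportional to its remaining weight."""
    pool = {name: w for name, w in weights.items() if w > 0}
    chosen = []
    while pool and len(chosen) < k:
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # without replacement
    return chosen


# Invented weights, e.g. how often players selected each gene as a predictor.
votes = {"TP53": 12, "EGFR": 7, "TGFB1": 3, "IL6": 1, "APOE": 0}
rng = random.Random(42)
print(sample_features(votes, k=3, rng=rng))
```

Features with zero weight are never drawn, and heavily weighted ones appear in more trees, which is where the outside knowledge enters the forest.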

65
The Cure: Genomic predictors for disease

66

67

68

69

70

71
Human-guided forests

Classify new samples: cancer or normal

72
“Critical Assessment”-style challenge

73
Results

• 214 registered players
– 50% declared knowledge of cancer biology
– 40% self-identified as having Ph.D.
• Prediction results
– 70% correct on survival concordance index
– Best scoring model was 76%
– Player registrations still increasing!
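
For readers unfamiliar with the metric, a concordance index can be computed as below. The slides do not state which exact definition the challenge used, so this is the common Harrell-style form, offered as an assumption: among pairs where one patient has an observed event before the other's recorded time, it is the fraction in which that patient also received the higher predicted risk (ties count half). The data are invented.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the earlier-failing patient has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i has an observed event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented example: survival times, event flags (0 = censored), predicted risks.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))  # 4 of 5 comparable pairs agree -> 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the 70% and 76% figures above.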

74

Crowdsourcing
empowers the entire
scientific community to
directly participate in the
gene annotation process.

75
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members
Katie Fisch
Max Nanis
Ben Good
Chunlei Wu
Salvatore Loguercio

Key group alumni
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco

Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820)
