Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies. Presentation by Anjani K. Dhrangadhariya (Institute of Information Systems, HES-SO Valais-Wallis, Sierre) at SPIE Medical Imaging 2020.
1. Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies
Anjani K. Dhrangadhariya et al.
MedGIFT group
University of Applied Sciences Western Switzerland (HES-SO)
Project supported by European Union
Horizon 2020 grant agreement 825292
SPIE Medical Imaging 2020, 02.16.2020
2. Motivation
Puca, Loredana, et al. "Patient derived organoids to model rare prostate cancer phenotypes." Nature communications 9.1 (2018): 1-10.
> Rare cancers account for ~25% of cancer-related deaths
> Affect fewer than 15 out of 100,000 people per year
> Lower prevalence = fewer patients
> Fewer tumor samples for research
> Lack of robust clinical models
3. Data resource
• Challenges
1) Private datasets
2) Limited size
3) Single center / scanner
4) Small variability
5) Some contain only images / only text
6) No or small subsets of manual annotations
7) Difficult to compare results
[Diagram: the goal is a large, open-access, annotated database]
6. Medical Subject Headings (MeSH)
• Hierarchically organized controlled vocabulary
• Used for cataloguing biomedical information
• 16 thematic categories
• A = Anatomy
• B = Organisms
• C = Diseases …
• Subcategories
[Diagram: a MeSH term and its corresponding MeSH tree code]
Lipscomb, Carolyn E. "Medical subject headings (MeSH)." Bulletin of the Medical Library Association 88.3 (2000): 265.
7. MeSH as annotation
• Manually annotated by National Library of Medicine (NLM) staff
• E.g., all the studies about benign cancer are indexed under the MeSH annotation “Neoplasm”
• Serves as ground-truth annotation
• Not all PMC / PMC-OA articles have MeSH annotations
8. Visual classification
• ImageCLEF medical image annotation challenge (running since 2013)
• A small annotated subset of PMC-OA is used to train CNNs
• Classification into 31 modalities: PET, light microscopy, CT, etc.
• State of the art: superficial modality classification
• 2,000 annotated PMC-OA images → > 90% accuracy
“Deep Multimodal Classification of Image Types in Biomedical Journal Figures”, Andrearczyk and Müller, CLEF 2018
13. Method: Textual approach
[Diagram: (Title + Abstract) records labelled Class_0 / Class_1 feed model training & evaluation; the best-performing model then classifies the remaining “No Class” (un-annotated) records.]
14. Method: Vectors
• Count vector: documents represented by weighted word counts; no semantics
• Word vector: multidimensional, numerical vectors; semantically similar words are projected close together in a geometric space
• Document vector: learns to associate words with document labels
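The count-vector representation above can be sketched in a few lines of pure Python. The vocabulary and example document below are invented for illustration, and the real pipeline may additionally weight the raw counts (e.g. with TF-IDF), as the slide's "weighted word counts" suggests:

```python
from collections import Counter

def count_vector(document, vocabulary):
    """Bag-of-words: represent a document as word counts over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Word order and meaning are ignored, matching the "no semantics" point above.
vocab = ["cancer", "rare", "cell"]
count_vector("Rare cancer cell lines model rare phenotypes", vocab)  # -> [1, 2, 1]
```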
16. Getting “human” images
• Label rule: “human” ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∈ {MeSH} & other B01 codes ∉ {MeSH}; “not human” ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∉ {MeSH}
• Labelled DLMI (Title + Abstract) records are split into 80% training set and 20% test set
• Model training and evaluation: 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor
• Features: 1. Count vectors 2. Word vectors 3. Document vectors
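The MeSH-code rule for the “human” class can be sketched as a small labelling function. This is a pure-Python illustration: the tree number is the Homo sapiens code given on the slide, while the exact code-matching conventions of the real pipeline (and how records matching neither rule are handled) are assumptions:

```python
# MeSH tree code for Homo sapiens, as given on the slide.
HUMAN_CODE = "B01.050.150.900.649.313.988.400.112.400.400"

def label_human(mesh_codes):
    """'human' iff the Homo sapiens code is present and is the only B01 (organism) code."""
    b01_codes = {c for c in mesh_codes if c.startswith("B01")}
    return "human" if b01_codes == {HUMAN_CODE} else "not human"
```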
17. Getting “human” images
• Best-performing model, hyper-parameters and vectors: SVM with count vectors
• The best model classifies the un-annotated (no-MeSH) DLMI (Title + Abstract) records into “human” and “not human”
19. Getting “neoplastic” images
• Label rule: “neoplastic” ⇔ C04 ∈ {MeSH}; “not neoplastic” ⇔ C04 ∉ {MeSH}
• Labelled “human” (Title + Abstract) records are split into 80% training set and 20% test set
• Model training and evaluation: 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor
• Features: 1. Count vectors 2. Word vectors 3. Document vectors
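The C04 rule is a simple prefix test on MeSH tree codes; C04 is the Neoplasms branch of the diseases category. A pure-Python sketch (how partial tree numbers are stored in the real records is an assumption):

```python
def label_neoplastic(mesh_codes):
    """'neoplastic' iff any MeSH tree code lies under C04 (Neoplasms)."""
    is_c04 = any(c == "C04" or c.startswith("C04.") for c in mesh_codes)
    return "neoplastic" if is_c04 else "not neoplastic"
```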
20. Getting “neoplastic” images
• Best-performing model, hyper-parameters and vectors: SVM with count vectors
• The best model classifies the un-annotated (no-MeSH) “human” records into “neoplastic” and “not neoplastic”
22. Getting “rare cancer” images
• No MeSH terms exist for a “rare cancer” class
• Used a set of 495 {rare cancer} terms from the National Center for Advancing Translational Sciences (NCATS)
https://rarediseases.info.nih.gov/diseases/diseases-by-category/1
23. Getting “rare cancer” images
• Label rule: a “neoplastic” record is “rare cancer” ⇔ (Title + Abstract) ∩ {rare cancer} ≠ Ø; otherwise it is “non-rare cancer”
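The intersection test above can be sketched as a term lookup over the concatenated title and abstract. This is a pure-Python illustration: the two terms shown are hypothetical stand-ins for the 495-term NCATS list, and real matching would likely need tokenization or phrase normalization rather than raw substring tests:

```python
def label_rare_cancer(title_abstract, rare_terms):
    """'rare cancer' iff any rare-cancer term occurs in the title + abstract text."""
    text = title_abstract.lower()
    hit = any(term.lower() in text for term in rare_terms)
    return "rare cancer" if hit else "non-rare cancer"

# Illustrative terms only, not the actual NCATS list.
ncats_terms = {"chordoma", "adrenocortical carcinoma"}
```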
25. Visual: “rare cancer”
• Model training and evaluation: VGG19, ImageNet weights, with and without image augmentation
• Classes: rare cancer / non-rare cancer
26. Visual: “rare cancer”
• The fine-tuned VGG19 classifies the unlabeled (“No Class”) images into “rare cancer” and “non-rare cancer”
27. Results
“human” vs. “non-human” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.69 | 0.71 | 0.68
Textual | SVM | Count vectors | 0.89 | 0.90 | 0.90
28. Results
“neoplastic” vs. “non-neoplastic” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.68 | 0.65 | 0.64
Textual | SVM | Count vectors | 0.99 | 0.99 | 0.99
29. Results
“rare cancer” vs. “non-rare cancer” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.62 | 0.77 | 0.69
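For reference, the precision, recall and F1 figures in these tables relate to classification counts as follows (pure Python; whether the reported figures are per-class or averaged is not stated on the slides):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# With 8 true positives, 2 false positives and 2 false negatives,
# precision = recall = F1 = 0.8.
precision_recall_f1(8, 2, 2)
```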
30. Discussion: Textual vs. Visual
Textual approach
• Outperformed the visual approach on all tasks
• Count vectors with an SVM performed best on both tasks
Visual approach
• Correctly classified some “human” test instances, with a recall of 0.71
• Worst performance on “neoplastic” identification
• “rare cancer” classification reached a recall of 0.77
31. Conclusion
• First study targeting automatic rare cancer image extraction
• The approach combines visual deep learning and textual NLP
• Yields 15,028 light microscopy (DLMI), human, rare cancer images plus the corresponding journal articles
Pipeline: 1. PMC-OA all data → 2. Getting DLMI images → 3. Getting “human” images → 4. Getting “neoplastic” images → 5. Getting “rare cancer” images
32. Thank you for your attention
More information:
http://medgift.hevs.ch
Contact:
anjani.dhrangadhariya@hevs.ch
Follow us:
https://twitter.com/MedGIFT_group
Editor's notes
How are these biomedical publications stored in PubMed?
A PubMed record consists of Title and Abstract from the publication followed by the publication images as shown in thumbnails.
And a list of Medical Subject Headings or MeSH terms that are like annotations describing something about the publication.
All these multimodal elements (the text, the images and the MeSH annotations) are strung together by the unique PubMed Identifier (PMID), and thus share a one-to-one association with each other.
Publications stored in PubMed are annotated with MeSH to enforce uniformity and consistency across the database: all articles about benign cancer are indexed under the MeSH term “Neoplasm”, and all articles or studies involving patients are annotated with the MeSH term “Humans”.
PubMed records are manually annotated with MeSH terms by staff at NLM.
So MeSH terms could be considered as gold standard annotations or groundtruth annotations for a publication.
Not all publications in PubMed have these manually attached MeSH terms.
A small annotated subset of PMC-OA has already been used in ImageCLEF medical image annotation challenge which is a public challenge that has been taking place since 2013.
This small annotated subset of 2000 images was used to train CNNs for image classification into 31 image modality classes…
Including PET, CT images, light microscopy images, et cetera.
This classification approach achieved an overall 90% accuracy for modality classification.
However, this approach only goes till superficial modality classification task.
What about going beyond this generic modality classification into more specialized image sets?
So what we did for navigating towards rare cancer sets was this:
Take all the PMC-OA images of unknown type and classify them using ImageCLEF setup into 31 modality types.
Retain all the images classified as DLMI or diagnostic light microscopy images.
We focus only upon DLMI images because they are fundamental to rare cancer diagnostics.
All the retained DLMI images are linked to their respective title, abstract and MeSH annotations if available.
With this multimodal annotated dataset in hand, we propose an approach for sequential curation of article abstracts and images using MeSH terms to eventually mine-out a large multimodal set of rare cancer images and full-texts.
This involves three subsequent binary classification tasks where we first filter “human” from “non-human” set, followed by separating “neoplastic” from “non-neoplastic” set and finally separating “rare cancer“ from the “non-rare cancer“.
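The three sequential binary filters described above can be sketched as a simple cascade (pure Python; the three predicates are hypothetical stand-ins for the trained classifiers discussed in the talk):

```python
def filter_cascade(records, is_human, is_neoplastic, is_rare_cancer):
    """Keep only the records that pass all three binary filters, applied in sequence."""
    survivors = [r for r in records if is_human(r)]          # human vs. non-human
    survivors = [r for r in survivors if is_neoplastic(r)]   # neoplastic vs. non-neoplastic
    return [r for r in survivors if is_rare_cancer(r)]       # rare vs. non-rare cancer
```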
It has to be noticed that at each binary classification step we compare visual vs. textual approach separately and use MeSH terms as the groundtruth labels for the datasets.
For the visual classification tasks, images were separated into two different classes based on MeSH terms. These class annotated images were used to train and evaluate a VGG19 model using pretrained ImageNet weights and fine-tuned with and without image augmentation
Data augmentation: image mirroring and cropping.
Why do we use VGG?
These fine-tuned models were then used to classify unlabeled images into their respective MeSH classes.
Just like images, text is not directly understandable by ML algorithms, so numerical vectors need to be extracted from it.
Count vectors are extracted from the word counts of a document and do not take semantics into account.
Word vectors are multidimensional numerical vectors in which semantically similar words lie close together, and finally paragraph vectors learn to associate words with document labels.
Lets get back to the pipeline for further curating the previously retrieved DLMI dataset.
«human» records were first filtered out from «non-human records» in following way.
Best performing model setup was used to classify the un-annotated DLMI records into “human” and “non-human”.
This was for the annotated text dataset; similarly, the annotated image dataset was classified using the VGG19 setup.
Next, «neoplastic» or tumor-related records were separated from «non-neoplastic» records in similar manner.
Best performing model setup was used to classify the un-annotated records into “neoplasm” and “non-neoplasm”.
This was for the annotated text dataset; similarly, the annotated image dataset was classified using the VGG19 setup.
Finally, we separate the rare cancer dataset from the non-rare cancer dataset.
Unfortunately, there are no MeSH terms pertaining to “rare cancer”, so we used a pre-defined set of rare cancer terms available from NCATS.
All the records recognized as “neoplasm” were retained and labeled “rare cancer” only if a rare cancer term from the NCATS set was present in the title or abstract.
After getting «rare cancer» and the «non-rare cancer» labels for the individual publications from the previous text classification, we used them to train and evaluate a VGG19 model for this binary classification task.
And the fine-tuned VGG19 setup was used to classify the images into «rare cancer» and «non-rare cancer»
For the «human» classification task, textual approach performed far better than visual approach.
However, a recall of 0.71 hints that the visual classification model does learn something about retaining human images.
For the neoplasm classification task too, textual performed better than visual.
Visual approach did not have good results for this task.
For the final task, a recall of 0.77 does hint that VGG19 model did learn something by better retaining the «rare cancer» images, but it has much room for improvement.
Classification: Individual images ≠ full-texts
In conclusion
This is the first study targeting automatic extraction of rare cancer datasets
It compares both the visual and the textual approaches.
One outcome of this work is a large dataset containing about 15,000 rare cancer images.