Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies. Presentation by Anjani K. Dhrangadhariya (Institute of Information Systems, HES-SO Valais-Wallis, Sierre) at SPIE Medical Imaging 2020.
1. Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies
Anjani K. Dhrangadhariya et al.
MedGIFT group
University of Applied Sciences Western Switzerland (HES-SO)
Project supported by European Union
Horizon 2020 grant agreement 825292
SPIE Medical Imaging 2020, 02.16.2020
2. Motivation
Puca, Loredana, et al. "Patient derived organoids to model rare prostate cancer phenotypes." Nature communications 9.1 (2018): 1-10.
> Rare cancers account for ~25% of cancer-related deaths
> Affect fewer than 15 out of 100,000 people per year
> Lower prevalence = fewer patients
> Fewer tumor samples for research
> Lack of robust clinical models
3. Data resource
• Challenges
1) Private datasets
2) Limited size
3) Single center / scanner
4) Small variability
5) Some contain only images / only text
6) No or small subsets of manual annotations
7) Difficult to compare results
[Diagram: the goal is a large, open-access, annotated database]
6. Medical Subject Headings (MeSH)
• Hierarchically organized controlled vocabulary
• Used for cataloguing biomedical information
• 16 thematic categories
• A = Anatomy
• B = Organisms
• C = Diseases …
• Subcategories
[Diagram: a MeSH term and its corresponding MeSH tree code]
Lipscomb, Carolyn E. "Medical subject headings (MeSH)." Bulletin of the Medical Library Association 88.3 (2000): 265.
7. MeSH as annotation
• Manually annotated by National Library of Medicine (NLM) staff
• E.g., all the studies about benign cancer are indexed under the MeSH annotation “Neoplasm”
• Serves as ground-truth annotation
• Not all PMC / PMC-OA articles have MeSH annotations
8. Visual classification
• ImageCLEF medical image annotation challenge (running since 2013)
• A small annotated subset of PMC-OA is used to train CNNs
• Classification into 31 modalities: PET, light microscopy, CT, etc.
• State of the art: superficial modality classification
• 2,000 annotated PMC-OA images → > 90% accuracy
“Deep Multimodal Classification of Image Types in Biomedical Journal Figures”, Andrearczyk and Müller, CLEF 2018
13. Method: Textual approach
[Diagram: (Title + Abstract) records labelled Class_0 / Class_1 feed model training & evaluation; the best-performing model then classifies the remaining “No Class” (un-annotated) records.]
14. Method: Vectors
• Count vector: documents represented by weighted word counts; no semantics
• Word vector: multidimensional, numerical vectors; semantically similar words are projected close together in a geometric space
• Document vector: learns to associate words with document labels
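The count-vector representation above can be sketched in a few lines of pure Python. The vocabulary and example document below are invented for illustration, and the real pipeline may additionally weight the raw counts (e.g. with TF-IDF), as the slide's "weighted word counts" suggests:

```python
from collections import Counter

def count_vector(document, vocabulary):
    """Bag-of-words: represent a document as word counts over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Word order and meaning are ignored, matching the "no semantics" point above.
vocab = ["cancer", "rare", "cell"]
count_vector("Rare cancer cell lines model rare phenotypes", vocab)  # -> [1, 2, 1]
```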
16. Getting “human” images
• Label rule: “human” ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∈ {MeSH} & other B01 codes ∉ {MeSH}; “not human” ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∉ {MeSH}
• Labelled DLMI (Title + Abstract) records are split into 80% training set and 20% test set
• Model training and evaluation: 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor
• Features: 1. Count vectors 2. Word vectors 3. Document vectors
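The MeSH-code rule for the “human” class can be sketched as a small labelling function. This is a pure-Python illustration: the tree number is the Homo sapiens code given on the slide, while the exact code-matching conventions of the real pipeline (and how records matching neither rule are handled) are assumptions:

```python
# MeSH tree code for Homo sapiens, as given on the slide.
HUMAN_CODE = "B01.050.150.900.649.313.988.400.112.400.400"

def label_human(mesh_codes):
    """'human' iff the Homo sapiens code is present and is the only B01 (organism) code."""
    b01_codes = {c for c in mesh_codes if c.startswith("B01")}
    return "human" if b01_codes == {HUMAN_CODE} else "not human"
```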
17. Getting “human” images
• Best-performing model, hyper-parameters and vectors: SVM with count vectors
• The best model classifies the un-annotated (no-MeSH) DLMI (Title + Abstract) records into “human” and “not human”
19. Getting “neoplastic” images
• Label rule: “neoplastic” ⇔ C04 ∈ {MeSH}; “not neoplastic” ⇔ C04 ∉ {MeSH}
• Labelled “human” (Title + Abstract) records are split into 80% training set and 20% test set
• Model training and evaluation: 1. Logistic regression 2. Support Vector Machine 3. K-nearest neighbor
• Features: 1. Count vectors 2. Word vectors 3. Document vectors
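The C04 rule is a simple prefix test on MeSH tree codes; C04 is the Neoplasms branch of the diseases category. A pure-Python sketch (how partial tree numbers are stored in the real records is an assumption):

```python
def label_neoplastic(mesh_codes):
    """'neoplastic' iff any MeSH tree code lies under C04 (Neoplasms)."""
    is_c04 = any(c == "C04" or c.startswith("C04.") for c in mesh_codes)
    return "neoplastic" if is_c04 else "not neoplastic"
```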
20. Getting “neoplastic” images
• Best-performing model, hyper-parameters and vectors: SVM with count vectors
• The best model classifies the un-annotated (no-MeSH) “human” records into “neoplastic” and “not neoplastic”
22. Getting “rare cancer” images
• No MeSH terms exist for a “rare cancer” class
• Used a set of 495 {rare cancer} terms from the National Center for Advancing Translational Sciences (NCATS)
https://rarediseases.info.nih.gov/diseases/diseases-by-category/1
23. Getting “rare cancer” images
• Label rule: a “neoplastic” record is “rare cancer” ⇔ (Title + Abstract) ∩ {rare cancer} ≠ Ø; otherwise it is “non-rare cancer”
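The intersection test above can be sketched as a term lookup over the concatenated title and abstract. This is a pure-Python illustration: the two terms shown are hypothetical stand-ins for the 495-term NCATS list, and real matching would likely need tokenization or phrase normalization rather than raw substring tests:

```python
def label_rare_cancer(title_abstract, rare_terms):
    """'rare cancer' iff any rare-cancer term occurs in the title + abstract text."""
    text = title_abstract.lower()
    hit = any(term.lower() in text for term in rare_terms)
    return "rare cancer" if hit else "non-rare cancer"

# Illustrative terms only, not the actual NCATS list.
ncats_terms = {"chordoma", "adrenocortical carcinoma"}
```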
25. Visual: “rare cancer”
• Model training and evaluation: VGG19, ImageNet weights, with and without image augmentation
• Classes: rare cancer / non-rare cancer
26. Visual: “rare cancer”
• The fine-tuned VGG19 classifies the unlabeled (“No Class”) images into “rare cancer” and “non-rare cancer”
27. Results
“human” vs. “non-human” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.69 | 0.71 | 0.68
Textual | SVM | Count vectors | 0.89 | 0.90 | 0.90
28. Results
“neoplastic” vs. “non-neoplastic” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.68 | 0.65 | 0.64
Textual | SVM | Count vectors | 0.99 | 0.99 | 0.99
29. Results
“rare cancer” vs. “non-rare cancer” classification
Data type | Classifier | Feature | Precision | Recall | F1-score
Visual | VGG19 | With data augmentation | 0.62 | 0.77 | 0.69
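For reference, the precision, recall and F1 figures in these tables relate to classification counts as follows (pure Python; whether the reported figures are per-class or averaged is not stated on the slides):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# With 8 true positives, 2 false positives and 2 false negatives,
# precision = recall = F1 = 0.8.
precision_recall_f1(8, 2, 2)
```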
30. Discussion: Textual vs. Visual
Textual approach
• Outperformed the visual approach on all tasks
• Count vectors with an SVM performed best on both tasks
Visual approach
• Correctly classified some “human” test instances, with a recall of 0.71
• Worst performance on “neoplastic” identification
• “rare cancer” classification reached a recall of 0.77
31. Conclusion
• First study targeting automatic rare cancer image extraction
• The approach combines visual deep learning and textual NLP
• Yields 15,028 light microscopy (DLMI), human, rare cancer images plus the corresponding journal articles
Pipeline: 1. PMC-OA all data → 2. Getting DLMI images → 3. Getting “human” images → 4. Getting “neoplastic” images → 5. Getting “rare cancer” images
32. Thank you for your attention
More information:
http://medgift.hevs.ch
Contact:
anjani.dhrangadhariya@hevs.ch
Follow us:
https://twitter.com/MedGIFT_group
Editor's notes
How are these biomedical publications stored in PubMed?
A PubMed record consists of Title and Abstract from the publication followed by the publication images as shown in thumbnails.
And a list of Medical Subject Headings or MeSH terms that are like annotations describing something about the publication.
All these multimodal elements (the text, the images and the MeSH annotations) are strung together by the unique PubMed Identifier (PMID), and thus share a one-to-one association with each other.
Publications stored in PubMed are annotated with MeSH to enforce uniformity and consistency across the database: all articles about benign cancer are indexed under the MeSH term “Neoplasm”, and all articles or studies involving patients are annotated with the MeSH term “Humans”.
PubMed records are manually annotated with MeSH terms by staff at NLM.
So MeSH terms could be considered as gold standard annotations or groundtruth annotations for a publication.
Not all publications in PubMed have these manually attached MeSH terms.
A small annotated subset of PMC-OA has already been used in ImageCLEF medical image annotation challenge which is a public challenge that has been taking place since 2013.
This small annotated subset of 2000 images was used to train CNNs for image classification into 31 image modality classes…
Including PET, CT images, light microscopy images, et cetera.
This classification approach achieved an overall 90% accuracy for modality classification.
However, this approach only goes till superficial modality classification task.
What about going beyond this generic modality classification into more specialized image sets?
So what we did for navigating towards rare cancer sets was this:
Take all the PMC-OA images of unknown type and classify them using ImageCLEF setup into 31 modality types.
Retain all the images classified as DLMI or diagnostic light microscopy images.
We focus only upon DLMI images because they are fundamental to rare cancer diagnostics.
All the retained DLMI images are linked to their respective title, abstract and MeSH annotations if available.
With this multimodal annotated dataset in hand, we propose an approach for sequential curation of article abstracts and images using MeSH terms to eventually mine-out a large multimodal set of rare cancer images and full-texts.
This involves three subsequent binary classification tasks where we first filter “human” from “non-human” set, followed by separating “neoplastic” from “non-neoplastic” set and finally separating “rare cancer“ from the “non-rare cancer“.
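The three sequential binary filters described above can be sketched as a simple cascade (pure Python; the three predicates are hypothetical stand-ins for the trained classifiers discussed in the talk):

```python
def filter_cascade(records, is_human, is_neoplastic, is_rare_cancer):
    """Keep only the records that pass all three binary filters, applied in sequence."""
    survivors = [r for r in records if is_human(r)]          # human vs. non-human
    survivors = [r for r in survivors if is_neoplastic(r)]   # neoplastic vs. non-neoplastic
    return [r for r in survivors if is_rare_cancer(r)]       # rare vs. non-rare cancer
```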
It has to be noticed that at each binary classification step we compare visual vs. textual approach separately and use MeSH terms as the groundtruth labels for the datasets.
For the visual classification tasks, images were separated into two different classes based on MeSH terms. These class annotated images were used to train and evaluate a VGG19 model using pretrained ImageNet weights and fine-tuned with and without image augmentation
Data augmentation: image mirroring and cropping.
Why do we use VGG?
These fine-tuned models were then used to classify unlabeled images into their respective MeSH classes.
Just like images, text is not directly understandable by ML algorithms, so numerical vectors need to be extracted from it.
Count vectors are extracted from the word counts of a document and do not take semantics into account.
Word vectors are multidimensional numerical vectors in which semantically similar words lie close together, and finally paragraph vectors learn to associate words with document labels.
Lets get back to the pipeline for further curating the previously retrieved DLMI dataset.
«human» records were first filtered out from «non-human records» in following way.
Best performing model setup was used to classify the un-annotated DLMI records into “human” and “non-human”.
This was for the annotated text dataset; similarly, the annotated image dataset was classified using the VGG19 setup.
Next, «neoplastic» or tumor-related records were separated from «non-neoplastic» records in similar manner.
Best performing model setup was used to classify the un-annotated records into “neoplasm” and “non-neoplasm”.
This was for the annotated text dataset; similarly, the annotated image dataset was classified using the VGG19 setup.
Finally, we separate the rare cancer dataset from the non-rare cancer dataset.
Unfortunately, there are no MeSH terms pertaining to “rare cancer”, so we used a pre-defined set of rare cancer terms available from NCATS.
All the records recognized as “neoplasm” were retained and labeled “rare cancer” only if a rare cancer term from the NCATS set was present in the title or abstract.
After getting «rare cancer» and the «non-rare cancer» labels for the individual publications from the previous text classification, we used them to train and evaluate a VGG19 model for this binary classification task.
And the fine-tuned VGG19 setup was used to classify the images into «rare cancer» and «non-rare cancer»
For the «human» classification task, textual approach performed far better than visual approach.
However, a recall of 0.71 hints that the visual classification model does learn something about retaining human images.
For the neoplasm classification task too, textual performed better than visual.
Visual approach did not have good results for this task.
For the final task, a recall of 0.77 does hint that VGG19 model did learn something by better retaining the «rare cancer» images, but it has much room for improvement.
Classification: Individual images ≠ full-texts
In conclusion
This is the first study targeting automatic extraction of rare cancer datasets
It compares both the visual and the textual approaches.
One outcome of this work is a large dataset containing about 15,000 rare cancer images.