8. Text Mining in Biomedicine
Biomedical Scientific Literature
>17M articles from >5K journals
Since the 1950s; about 2,000 added every day
Impossible for humans to manage
Specific (rather peculiar) language
13. Example
"Our results indicate that gp120 from two
different strains of HIV binds to a larger
region of the CD4 protein than previously
described."
15. Example
Binds is almost the same as
− Interacts with
− Frictionates
− Associates with
− Activates
− Colocalizes with
− Cleaves
16. Example
CD4 can be expressed as
CD4+ T T4(CD) CD4+ (T) CD4(+) T cell
CD4, T CD4 (T) CD4(T) CD4 Tcell
T CD4 CD4(+)T CD(4+) T CD4(+) Tcell
CD4(+) T CD4+T CD4 T CD4(+)T cell
CD4 T CD4(+)T CD4+ T cell CD4+Tcell
T4+ (CD) CD4+T CD4, T cell CD4(+)Tcell
T4 (CD) T (CD4) CD4+ Tcell CD4 T cell
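One way to collapse spelling variants like these is to strip punctuation, spacing and the word "cell" before comparing. The sketch below is only an illustration: the function name `normalize_cd4_t`, the canonical label "CD4+ T cell" and the three normalized keys are assumptions, not part of the method described in these slides.

```python
import re

# Keys that the listed variants reduce to once punctuation, spaces and the
# word "cell" are removed (assumption: these three orderings are enough).
_CD4_T_KEYS = {"cd4t", "tcd4", "t4cd"}

def normalize_cd4_t(mention: str):
    """Return a canonical label if the mention is a CD4 T cell variant, else None."""
    key = mention.lower()
    key = re.sub(r"cells?", "", key)      # "Tcell" / "T cell" -> "T"
    key = re.sub(r"[^a-z0-9]", "", key)   # drop spaces, "+", ",", "(" and ")"
    return "CD4+ T cell" if key in _CD4_T_KEYS else None

if __name__ == "__main__":
    for variant in ["CD4+ T", "T4(CD)", "CD4(+) T cell", "CD(4+) T", "T (CD4)"]:
        print(variant, "->", normalize_cd4_t(variant))
```

A lookup like this only covers spellings that reduce to a known key; unseen orderings or abbreviations would still need a dictionary or a learned model.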
17. Even after all this...
The chimpanzee-based CD4(81-92) peptide,
however, which differs from the human peptide
by a single amino acid substitution (E for G) at
position 87, was considerably less potent than
the human CD4(81-92)-based peptide congener
to inhibit HIV-1-induced cell-cell fusion.
18. Contradiction and Contrasts
Author A reports p
Author B reports ¬p
We have p under conditions q
But we have ¬p under conditions q'
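The two cases above can be sketched as a comparison between claims. This is a minimal illustration under one reading of the slide: opposite polarity under the same conditions is treated as a contradiction, opposite polarity under different conditions as a contrast. The `Claim` type and `compare` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    source: str      # who reports it, e.g. "Author A"
    statement: str   # the proposition p
    holds: bool      # True for p, False for not-p
    condition: str   # experimental conditions q (may be empty)

def compare(a: Claim, b: Claim):
    """Label a pair of claims as 'contradiction', 'contrast', or None."""
    if a.statement != b.statement or a.holds == b.holds:
        return None  # different propositions, or the claims agree
    # Opposite polarity: same conditions contradict, different ones contrast.
    return "contradiction" if a.condition == b.condition else "contrast"

if __name__ == "__main__":
    a = Claim("Author A", "A binds B", True, "q")
    b = Claim("Author B", "A binds B", False, "q")
    c = Claim("Author C", "A binds B", False, "q'")
    print(compare(a, b), compare(a, c))
```

In real text the hard part is deciding that two sentences express the same proposition and the same conditions; the sketch assumes that step has already been done.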
19. Negations
Linguistic
− "Protein A does not interact with protein B."
− "We lack evidence that A interacts with B."
Biological
− "Protein A inhibits protein B."
− dephosphorylates / depolymerizes
− downregulates (vs. upregulates)
− etc.
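A rough way to separate the two kinds of negation is a pair of cue lists. The sketch below is an assumption-laden illustration: the cue patterns are small hand-picked samples based on the examples above, and `negation_types` is a hypothetical helper, not the detection method used in this work.

```python
import re

# Linguistic cues: the sentence itself denies or doubts the interaction.
LINGUISTIC = re.compile(r"\b(?:not|no|never|lacks?|fail(?:s|ed)? to)\b", re.IGNORECASE)

# Biological cues: the verb itself describes a negative effect.
BIOLOGICAL = re.compile(
    r"\b(?:inhibit\w*|downregulat\w*|dephosphorylat\w*|depolymeriz\w*)\b",
    re.IGNORECASE,
)

def negation_types(sentence: str):
    """Return the kinds of negation cue found in the sentence."""
    found = []
    if LINGUISTIC.search(sentence):
        found.append("linguistic")
    if BIOLOGICAL.search(sentence):
        found.append("biological")
    return found

if __name__ == "__main__":
    for s in ["Protein A does not interact with protein B.",
              "We lack evidence that A interacts with B.",
              "Protein A inhibits protein B."]:
        print(negation_types(s), s)
```

Cue matching ignores scope, so a sentence such as "A does not inhibit B" would be tagged with both types and would need further analysis.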
21. Framework
HIV-1 and Human Protein-Protein interactions
− Manually over 7 years
− >3000 journal papers
− >5000 tuples
− Gold standard
Other negative reports
− Journal of Negative Results in BioMedicine
Other gold standards
22. Detecting Protein-Protein Interactions
Recognize gene/protein names
− State of the art ~ 87%
Identify gene/protein names
Detect the interaction and its qualities
− 70 "different" interactions in reference DB
23. Protein Name Identification
1500 human proteins
− State of the art ~ 87%
− Available tools ~ 15%
− Our method ~ 35%
20 HIV proteins
− No available tool
− Our method ~ 95%
24. Applications
Contradictions and Negations
Contrast
Other diseases
New HIV-1 literature
25. Achieved so far & plan for future
Reproduce the HIV-1 interactions database
Designed an interaction ontology
Identify patterns of negation, contradiction,
contrast
Use the above data to increase the annotation
accuracy
26. Evaluation
Widely used evaluation measures
− Precision, Recall, F-Score
− Sensitivity and Specificity
Benchmarks and datasets used in challenges
Manually annotated gold standards
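The measures listed above can be computed by comparing extracted tuples against a gold standard. The sketch below uses the standard definitions; the function names and the toy tuples are illustrative assumptions, not data from this work.

```python
def precision_recall_f(predicted: set, gold: set):
    """Precision, recall and balanced F-score of predicted tuples against gold."""
    tp = len(predicted & gold)                        # correctly extracted tuples
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

def sensitivity_specificity(tp: int, fp: int, tn: int, fn: int):
    """Sensitivity (same as recall) and specificity from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

if __name__ == "__main__":
    # Toy tuples with placeholder protein names (hypothetical).
    gold = {("A", "binds", "B"), ("C", "activates", "D"), ("E", "cleaves", "F")}
    predicted = {("A", "binds", "B"), ("E", "cleaves", "F"), ("G", "binds", "B")}
    print(precision_recall_f(predicted, gold))
    print(sensitivity_specificity(tp=2, fp=1, tn=6, fn=1))
```

Specificity needs a count of true negatives, which in turn requires a defined set of candidate pairs; precision and recall only need the predicted and gold tuples.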