Nathan Harmston from Imperial College London gave a presentation at London Biogeeks on Thursday 24 Feb 2011, 6 - 6.30pm, at King’s College London (Rm 1.20, Franklin Wilkins Building, Waterloo Campus, Stamford Street, London, SE1 9NH); see: biogeeks.wordpress.com/2011/02/16/february-tech-meet-24th-kcl/
His presentation was titled "Identifying genes and proteins in text: a short review of available tools and resources".
Abstract below:
The ever-increasing publication rate means that manually extracting information from biological papers is now intractable. This situation has led to a sustained interest in the application of text mining (TM) methods to the biological literature. The first stage in any text-mining pipeline is to recognise named entities in text (a process called Named Entity Recognition or NER). I will discuss the basic concepts behind these methods and provide a basic evaluation of some of the freely available software (standalone and web services).
Identifying genes and proteins in text: a short review of available tools and resources
1. Identifying genes and proteins in text: a short review of available tools and resources
Nathan Harmston
Theoretical Systems Biology
Centre for Bioinformatics
Centre for Integrative Systems Biology at Imperial College London
24/02/2011
Nathan Harmston Review of Gene NER 24/02/2011 1 / 15
2. Deluge/Flood/Tsunami of publications
The literature contains important knowledge generated by researchers;
ideally it is more than just something to promote their careers.
3. Named Entity Recognition
Selection of sup1 and sup2 mutants in the yeast Saccharomyces cerevisiae on
cycloheximide containing media revealed classes of mutants that either are
completely unable to grow on YAPD without cycloheximide or need this drug
under high temperature incubation (30 or 36 degrees C). Some of these mutants
also exhibit the growth dependence on another antibiotic – trichodermin, and, at
the same time, the osmotic dependence. A hypothesis claiming that sup1 and
sup2 mutations cause conformational lability of yeast cytoplasmic ribosomes has
been put forward. It is also proposed that binding of cycloheximide and
trichodermin to the mutant ribosomes cause their conformational shift, which
compensates the functional defects.
Genes have many different names, e.g. { P53, TP53, Hs.1845, TRP53 }
Gene names are subject to morphological (transcription factor,
transcriptional factor), orthographic (NF kappa B, NF kappaB),
combinatorial (homolog of actin, actin homolog) and inflectional
(antibody, antibodies) variation.
Some names overlap with ordinary English words: breathless, Not, That
Deciding whether a term refers to a gene, an RNA or a protein is difficult:
pspA, PspA
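The synonym and orthographic-variation problems above can be sketched in a few lines of Python. This is only an illustration, not something from the talk: the synonym groupings come from the examples on the slide, while the canonical labels and the names `SYNONYMS`, `normalise_name`, `build_index` and `lookup` are assumptions made for the sketch.

```python
import re

# Illustrative synonym table. The groupings follow the slide's examples;
# the canonical labels on the left are chosen here purely for the sketch.
SYNONYMS = {
    "TP53": ["P53", "TP53", "Hs.1845", "TRP53"],
    "NF-kappa B": ["NF kappa B", "NF kappaB"],
}


def normalise_name(name):
    """Collapse orthographic variation: drop spaces and hyphens, fold case."""
    return re.sub(r"[\s\-]+", "", name).lower()


def build_index(synonyms):
    """Map every normalised synonym to its canonical label."""
    index = {}
    for canonical, names in synonyms.items():
        for name in [canonical, *names]:
            index[normalise_name(name)] = canonical
    return index


INDEX = build_index(SYNONYMS)


def lookup(mention, index=INDEX):
    """Return the canonical label for a mention, or None if unknown."""
    return index.get(normalise_name(mention))


print(lookup("p53"))        # TP53
print(lookup("NF-kappaB"))  # NF-kappa B
print(lookup("actin"))      # None
```

Note the trade-off: folding case makes pspA and PspA indistinguishable, and a dictionary containing names like Not or That would then match ordinary English words, which is exactly the ambiguity the slide warns about.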
11. Problems
HUNK is associated with expression of Frizzled 2
HUman Natural Killer
A large piece of something without definite shape
A well-built, sexually attractive man
Hormonally Upregulated Neu-associated Kinase
12. Methods
Dictionary
  BioThesaurus
  fuzzy matching techniques (Levenshtein, Jaro, Jaro-Winkler)
  BLAST
  Whatizit, Reflect.WS
Rule/pattern based matching
  good for things like yeast genes, but rubbish for fruit fly
  ABGENE
Machine learning
  Classification
    Support Vector Machines - NLProt
    Logistic Regression -
  Sequence Labelling
    Conditional Random Fields - ABNER, BANNER, JNET
    Hidden Markov Models - GENIA
Hybrid methods
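As a rough illustration of the dictionary approach with fuzzy matching, the sketch below computes Levenshtein distance and uses it to look a mention up in a small dictionary. The dictionary contents, the `max_distance` threshold and the function names are assumptions for the example; none of the tools listed above is claimed to work exactly this way.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def fuzzy_lookup(mention, dictionary, max_distance=1):
    """Return (distance, name) pairs within max_distance, closest first."""
    m = mention.lower()
    hits = [(levenshtein(m, name.lower()), name) for name in dictionary]
    return sorted((d, n) for d, n in hits if d <= max_distance)


# Hypothetical toy dictionary built from names that appear in the slides.
GENE_NAMES = ["NF kappa B", "TP53", "HUNK"]

print(fuzzy_lookup("NF kappaB", GENE_NAMES))  # [(1, 'NF kappa B')]
```

A small threshold keeps short names such as TP53 from matching unrelated tokens; real dictionaries are far larger, so a linear scan like this would need an index in practice.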
13. Corpus
A corpus is a collection of documents in which NEs have been manually
marked up by a human expert.
serve as a benchmark to compare methods.
serve as development/training sets for methods.
Corpora differ in: Size, Inter-Annotator Agreement (IAA), Scope, Evaluation scheme
Examples: BioCreative I GM, BioCreative II GM, NLPBA, GENIA
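To make "benchmark" and "evaluation scheme" concrete, here is a minimal sketch of one possible scheme: exact-match precision, recall and F1 over (sentence id, start, end) spans. Exact matching is an assumption of this sketch, and the spans in the usage example are made up; individual corpora define their own matching rules.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted spans against gold spans using exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (sentence id, start offset, end offset).
gold_spans = {("S1", 127, 135), ("S1", 22, 53)}
predicted_spans = {("S1", 127, 135), ("S1", 0, 9)}

print(precision_recall_f1(gold_spans, predicted_spans))  # (0.5, 0.5, 0.5)
```

The same function can be applied to the spans produced by two human annotators, which gives a rough F1-style estimate of inter-annotator agreement.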
...
P07642544A0868 Conversely, treatment of human protein-tyrosine phosphatase
alpha-overexpressing cells with phenylarsine oxide led to a loss
of the constitutive NF-kappa B activity.
...
P07642544A0868|127 135| NF-kappa B
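The offsets in the annotation line above make sense if they count characters with whitespace ignored, starting from zero and with an inclusive end; that reading is an assumption here, but it is consistent with the example. The sketch below parses such a line and recovers the mention from the sentence; the function names are made up for the illustration.

```python
SENTENCE = ("Conversely, treatment of human protein-tyrosine phosphatase "
            "alpha-overexpressing cells with phenylarsine oxide led to a loss "
            "of the constitutive NF-kappa B activity.")

ANNOTATION = "P07642544A0868|127 135| NF-kappa B"


def parse_annotation(line):
    """Split 'id|start end|text' into (id, start, end, text)."""
    sentence_id, span, text = line.split("|", 2)
    start, end = (int(x) for x in span.split())
    return sentence_id, start, end, text.strip()


def extract_mention(sentence, start, end):
    """Return the substring whose non-whitespace characters occupy
    offsets start..end (zero-based, inclusive, whitespace not counted)."""
    count = 0
    begin = finish = None
    for i, ch in enumerate(sentence):
        if ch.isspace():
            continue
        if count == start:
            begin = i
        if count == end:
            finish = i
            break
        count += 1
    if begin is None or finish is None:
        raise ValueError("offsets fall outside the sentence")
    return sentence[begin:finish + 1]


sentence_id, start, end, text = parse_annotation(ANNOTATION)
print(extract_mention(SENTENCE, start, end))  # NF-kappa B
```

Ignoring whitespace makes the annotation independent of tokenisation and line wrapping, which is why the wrapped sentence in the corpus file and a single-line copy give the same offsets.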
14. Classification-based approaches
Conversely, treatment of human protein-tyrosine phosphatase alpha-overexpressing
cells with phenylarsine oxide led to a loss of the constitutive NF-kappa B activity.
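$x_i$ = training data
$$
y_i =
\begin{cases}
1, & \text{if } x_i \text{ belongs to class 1} \\
-1, & \text{if } x_i \text{ belongs to class 2}
\end{cases}
$$
Example feature vector: gene after = 0, kappa = 1, constitutive = 1, noun phrase = 1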
surface clues, syntactic properties of NEs, Part of Speech
surrounding words
matches against dictionary
typically binary decision (SVMs only work well for binary problems)
Maximum Entropy, SVM, Naive Bayes
order-independent vector
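A minimal sketch of how one token might be turned into such an order-independent feature vector, using surface clues, surrounding words and a dictionary match. The feature names and the toy dictionary are assumptions for the illustration; a real system would feed vectors like this to a classifier such as an SVM or Maximum Entropy model.

```python
# Hypothetical toy dictionary of known gene/protein tokens (lower-cased).
GENE_DICTIONARY = {"nf-kappa", "tp53", "hunk"}


def token_features(tokens, i, dictionary=GENE_DICTIONARY):
    """Build a sparse feature dict for tokens[i]."""
    token = tokens[i]
    prev_token = tokens[i - 1].lower() if i > 0 else "<s>"
    next_token = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return {
        # Surface clues
        "has_upper": int(any(c.isupper() for c in token)),
        "has_digit": int(any(c.isdigit() for c in token)),
        "has_hyphen": int("-" in token),
        # Match against the dictionary
        "in_dictionary": int(token.lower() in dictionary),
        # Surrounding words
        "prev=" + prev_token: 1,
        "next=" + next_token: 1,
    }


tokens = "the constitutive NF-kappa B activity".split()
print(token_features(tokens, 2))
```

Each token is classified on its own here; the neighbouring words appear only as features, so the decision for one token does not depend on the label assigned to the token before it.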
15. Sequence labelling approaches
Conversely, treatment of human protein-tyrosine phosphatase alpha-overexpressing
cells with phenylarsine oxide led to a loss of the constitutive NF-kappa B activity.
y1            y2        y3  y4
x1            x2        x3  x4
constitutive  NF-kappa  B   activity
consider the complete ordered sequence of tokens in a sentence
predict the most probable sequence of tags for a given sequence of
words in a sentence
using semantic and lexical features
takes order into account
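Predicting the most probable tag sequence can be sketched with Viterbi decoding over B/I/O tags (Begin, Inside, Outside an entity). The scores below are hand-set toy log-scores, not learned parameters, and the single "contains an upper-case letter" feature is an assumption for the example; a trained CRF or HMM would supply these scores instead.

```python
NEG_INF = float("-inf")
TAGS = ("O", "B", "I")

# Toy log-scores, chosen by hand for illustration only.
START = {"O": -0.2, "B": -1.5, "I": NEG_INF}   # a sentence cannot start with I
TRANS = {
    ("O", "O"): -0.2, ("O", "B"): -1.0, ("O", "I"): NEG_INF,  # I must follow B or I
    ("B", "O"): -0.7, ("B", "B"): -3.0, ("B", "I"): -0.5,
    ("I", "O"): -0.5, ("I", "B"): -3.0, ("I", "I"): -0.7,
}


def emission(tag, token):
    """Score a tag for a token using one crude lexical feature."""
    gene_like = any(c.isupper() for c in token)
    if tag == "O":
        return -2.0 if gene_like else -0.1
    return -0.2 if gene_like else -3.0


def viterbi(tokens):
    """Return the highest-scoring tag sequence for the whole sentence."""
    if not tokens:
        return []
    scores = [{t: START[t] + emission(t, tokens[0]) for t in TAGS}]
    backpointers = []
    for token in tokens[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: prev[p] + TRANS[(p, t)])
            cur[t] = prev[best_prev] + TRANS[(best_prev, t)] + emission(t, token)
            ptr[t] = best_prev
        scores.append(cur)
        backpointers.append(ptr)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: scores[-1][t])
    path = [tag]
    for ptr in reversed(backpointers):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


print(viterbi("constitutive NF-kappa B activity".split()))  # ['O', 'B', 'I', 'O']
```

Unlike per-token classification, the transition scores let the tag chosen for B depend on the tag chosen for NF-kappa, so the two tokens are labelled as one entity.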
19. Availability
Most are easily available and released under open source licenses.
Variety of languages (primarily Java and C++)
Most require hacking to get them working
OSCAR3 is a beast
GENIA - very easy to write a SWIG wrapper so you can call it from Python
JNET - few hacks
Web services: Reflect.WS (REST/SOAP), Whatizit (SOAP)
http://pages.cs.wisc.edu/~bsettles/abner/
http://banner.sourceforge.net/
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
http://linnaeus.sourceforge.net/
http://cubic.bioc.columbia.edu/services/nlprot/
http://www.ebi.ac.uk/webservices/whatizit/
http://sourceforge.net/projects/oscar3-chem/
http://julielab.de/
22. Literature based discovery - CRPS
NF-κB
Outcome
NF-κB is involved in CRPS
allows generation of new mechanistic hypotheses
new drug target
Hettne et al. (2007) Applied information retrieval and multidisciplinary research:
new mechanistic hypotheses in Complex Regional Pain Syndrome
26. Finally........
for standalone - BANNER
web services - who knows?
Chemical NER - OSCAR (make sure you use the PubMed models)
Species NER - Linnaeus
So now you have the named entities - you need to map them to canonical
identifiers, a step called gene normalisation (GN).
.... but that's for another talk
What are they doing? PPI extraction - is there a physical interaction
between two genes in an abstract, e.g. binding between Akt2 and APPL
Text mining is noisy and imperfect - but then so is manual curation (IAA)
Text mining is a noisy (and biased) way of extracting information from noisy
(and biased) text which represents the results of noisy (and biased)
experiments carried out by researchers (who are probably noisy and biased).
28. Shameless self-promotion.......
Harmston, N., Filsell, W., and Stumpf, M. P. H. (2010) What the papers
say: text mining for genomics and systems biology. Hum Genomics, 5(1),
17-29
nathan.harmston07@imperial.ac.uk
Questions?