Named Entity Recognition (NER) is one of many techniques in Natural Language Processing. NER techniques can be applied to biomedical data, but doing so raises several problems, which this presentation outlines.
2. Tools and techniques to help researchers cope with the information overload
are therefore needed.
NER tools can be applied to find all kinds of entities, such as gene or protein
names, diseases and drugs, mutations, or properties of protein structures.
The Medline database contained approx. 15 million scientific abstracts, with a
growth rate of about 400,000 articles per year.
Identifying proteins and genes in text is important for uncovering protein
interaction networks.
3. Concepts, meaning and representation
Names in text represent real-life concepts in our mind
The concept denoted by a gene name is usually not clearly defined
No community-wide agreement on how to name a particular gene
Supermarket
Sonic Hedgehog gene in human
p53
2WRU
4. Many genes and proteins have more than one name
• Clones from the mapping phase of the Human Genome Project had up to 15 different names
• FLT4 has four names: PCL; FLT41; LMPH1A; VEGFR3
Inconsistent use of variations of names
• Cbp/p300-interactive transactivator
• CCAAT/enhancer binding protein, C/EBP alpha
Multi-word names
• In the BioCreative corpus of expert-tagged gene names, 53% of all names consist of more than one token
• Human T-cell leukaemia lymphotropic virus type 1 Tax protein
Acronyms are homonyms
• SEC stands for
  • surface epithelial cell
  • size exclusion chromatography
  • selenocysteine
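The naming problems on this slide can be made concrete with a small sketch. It is only an illustration, not a real NER tool: the lexicon is a hypothetical toy built from the slide's own examples, and the function names (build_lookup, normalize, expansions, token_count) are assumptions introduced here. It shows how synonyms collapse onto one symbol, how one acronym maps to several meanings, and how a multi-word name splits into tokens.

```python
# Hypothetical toy lexicon, filled only with the examples from the slide.
SYNONYMS = {
    "FLT4": ["PCL", "FLT41", "LMPH1A", "VEGFR3"],
}
ACRONYMS = {
    "SEC": [
        "surface epithelial cell",
        "size exclusion chromatography",
        "selenocysteine",
    ],
}


def build_lookup(synonyms):
    """Map every listed name (upper-cased) to the symbol it is filed under."""
    lookup = {}
    for symbol, aliases in synonyms.items():
        for name in [symbol, *aliases]:
            lookup[name.upper()] = symbol
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def normalize(name):
    """Return the symbol for a known name, or None if it is not in the lexicon."""
    return LOOKUP.get(name.upper())


def expansions(acronym):
    """Return every listed meaning of an acronym (empty list if unknown)."""
    return ACRONYMS.get(acronym.upper(), [])


def token_count(name):
    """Count whitespace-separated tokens in a name."""
    return len(name.split())


if __name__ == "__main__":
    print(normalize("VEGFR3"))          # synonym resolved to its symbol
    print(expansions("SEC"))            # one acronym, several meanings
    print(token_count("CCAAT/enhancer binding protein"))
```

A plain lookup like this cannot decide which meaning of SEC a given sentence intends; that choice needs the surrounding context, which is part of what makes biomedical NER hard.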
5. Leser, U. and Hakenberg, J. (2005), ‘What makes a gene name? Named entity
recognition in the biomedical literature’, Briefings in Bioinformatics, Vol. 6(4), pp.
357-369.
http://www.bioinformatics.org/textknowledge/acronym.php?textfield=SEC&sub=search
http://www.rcsb.org/pdb/explore/explore.do?structureId=2WRU