Automatic extraction of microorganisms and their habitats from free text using text-mining workflows


BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou

Outline of the talk

•  Motivation
•  Experiments
•  Results & inferences
•  Discussion
•  Current work

Motivation

•  In the study of symbiotic relationships, host-microbe interactions play an important role
•  To date, there is no comprehensive database of microbe-habitat relations, but there is an explosion in the number of taxa
•  With this, there is an urgent need for automated host-microbe relation extraction

Experiments: relevant work

•  Identification of named entities such as microorganisms, diseases, genes etc. has received considerable attention from the scientific community at large [Sasaki, Hanisch, Nobata]
•  Researchers have also used ontology based approaches to identify concepts such as public health rumors etc. [BioCaster]

Experiments: our approach

[Workflow diagram. Stages: text processing, named entity recognition, relation mining. Labels: free text articles, pdf; habitats & organisms]

Employ text mining workflows consisting of

•  Text/pdf processor
•  Named entity recognizer to identify microorganisms and their habitats
•  Relation mining component to extract sentences which express this relation
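
The three components listed above can be sketched as a small pipeline. This is only an illustrative skeleton, not the workflow described in the talk: the dictionaries `ORGANISMS` and `HABITATS`, the function names, and the stand-in logic (dictionary lookup for entity recognition, sentence-level co-occurrence for relation mining) are all assumptions made for the example. The talk itself uses CRF-based components for both steps.

```python
import re

# Toy dictionaries: hypothetical entries for illustration only.
ORGANISMS = {"helicobacter pylori", "escherichia coli"}
HABITATS = {"stomach", "gut", "soil"}


def split_sentences(text):
    """Stage 1 (text processing): naive split after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence, dictionary):
    """Stage 2 (stand-in for NER): case-insensitive whole-word dictionary lookup."""
    lowered = sentence.lower()
    return sorted(
        term for term in dictionary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def mine_relations(text):
    """Stage 3 (stand-in for relation mining): keep sentences that mention
    at least one organism and at least one habitat."""
    hits = []
    for sentence in split_sentences(text):
        organisms = find_entities(sentence, ORGANISMS)
        habitats = find_entities(sentence, HABITATS)
        if organisms and habitats:
            hits.append((sentence, organisms, habitats))
    return hits


if __name__ == "__main__":
    sample = "Helicobacter pylori colonizes the human stomach. The genome is circular."
    for sentence, organisms, habitats in mine_relations(sample):
        print(organisms, habitats, "<-", sentence)
```

Keeping the stages as separate functions mirrors the slide: each one can be swapped for a stronger component without touching the others.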

Experiments: our approach

•  The named entity recognizer used a hybrid dictionary and machine learning based approach
   –  It combined dictionary information with a feature set for a conditional random field (CRF) based classifier [Mallet]
   –  The CRFs used a linear chain model and were trained on a corpus consisting of 32 full papers

Experiments: our approach

   –  The feature set included
      •  Lexical information of the word, e.g., word, POS tag etc.
      •  Orthographic information, e.g., any uppercase letters, numbers
      •  Contextual information: information about the two words preceding and succeeding the word
•  For the relation mining component, a linear chain CRF was trained using
   –  Occurrence of organisms and habitats
   –  Contextual information of all the entities in a sentence
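
To make the three feature groups for the named entity recognizer concrete, here is a minimal sketch of a per-token feature function. The function name `token_features`, the feature keys, and the `<PAD>` marker are assumptions made for this example. The slides do not give the exact feature encoding, so this shows only the general shape of lexical, orthographic and two-word context features.

```python
def token_features(tokens, pos_tags, i, window=2):
    """Build a feature dict for token i.

    Lexical: the word itself and its POS tag.
    Orthographic: presence of uppercase letters or digits.
    Contextual: words and POS tags up to `window` positions either side.
    """
    word = tokens[i]
    features = {
        "word": word.lower(),
        "pos": pos_tags[i],
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            features[f"word[{offset:+d}]"] = tokens[j].lower()
            features[f"pos[{offset:+d}]"] = pos_tags[j]
        else:
            # Position falls outside the sentence boundary.
            features[f"word[{offset:+d}]"] = "<PAD>"
    return features


if __name__ == "__main__":
    tokens = ["H.", "pylori", "colonizes", "the", "stomach"]
    pos_tags = ["NNP", "NNP", "VBZ", "DT", "NN"]
    print(token_features(tokens, pos_tags, 1))
```

A sequence of such dicts, one per token, is the kind of input a linear chain CRF toolkit typically consumes. The POS tags here are supplied by hand and would come from a tagger in practice.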

Results and inference

Performance of our named entity recognizer on a 9-fold cross-validation

Class of entities   Precision (%)   Recall (%)   F-score (%) = 2PR/(P+R)
Organisms           84              79           81
Habitats            68              55           61

(improved results from the time of submission)
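
As a quick check of the F-score column, the formula 2PR/(P+R) can be applied to the precision and recall values reported in the table. The helper name `f_score` is just a label chosen for this sketch.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) in percent, taken from the table.
reported = {"Organisms": (84, 79, 81), "Habitats": (68, 55, 61)}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed F = {f_score(p, r):.1f}, reported F = {f}")
```

Rounded to whole percentages, the computed values agree with the reported 81 and 61.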

•  Microorganisms are recognized quite well (F-score 81%)
•  Habitat recognition is modest (F-score 61%)
•  One of the observations is that in free text, the description of habitats/hosts is devoid of any salient features such as uppercase letters, hyphens etc.
•  Instances such as abscess, lung were typical misses

Results and inference

Relation mining results

•  For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
•  Most of the false negatives (sentences which should have been picked up, but were not) were due to noise in the pdf to text conversion
•  Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm

Discussion

•  The workflows we have developed bring together pdf-conversion, machine learning and dictionaries
   –  Performance of individual components obviously has an impact on the overall performance
   –  Pdf conversion is not trivial by any means and this component is the most limiting factor for any sentence-based classification task

Discussion

•  Pdf-to-text sentence examples
   –  These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome. [Clearly, data from a table and figure corrupted the sentence]
   –  airborne pigs [noisy conversion of table discussing airborne diseases in pigs]
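
One simple way to catch sentences like the two examples above before they reach a classifier is a noise heuristic. The sketch below is not part of the workflow presented in the talk: the function name `looks_corrupted`, the regular expressions and both thresholds are assumptions picked for illustration, and a rule this crude will also reject some legitimate sentences.

```python
import re

NUMERIC = re.compile(r"^\d[\d,.]*$")             # e.g. "1", "100,000", "26695"
CAPTION = re.compile(r"\b(Figure|Table)\s+\d+\b")


def looks_corrupted(sentence, max_numeric_ratio=0.15, min_tokens=4):
    """Flag sentences that look like table or figure residue.

    Thresholds are illustrative guesses, not tuned values.
    """
    tokens = sentence.split()
    if len(tokens) < min_tokens:
        return True  # fragments such as "airborne pigs"
    numeric = sum(1 for t in tokens if NUMERIC.match(t))
    if numeric / len(tokens) > max_numeric_ratio:
        return True  # axis ticks or table cells spilled into the text
    # A caption marker in the middle of a sentence suggests a merged figure caption.
    starts_with_caption = sentence.lstrip().startswith(("Figure", "Table"))
    return bool(CAPTION.search(sentence)) and not starts_with_caption


if __name__ == "__main__":
    noisy = ("These mechanisms may have evolved in bacterial pathogens to increase "
             "the frequency of phenotypic variation in genes involved in 1 100,000 "
             "200,000 300,000 1,600,00 Figure 2 Circular representation of the "
             "H. pylori 26695 chromosome.")
    print(looks_corrupted(noisy))
    print(looks_corrupted("airborne pigs"))
    print(looks_corrupted("Helicobacter pylori colonizes the human stomach."))
```

A filter like this trades recall for cleaner input, so it would need checking against real conversion output before use.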

Discussion

•  The CRF model for habitats is evidently weak
   –  There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
   –  Results reflect initial success
•  Relation mining is a hyper-classification task and perhaps it is prone to cascading errors

Current work

•  Work is underway to improve the relation mining component using bag-of-words and character level n-grams to augment the feature space
•  We are also working on less noisy conversion techniques for pdf-to-text
•  We plan to export the workflows to the public domain so that scientists across the spectrum can use them
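
The two feature types named in the first bullet can be sketched as follows. The helper names `char_ngrams` and `bag_of_words`, the `#` boundary marker and the punctuation stripping are assumptions for this example. The slides do not say how these features are encoded in the actual system.

```python
from collections import Counter


def char_ngrams(word, n=3, pad="#"):
    """Character n-grams of a lowercased word, with boundary markers."""
    padded = pad + word.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bag_of_words(sentence):
    """Unordered word counts for a sentence, ignoring case and edge punctuation."""
    tokens = (t.lower().strip(".,;:()") for t in sentence.split())
    return Counter(t for t in tokens if t)


if __name__ == "__main__":
    print(char_ngrams("lung"))
    print(bag_of_words("Abscess in the lung, the abscess."))
```

Character n-grams may help with habitat terms such as "lung" or "abscess", which lack uppercase letters or hyphens, because they expose sub-word patterns rather than relying on orthographic cues.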

Snapshot of relation miner

References

•  Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. Embo Workshop Granada, Spain, 2004.
•  Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
•  Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
•  Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.
