Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
1. Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou
2. Outline of the talk
• Motivation
• Experiments
• Results & inferences
• Discussion
• Current work
3. Motivation
• In the study of symbiotic relationships, host-microbe interactions play an important role
• To date, there is no comprehensive database of microbe-habitat relations, even though the number of known taxa is exploding
• As a result, there is an urgent need for automated host-microbe relation extraction
4. Experiments: relevant work
• Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community at large [Sasaki, Hanisch, Chikashi]
• Researchers have also used ontology-based approaches to identify concepts such as public health rumors [Biocaster]
5. Experiments: our approach
[Diagram: free-text articles and pdf; text processing; named entity recognition (habitats & organisms); relation mining]
Employ text mining workflows consisting of
• Text/pdf processor
• Named entity recognizer to identify microorganisms and their habitats
• Relation mining component to extract sentences which express this relation
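The three workflow stages can be illustrated with a deliberately simplified sketch. It is not the system described in the talk: plain dictionary lookup stands in for the named entity recognizer, and sentence-level co-occurrence stands in for the relation mining component. The dictionaries and the function names (`split_sentences`, `find_mentions`, `candidate_relations`) are invented for illustration.

```python
import re

# Toy dictionaries, invented for illustration only.
ORGANISMS = {"helicobacter pylori", "escherichia coli"}
HABITATS = {"stomach", "gut", "lung"}


def split_sentences(text):
    """Naive text processor: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(sentence, dictionary):
    """Stand-in for the entity recognizer: whole-word dictionary lookup."""
    lowered = sentence.lower()
    return sorted(
        term for term in dictionary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def candidate_relations(text):
    """Stand-in for relation mining: keep sentences naming both entity types."""
    results = []
    for sentence in split_sentences(text):
        organisms = find_mentions(sentence, ORGANISMS)
        habitats = find_mentions(sentence, HABITATS)
        if organisms and habitats:
            results.append((sentence, organisms, habitats))
    return results
```

Calling `candidate_relations` on a short passage returns only the sentences that mention at least one organism and one habitat. The naive splitter would break on abbreviations such as "H. pylori", which is one reason real text needs a more careful text processor.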
6. Experiments: our approach
• The named entity recognizer used a hybrid dictionary and machine learning based approach
– It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
– The CRFs used a linear chain model and were trained on a corpus consisting of 32 full papers
7. Experiments: our approach
– The feature set included
• Lexical information about the word, e.g. the word itself, POS tag etc.
• Orthographic information, e.g. any uppercase letters, numbers
• Contextual information: information about the two words preceding and succeeding the word
• For the relation mining component, a linear chain CRF was trained using
– Occurrence of organisms and habitats
– Contextual information of all the entities in a sentence
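A minimal sketch of how such a per-token feature set might be assembled before being passed to a CRF toolkit. The function name `token_features`, the feature keys and the `<PAD>` marker are assumptions made for illustration; the slides do not specify the actual feature encoding.

```python
def token_features(tokens, pos_tags, i, window=2):
    """Build a feature dictionary for the token at position i.

    Covers the three feature groups named on the slide: lexical
    (word, POS tag), orthographic (uppercase letters, numbers) and
    contextual (the two words before and after). Key names are
    hypothetical.
    """
    word = tokens[i]
    feats = {
        "word": word.lower(),
        "pos": pos_tags[i],
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        key = f"word[{offset:+d}]"
        if 0 <= j < len(tokens):
            feats[key] = tokens[j].lower()
            feats[f"pos[{offset:+d}]"] = pos_tags[j]
        else:
            # Position falls outside the sentence.
            feats[key] = "<PAD>"
    return feats
```

For a tokenized sentence, `token_features(tokens, pos_tags, i)` returns one dictionary per token; a sequence of such dictionaries is the usual input shape for linear chain CRF implementations.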
8. Results and inference
Performance of our named entity recognizer on a 9-fold cross-validation

Class of entities    Precision (%)    Recall (%)    F-score (%) = 2PR/(P+R)
Organisms            84               79            81
Habitats             68               55            61

(Improved results from the time of submission)
• Microorganisms have been recognized quite well
• Habitat recognition is modest
• One observation is that in free text, the description of habitats/hosts is devoid of any salient features such as uppercase letters, hyphens etc.
• Instances such as abscess, lung were typical misses
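The F-score column follows directly from the precision and recall columns via 2PR/(P+R). A short sketch of that arithmetic (the function name `f_score` is chosen here for illustration):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the table above, in percent.
organisms = f_score(84, 79)   # about 81.4
habitats = f_score(68, 55)    # about 60.8
```

Rounded to whole percentages, these give the 81 and 61 reported in the table.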
9. Results and inference
Relation mining results
• For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
• Most of the false negatives (sentences which should have been picked up, but were not) were due to noise in the pdf-to-text conversion
• Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm
10. Discussion
• The workflows we have developed bring together pdf conversion, machine learning and dictionaries
– The performance of individual components obviously has an impact on overall performance
– Pdf conversion is not trivial by any means, and this component is the most limiting factor for any sentence-based classification task
11. Discussion
• Pdf-to-text sentence examples
These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome.
[Clearly, data from a table and figure corrupted the sentence]
airborne pigs
[Noisy conversion of a table discussing airborne diseases in pigs]
12. Discussion
• The CRF model for habitats is evidently weak
– There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
– Results reflect initial success
• Relation mining is a hyper-classification task and is perhaps prone to cascading errors
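Character-level n-grams can capture sub-word cues in habitat terms that lack orthographic signals such as capitalization. A minimal sketch, assuming `^` and `$` as word-boundary markers and n-gram sizes of 2 and 3 (neither choice is stated in the slides; `char_ngrams` is a hypothetical helper):

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Return all character n-grams of the lowercased word.

    The word is padded with '^' and '$' so that prefixes and
    suffixes produce distinct n-grams.
    """
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]
```

For example, `char_ngrams("lung", 3, 3)` yields the trigrams `^lu`, `lun`, `ung` and `ng$`, each of which could be added as a binary feature for the token.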
13. Current work
• Work is underway to improve the relation mining component using bag-of-words and character-level n-grams to augment the feature space
• We are also working on less noisy conversion techniques for pdf-to-text
• Export the workflows to the public domain so that scientists across the spectrum can use them
14. Snapshot of relation miner
References
• Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop, Granada, Spain, 2004.
• Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
• Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
• Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.