Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou
  
Outline of the talk
•  Motivation
•  Experiments
•  Results & inferences
•  Discussion
•  Current work
  
Motivation
•  In the study of symbiotic relationships, host-microbe interactions play an important role
•  To date, there is no comprehensive database of microbe-habitat relations, even though the number of known taxa is exploding
•  This creates an urgent need for automated host-microbe relation extraction
  
Experiments: relevant work
•  Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community at large [Sasaki, Hanisch, Nobata]
•  Researchers have also used ontology-based approaches to identify concepts such as public health rumors [BioCaster]
  
Experiments: our approach
[Workflow diagram: free text articles / pdf -> text processing -> named entity recognition (habitats & organisms) -> relation mining]
Employ text mining workflows consisting of
  •  text/pdf processor
  •  Named entity recognizer to identify microorganisms and their habitats
  •  Relation mining component to extract sentences which express this relation
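
The three components listed above could be chained roughly as follows. This is only a minimal sketch to show the data flow between stages: every function name is hypothetical, and each stage body is a deliberately naive placeholder (regex sentence splitting, substring lookup, co-occurrence filtering), not the components used in the talk.

```python
import re
from typing import Dict, List


def process_text(raw: str) -> List[Dict[str, object]]:
    """Stage 1 (placeholder): split already-extracted text into sentences."""
    parts = re.split(r"(?<=[.!?])\s+", raw.strip())
    return [{"text": p} for p in parts if p]


def recognise_entities(sentences, organisms, habitats):
    """Stage 2 (placeholder): lowercase substring lookup, not a trained model."""
    for s in sentences:
        low = str(s["text"]).lower()
        s["organisms"] = [o for o in organisms if o in low]
        s["habitats"] = [h for h in habitats if h in low]
    return sentences


def mine_relations(sentences):
    """Stage 3 (placeholder): keep sentences that mention both entity types."""
    return [s for s in sentences if s["organisms"] and s["habitats"]]


def run_workflow(raw, organisms, habitats):
    """Chain the three stages in the order given on the slide."""
    return mine_relations(recognise_entities(process_text(raw), organisms, habitats))


# Made-up input text and toy dictionaries, for illustration only.
text = "Helicobacter pylori colonizes the human stomach. Its genome has been sequenced."
for hit in run_workflow(text, ["helicobacter pylori"], ["stomach", "lung"]):
    print(hit["text"], hit["organisms"], hit["habitats"])
```

Keeping each stage as a separate function means a weak component (for example the pdf/text processor) can be swapped out without touching the others.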
  	
  
Experiments: our approach
•  The named entity recognizer used a hybrid approach combining dictionaries and machine learning
   –  It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
   –  The CRFs used a linear chain model and were trained on a corpus consisting of 32 full papers
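
One common way to feed dictionary information into a sequence classifier is to turn dictionary matches into per-token flags. The sketch below assumes a greedy longest-match lookup and B/I/O-style flag names; the slides do not say how the dictionaries were actually encoded, so the function, the flag names and the toy dictionary are all illustrative.

```python
from typing import List, Set, Tuple


def dictionary_flags(tokens: List[str],
                     dictionary: Set[Tuple[str, ...]],
                     max_len: int = 4) -> List[str]:
    """Flag tokens covered by a dictionary entry (greedy longest match).

    Returns one flag per token: B-DICT (start of a match),
    I-DICT (inside a match) or O (no match).
    """
    flags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            flags[i] = "B-DICT"
            for j in range(i + 1, i + matched):
                flags[j] = "I-DICT"
            i += matched
        else:
            i += 1
    return flags


# Toy organism dictionary (hypothetical entries, lowercased token tuples).
organisms = {("helicobacter", "pylori"), ("h.", "pylori"), ("escherichia", "coli")}
tokens = "The H. pylori 26695 chromosome".split()
print(list(zip(tokens, dictionary_flags(tokens, organisms))))
```

The resulting flags can then sit alongside the other token features, so the classifier can learn how much to trust a dictionary hit.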
  
Experiments: our approach
    –  The feature set included
        •  Lexical information about the word, e.g. the word itself, POS tag etc.
        •  Orthographic information, e.g. any uppercase letters, numbers
        •  Contextual information: information about the two words preceding and succeeding the word
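
The three feature groups above can be sketched as a per-token feature function. This is an assumption-laden illustration: the feature names, the "<PAD>" marker for positions outside the sentence, and the hand-supplied POS tags are all hypothetical, and the exact feature templates used in the talk are not given.

```python
from typing import Dict, List


def token_features(tokens: List[str], pos_tags: List[str], i: int,
                   window: int = 2) -> Dict[str, object]:
    """Build lexical, orthographic and contextual features for token i."""
    word = tokens[i]
    feats: Dict[str, object] = {
        # Lexical information
        "word": word.lower(),
        "pos": pos_tags[i],
        # Orthographic information
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
    }
    # Contextual information: two words before and after the current word
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset:+d}]"] = tokens[j].lower()
            feats[f"pos[{offset:+d}]"] = pos_tags[j]
        else:
            feats[f"word[{offset:+d}]"] = "<PAD>"
    return feats


# Made-up sentence; POS tags are written by hand here, not produced by a tagger.
tokens = ["H.", "pylori", "colonizes", "the", "gastric", "mucosa"]
pos_tags = ["NNP", "NNP", "VBZ", "DT", "JJ", "NN"]
print(token_features(tokens, pos_tags, 1))
```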
  	
  

•  For the relation mining component, a linear chain CRF was trained using
    –  Occurrence of organisms and habitats
    –  Contextual information of all the entities in a sentence
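
As a rough illustration of these two kinds of evidence, the sketch below derives sentence-level features from tokens that already carry entity labels. The label scheme (B-ORG, I-ORG, B-HAB, I-HAB, O), the feature names and the one-word context window are assumptions; how the features were presented to the CRF is not described on the slide.

```python
from typing import Dict, List


def sentence_features(tokens: List[str], labels: List[str],
                      window: int = 1) -> Dict[str, object]:
    """Occurrence counts plus the non-entity words surrounding each entity."""
    feats: Dict[str, object] = {
        "n_organisms": sum(1 for lab in labels if lab == "B-ORG"),
        "n_habitats": sum(1 for lab in labels if lab == "B-HAB"),
    }
    feats["has_both"] = feats["n_organisms"] > 0 and feats["n_habitats"] > 0
    for i, label in enumerate(labels):
        if label == "O":
            continue
        kind = label.split("-")[1]  # "ORG" or "HAB"
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if labels[j] == "O":
                feats[f"ctx_{kind}={tokens[j].lower()}"] = True
    return feats


# Made-up example sentence with hand-assigned entity labels.
tokens = ["H.", "pylori", "colonizes", "the", "gastric", "mucosa"]
labels = ["B-ORG", "I-ORG", "O", "O", "B-HAB", "I-HAB"]
print(sentence_features(tokens, labels))
```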
  	
  	
  
Results and inference
Performance of our named entity recognizer under 9-fold cross-validation

            Class of entities    Precision (%)    Recall (%)    F-score (%) = 2PR/(P+R)
            Organisms            84               79            81
            Habitats             68               55            61

            (improved results from the time of submission)

•  Microorganisms have been recognized quite well.
•  Habitat recognition is modest
•  One of the observations is that in free text, the description of habitats/hosts is devoid of any salient features such as uppercase letters, hyphens etc.
•  Instances such as abscess, lung were typical misses
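
The F-scores in the table follow from the reported precision and recall via F = 2PR/(P+R). The short check below recomputes them; only the function and variable names are introduced here, the numbers are those on the slide.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) as reported in the table
reported = {"Organisms": (84, 79), "Habitats": (68, 55)}
for name, (p, r) in reported.items():
    print(f"{name}: F = {f_score(p, r):.1f}%")
```

Rounded to whole percentages, the results agree with the 81 and 61 given in the table.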
  	
  
Results and inference
Relation mining results
•  For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
•  Most of the false negatives (sentences which should have been picked up, but were not) were due to noise in pdf-to-text conversion
•  Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm
  
Discussion
•  The workflows we have developed bring together pdf conversion, machine learning and dictionaries
   –  The performance of individual components obviously has an impact on overall performance
   –  Pdf conversion is not trivial by any means, and this component is the most limiting factor for any sentence-based classification task
  
Discussion
•  Pdf-to-text sentence examples
   –  "These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome." [Clearly, data from a table and figure corrupted the sentence]
   –  "airborne pigs" [noisy conversion of a table discussing airborne diseases in pigs]
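
One simple way to spot sentences like the first example is to look for long runs of consecutive numeric tokens, which rarely occur in running prose but are typical of axis ticks and table cells. This heuristic is not from the talk: it is a hedged illustration, and the threshold of two is an arbitrary assumption.

```python
import re

# A token made only of digits, commas and periods, starting with a digit.
NUMERIC = re.compile(r"\d[\d,.]*")


def longest_numeric_run(sentence: str) -> int:
    """Length of the longest run of consecutive numeric tokens."""
    longest = current = 0
    for token in sentence.split():
        if NUMERIC.fullmatch(token):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_corrupted(sentence: str, max_run: int = 2) -> bool:
    """Flag a sentence whose numeric run exceeds max_run (assumed threshold)."""
    return longest_numeric_run(sentence) > max_run


noisy = ("These mechanisms may have evolved in bacterial pathogens to increase "
         "the frequency of phenotypic variation in genes involved in "
         "1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation "
         "of the H. pylori 26695 chromosome.")
print(longest_numeric_run(noisy), looks_corrupted(noisy))
```

A check like this would not catch the second example ("airborne pigs"), where the damage is missing words rather than stray numbers.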
  
Discussion
•  The CRF model for habitats is evidently weak
    –  There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
    –  Results reflect initial success
•  Relation mining is a hyper-classification task and, because it relies on the output of entity recognition, is perhaps prone to cascading errors
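
Character-level n-grams of the kind mentioned above can be generated as follows. This is a minimal sketch: the "^" and "$" word-boundary markers and the 2-to-3 character range are assumptions, not settings reported in the talk.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """All character n-grams of a lowercased word, with boundary markers."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


# "lung" is one of the habitat terms reported as a typical miss.
print(char_ngrams("lung"))
```

Such sub-word features give the model something to work with for habitat terms that lack uppercase letters, digits or hyphens.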
  
Current work
•  Work is underway to improve the relation mining component using bag-of-words and character-level n-grams to augment the feature space
•  We are also working on less noisy conversion techniques for pdf-to-text
•  We plan to release the workflows to the public domain so that scientists across the spectrum can use them
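
A bag-of-words representation of a sentence, as planned for the relation mining component, simply counts word occurrences and ignores word order. The sketch below assumes a lowercase alphanumeric tokenization; the tokenizer and the example sentence are illustrative only.

```python
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercase alphanumeric tokens, ignoring their order."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


# Made-up sentence for illustration.
bow = bag_of_words("Airborne diseases in pigs and diseases in cattle")
print(bow.most_common(3))
```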
  
Snapshot of relation miner
[Screenshot not preserved]
  




References
•  Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop, Granada, Spain, 2004.
•  Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
•  Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
•  Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.
  	
  

 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 

Automatic extraction of microorganisms and their habitats from free text using text-mining workflows

  • 1. Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
      BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou
  • 2. Outline of the talk
      • Motivation
      • Experiments
      • Results & inferences
      • Discussion
      • Current work
  • 3. Motivation
      • In the study of symbiotic relationships, host-microbe interactions play an important role
      • To date, there is no comprehensive database of microbe-habitat relations, yet the number of known taxa is exploding
      • As a result, there is an urgent need for automated host-microbe relation extraction
  • 4. Experiments: relevant work
      • Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community at large [Sasaki, Hanisch, Chikashi]
      • Researchers have also used ontology-based approaches to identify concepts such as public health rumors [BioCaster]
  • 5. Experiments: our approach
      [Workflow diagram; labels: Free text articles, pdf; Text processing; Named entity recognition; Relation mining; Habitats & organisms]
      Employ text mining workflows consisting of
      • Text/pdf processor
      • Named entity recognizer to identify microorganisms and their habitats
      • Relation mining component to extract sentences which express this relation
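The three stages listed on this slide can be sketched roughly as below. This is only a toy illustration under stated assumptions: the dictionaries, function names and sample terms are hypothetical, plain dictionary lookup stands in for the talk's hybrid dictionary/CRF recognizer, and sentence-level co-occurrence stands in for the CRF-based relation miner.

```python
import re

# Hypothetical toy dictionaries (not from the talk).
ORGANISMS = {"helicobacter pylori", "escherichia coli"}
HABITATS = {"stomach", "gut", "lung"}


def split_sentences(text):
    """Stage 1 stand-in: naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence, dictionary):
    """Stage 2 stand-in: case-insensitive whole-word dictionary lookup."""
    lowered = sentence.lower()
    return sorted(
        term for term in dictionary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def mine_relations(text):
    """Stage 3 stand-in: keep sentences mentioning both an organism and a habitat."""
    results = []
    for sentence in split_sentences(text):
        organisms = find_entities(sentence, ORGANISMS)
        habitats = find_entities(sentence, HABITATS)
        if organisms and habitats:
            results.append((sentence, organisms, habitats))
    return results
```

Calling `mine_relations` on a short passage returns only those sentences that contain at least one term from each dictionary; abbreviated names such as "H. pylori" would need a sturdier sentence splitter than this one.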
  • 6. Experiments: our approach
      • The named entity recognizer used a hybrid dictionary and machine learning based approach
          – It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
          – The CRFs used a linear chain model and were trained on a corpus consisting of 32 full papers
  • 7. Experiments: our approach
          – The feature set included
              • Lexical information about the word, e.g. the word itself, POS tag etc.
              • Orthographic information, e.g. any uppercase letters, numbers
              • Contextual information: information about the two words preceding and succeeding the word
      • For the relation mining component, a linear chain CRF was trained using
          – Occurrence of organisms and habitats
          – Contextual information of all the entities in a sentence
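A minimal sketch of how per-token features of the kind listed on this slide might be assembled for a CRF. The function and feature names are assumptions made for illustration, the POS tag is left out because it would need an external tagger, and the talk's actual feature templates (built for Mallet) are not shown in the slides.

```python
def token_features(tokens, i, window=2):
    """Build a feature dict for tokens[i]: lexical, orthographic and contextual."""
    word = tokens[i]
    feats = {
        "word": word.lower(),                          # lexical
        "has_upper": any(c.isupper() for c in word),   # orthographic
        "has_digit": any(c.isdigit() for c in word),   # orthographic
        "has_hyphen": "-" in word,                     # orthographic
    }
    # Contextual: the two words before and after, padded at sentence edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        key = f"word[{offset:+d}]"
        feats[key] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    return feats


def sentence_features(tokens):
    """One feature dict per token, the usual input shape for a linear chain CRF."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A habitat word such as "lung" gets `has_upper`, `has_digit` and `has_hyphen` all set to `False`, so only the lexical and contextual features can distinguish it from ordinary vocabulary.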
  • 8. Results and inference
      Performance of our named entity recognizer on a 9-fold cross-validation
      (improved results since the time of submission)

      Class of entities    Precision (%)    Recall (%)    F-score (%), 2PR/(P+R)
      Organisms            84               79            81
      Habitats             68               55            61

      • Microorganisms have been recognized quite well
      • Habitat recognition is modest
      • One observation is that in free text, the description of habitats/hosts is devoid of any salient features such as uppercase letters, hyphens etc.
      • Instances such as abscess, lung were typical misses
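As a quick check of the F-score column, the formula 2PR/(P+R) from the table header can be applied to the reported precision and recall. The function name is illustrative only.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) as reported on the slide.
for label, p, r in [("Organisms", 84, 79), ("Habitats", 68, 55)]:
    print(f"{label}: F = {f_score(p, r):.1f}%")
```

Rounded to whole percentages, the two values agree with the 81 and 61 given in the table.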
  • 9. Results and inference
      Relation mining results
      • For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
      • Most of the false negatives (sentences which should have been picked up, but were not) were due to noise in the pdf-to-text conversion
      • Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm
  • 10. Discussion
      • The workflows we have developed bring together pdf conversion, machine learning and dictionaries
          – The performance of individual components obviously has an impact on the overall performance
          – Pdf conversion is not trivial by any means, and this component is the most limiting factor for any sentence-based classification task
  • 11. Discussion
      • Pdf-to-text sentence examples
          "These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome."
          [Clearly, data from a table and figure corrupted the sentence]
          "airborne pigs"
          [Noisy conversion of a table discussing airborne diseases in pigs]
  • 12. Discussion
      • The CRF model for habitats is evidently weak
          – There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
          – Results reflect initial success
      • Relation mining is a hyper-classification task and is perhaps prone to cascading errors
  • 13. Current work
      • Work is underway to improve the relation mining component using bag-of-words and character-level n-grams to augment the feature space
      • We are also working on less noisy conversion techniques for pdf-to-text
      • Export the workflows to the public domain so that scientists across the spectrum can use our workflows
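The two feature types named for the planned improvements could look roughly like this. It is a sketch under assumptions: the function names, the `#` boundary marker and the default n-gram length of 3 are illustrative choices, not details given in the talk.

```python
from collections import Counter


def char_ngrams(word, n=3, pad="#"):
    """Character-level n-grams of a word, with boundary markers at both ends."""
    padded = f"{pad}{word.lower()}{pad}"
    return [padded[k:k + n] for k in range(len(padded) - n + 1)]


def bag_of_words(tokens):
    """Order-free, case-folded token counts for a sentence."""
    return Counter(t.lower() for t in tokens)
```

Character n-grams give a lowercase, unhyphenated word like "lung" some sub-word signal of its own, which is the kind of evidence orthographic features alone cannot supply for habitats.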
  • 14. Snapshot of relation miner
      References
      • Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop, Granada, Spain, 2004.
      • Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
      • Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
      • Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.