
Scalable Text Mining

Good dictionaries are key to text mining. We present an idea for a platform on which users can build their own dictionaries and text-mining pipelines.



  1. 1. Scalable Text Mining
     Jee-Hyub Kim
     Text-Mining Pipeline Builder
     Literature Services Team
     2 Feb 2016
  2. 2. A Text-Mining Pipeline Text
  3. 3. Contents
     ● Text-Mining Pipeline Crisis
     ● Session 1: Build Your Own Pipeline
     ● Session 2: Build Your Own Dictionary
     ● Wrap Up
  4. 4. A long list of requests

     | Use case | Semantic type | Dictionary type | Document type | Section | Metadata | Delivery method |
     | --- | --- | --- | --- | --- | --- | --- |
     | OpenAIRE | accession numbers | pattern (e.g., [0-9][A-Za-z0-9]{3}) | patents | Title, Claim, Description, Abstract, Figure, Table | Pubyear, IPCR | summary table |
     | ERC | grant identifiers | pattern | articles | Acknowledgements | | search index |
     | CTTV | gene, disease | term (e.g., IBD) | articles, abstracts | | | json |
     | ELIXIR-EXCELERATE | resource names | term | articles | | | summary table |
     | 1000 Genomes | cell line names | pattern | articles | !Acknowledgements | | REST API |
     | Wikipedia | accession numbers | pattern | wikipages | | | summary table |
     | KEW Garden | species names (multilingual) | term | articles | | | summary table |
     | ChEMBL | resource name | term | articles | | Author, Journal | summary table |
     | Ensembl | genomic range | pattern | articles | | | summary table |
  5. 5. Scalable Text Mining
     ● For the last few years, we have been having a pipeline crisis!
     ● A long list of requests and our slow responses
       ○ Makes you unhappy.
     ● Even worse, it’s a long tail!
       ○ No two requests use the same pipeline.
       ○ Every time, we have to build a new pipeline.
       ○ We need a new approach to solve this crisis.
  6. 6. Objective
     ● We want to build a LEGO-like platform that helps you to build your own text-mining pipeline and your own dictionary.
  7. 7. A Key Block: Dictionary-Based Tagger
     ● Role: to identify names (e.g., proteins, species, accession numbers)
     ● Dictionary-based approach for mining names
       ○ Simple
       ○ Readable
       ○ Interactive
     ● Building a dictionary is a VERY iterative process
       ○ 20% of the effort goes into building an initial dictionary and the rest into refining it.
     ● Good dictionaries are key to text-mining success stories.
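A dictionary-based tagger of the kind described on this slide can be sketched in a few lines of Python. This is only an illustration, not the team's implementation: the `TERMS` and `PATTERNS` structures and the `tag` function are hypothetical names, and the entries are borrowed from examples that appear on later slides.

```python
import re

# Hypothetical toy dictionary: surface term -> entry ID.
TERMS = {"basal cell": "EFO_0001997", "Nutlin-3": "ChEBI:46742"}
# Pattern entries, keyed by an assumed database label.
PATTERNS = {"1000genomes": re.compile(r"\bHG[0-9]{5}\b")}


def tag(text):
    """Return (start, end, matched_text, entry_id) tuples sorted by position."""
    hits = []
    for term, entry_id in TERMS.items():
        # Escape the term so characters such as '-' are matched literally.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            hits.append((m.start(), m.end(), m.group(), entry_id))
    for db, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group(), db))
    return sorted(hits)
```

Because the dictionary is plain data, refining it (the iterative part) means editing `TERMS` and `PATTERNS` and re-running `tag`, without touching the matching code.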
  8. 8. Agile Revision Process
  9. 9. Session 1: Build Your Own Pipeline
     As …, I want a pipeline to do ...
  10. 10. Pipeline Stories
      ● CTTV
        ○ As a researcher, I want to find articles with supporting evidence from drug discovery.
      ● ERC
        ○ As a funder, I want to make funded articles more searchable.
      ● ELIXIR-EXCELERATE
        ○ As a resource manager, I want to know the impact of resources.
  11. 11. Second, Find & Describe Blocks You Need

      | When you want | You can use |
      | --- | --- |
      | to extract a sentence | Sentence splitter |
      | to limit your mining to an article section | Section tagger |
      | to identify disease names; to identify database identifiers | Dictionary-based tagger |
      | to find relations between genes and diseases | Relation extractor |
      | to get some analytics | Summary table generator |
      | to get article metadata | Europe PMC REST API |
      | to produce text-mined data in RDF | RDF generator |
  12. 12. Then, Build a Pipeline using Blocks
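One way to picture "building a pipeline using blocks" is as function composition, where each block consumes the previous block's output. The sketch below is an assumption about how such blocks could fit together, not the platform's design: the function names, the naive sentence splitter, and the two dictionary terms are all hypothetical.

```python
import re
from collections import Counter
from functools import reduce


def sentence_splitter(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def dictionary_tagger(sentences, terms=("IBD", "protein")):
    """Pair each sentence with the (hypothetical) dictionary terms it mentions."""
    return [
        (s, [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", s)])
        for s in sentences
    ]


def summary_table(tagged):
    """Count in how many sentences each term was tagged."""
    return Counter(t for _, terms in tagged for t in terms)


def build_pipeline(*blocks):
    """Chain blocks so that each one consumes the previous block's output."""
    return lambda data: reduce(lambda acc, block: block(acc), blocks, data)


pipeline = build_pipeline(sentence_splitter, dictionary_tagger, summary_table)
```

Calling `pipeline(article_text)` runs the three blocks in order. Swapping `summary_table` for another delivery block would change the output without touching the earlier stages.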
  13. 13. Session 2: Build Your Own Dictionary
      Designing filtering rules
  14. 14. How to Revise a Dictionary?
      ● We want to build an expressive language for filtering.
      ● Global filtering rules
        ○ Term length > 2
        ○ Case sensitive
      ● Per-entry filtering rules
        ○ A term should be tagged when it is mentioned in the Methods section.
        ○ A pattern should be tagged when it follows the term “omim”.
      ● Blacklist: e.g., stop words
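The global rules and the blacklist could be applied to each candidate match roughly as follows. This is a minimal sketch under stated assumptions: `passes_global_rules`, its parameters, and the stop words in `BLACKLIST` are hypothetical, and "length > 2" is read here as "at least three characters".

```python
BLACKLIST = {"the", "and", "was"}  # hypothetical stop words


def passes_global_rules(term, matched, min_length=3, case_sensitive=True):
    """Decide whether a candidate match survives the global filters.

    term is the dictionary entry; matched is the string found in the text.
    """
    if len(matched) < min_length:       # global rule: term length > 2
        return False
    if matched.lower() in BLACKLIST:    # blacklist, e.g. stop words
        return False
    if case_sensitive:                  # global rule: case sensitive
        return matched == term
    return matched.lower() == term.lower()
```

For example, with case sensitivity on, the entry `IBD` accepts `IBD` in the text but rejects `ibd`.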
  15. 15. Per-Entry Rules
      ● A spreadsheet per entry
      ● Definitions
        ○ Context: should (not) be after a term.
        ○ Section: should (not) be mentioned in a section.
        ○ URI: check if http://www.ebi.ac.uk/efo/EFO_0001997 exists

      Entry information: Term/Pattern, Entry, ID, DB. Filtering rules: Context, Section, URI.

      | Term/Pattern | Entry | ID | DB | Context | Section | URI |
      | --- | --- | --- | --- | --- | --- | --- |
      | Pattern | HG[0-9]{5} | | 1000 genomes | !(grant|fund) | !ACK | |
      | Term | basal cell | EFO_0001997 | efo | | Methods | Yes |
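The two example rows can be turned into a small rule evaluator. The sketch below rests on several assumptions that the slide does not confirm: the field names are invented, "context" is read as the word immediately before the match, `ACK` is taken to be a section label, a leading `!` is treated as negation, and the URI check is left out because it would need a network call.

```python
import re

# Per-entry rules transcribed from the two example rows (field names assumed).
ENTRIES = [
    {"pattern": r"HG[0-9]{5}", "db": "1000 genomes",
     "context": "!(grant|fund)", "section": "!ACK"},
    {"pattern": r"basal cell", "db": "efo", "id": "EFO_0001997",
     "context": None, "section": "Methods"},
]


def rule_allows(rule, test):
    """Evaluate one rule; a leading '!' negates it, None means no constraint."""
    if rule is None:
        return True
    negated = rule.startswith("!")
    outcome = test(rule[1:] if negated else rule)
    return outcome != negated


def tag_section(section, text):
    """Return (matched_text, db) pairs that satisfy every per-entry rule."""
    hits = []
    for entry in ENTRIES:
        for m in re.finditer(r"\b" + entry["pattern"] + r"\b", text):
            words_before = text[:m.start()].split()
            previous = words_before[-1] if words_before else ""
            context_ok = rule_allows(
                entry["context"],
                lambda rx: re.fullmatch(rx, previous, re.IGNORECASE) is not None,
            )
            section_ok = rule_allows(entry["section"], lambda name: name == section)
            if context_ok and section_ok:
                hits.append((m.group(), entry["db"]))
    return hits
```

With these rules, `HG00096` is tagged as a cell line in a Methods sentence but skipped when it directly follows "grant" or appears in the `ACK` section, where it is more likely a grant number.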
  16. 16. Analytics
      ● Summary table

      | PMCID | Term | ID | Frequency |
      | --- | --- | --- | --- |
      | PMCID4698870 | Nutlin-3 | ChEBI:46742 | 16 |
      | PMCID4698870 | cell cycle arrests | GO:0007050 | 6 |

      ● Top 100 frequent terms

      | Top | Name | Document Freq. | Collection Freq. |
      | --- | --- | --- | --- |
      | 1 | protein | 678,987 | 1,823,783 |
      | 2 | water | 563,234 | 1,233,332 |
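Both analytics tables can be derived from the tagger output with simple counting. In this sketch, document frequency is the number of documents mentioning a term and collection frequency is the total number of mentions; the `frequency_tables` function and its input shape (document ID mapped to a list of tagged terms) are assumptions made for illustration.

```python
from collections import Counter


def frequency_tables(tagged_docs, top_n=100):
    """Build the two analytics tables from tagged documents.

    tagged_docs maps a document ID to the list of terms tagged in it.
    Returns (summary_rows, top_terms): summary_rows holds one
    (doc_id, term, frequency) row per document/term pair; top_terms holds
    (term, document_freq, collection_freq) ordered by document frequency.
    """
    doc_freq, coll_freq, summary_rows = Counter(), Counter(), []
    for doc_id, terms in tagged_docs.items():
        counts = Counter(terms)
        summary_rows.extend((doc_id, t, n) for t, n in counts.items())
        doc_freq.update(counts.keys())  # one count per document
        coll_freq.update(counts)        # every mention counts
    top_terms = [(t, df, coll_freq[t]) for t, df in doc_freq.most_common(top_n)]
    return summary_rows, top_terms
```

A term such as "protein" that is mentioned many times in a few documents will show a collection frequency well above its document frequency, which is one signal for spotting overly general dictionary entries.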
  17. 17. Spreadsheet for Filtering Rules http://tinyurl.com/zlwbx2y
  18. 18. Wrap Up
      ● What is your pipeline story?
      ● Have you managed to create your own dictionary?
      ● What service blocks are missing?
      ● What should be the interfaces?
      ● How should we deliver?