1. Text and Data Mining at CCC
Solving the Content Retrieval and Licensing Conundrums for TDM
Dr. Haralambos Marmanis
CTO & VP, Engineering
Copyright Clearance Center
5. What Is Text and Data Mining?
• Automate the extraction of “Entities” from Text
• Find Relationships and Patterns
• Produce hypotheses of interest
• Drive decision making
4/22/2015
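The first two bullets (extract entities, then find relationships) can be illustrated with a minimal sketch. It assumes a tiny hand-made dictionary of terms and treats two entities as related when they appear in the same sentence. The dictionary, the sample sentences, and the function names are all hypothetical; production TDM tools use far richer linguistic analysis.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical dictionary mapping surface terms to entity types.
ENTITY_DICTIONARY = {
    "aspirin": "DRUG",
    "ibuprofen": "DRUG",
    "cox-1": "PROTEIN",
    "inflammation": "DISEASE",
}


def extract_entities(sentence):
    """Return the set of dictionary terms found in a sentence (case-insensitive)."""
    lowered = sentence.lower()
    found = set()
    for term in ENTITY_DICTIONARY:
        # Lookarounds act as word boundaries and also work for terms
        # that contain hyphens or digits, such as "cox-1".
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered):
            found.add(term)
    return found


def cooccurrence_counts(sentences):
    """Count how often each pair of entities appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        entities = sorted(extract_entities(sentence))
        counts.update(combinations(entities, 2))
    return counts


# Made-up sentences, for illustration only.
SENTENCES = [
    "Aspirin inhibits COX-1 and reduces inflammation.",
    "Ibuprofen also reduces inflammation.",
    "Aspirin is widely used against inflammation.",
]

pair_counts = cooccurrence_counts(SENTENCES)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Frequently co-occurring pairs are only candidate relationships; a researcher still has to turn them into hypotheses and test them.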
8. “Drug Discovery” Process
• Goal: Develop new treatments for diseases
through hypothesis formation.
• Methodology:
– Keyword/Database Searching
– Review Literature
– Find relationships
– Develop hypothesis
– Test
– Product development, etc.
9. General Overview of the Process
1. Identify a set of resources that are relevant to a
particular research objective
2. Analyze and extract information specific to the
research objective
3. Develop and explore the various relations between
extracted objects of interest
10. Data Processing Workflow:
Information Retrieval and Knowledge Discovery
[Figure: Software Platforms for TDM; labels: Information Retrieval, Knowledge Discovery]
*Source: http://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining
11. Problem: Too Much Research
• 53M Records in Scopus
• 800,000 Journal Articles published per year
12. More Problems…
• Many sources of content
• Many formats
• Difficult to obtain full-text in XML
• Difficult to integrate content into TDM software
• Hard to negotiate and manage licenses and feeds from
all publishers
13. The DirectPath Solution
• Speed up time to obtain properly licensed content for
text mining
• Discover and download full-text in XML, not just
abstracts
• Main corpus includes Subscribed and Not-Subscribed
content
• Normalize XML format across many publishers
• Provide a Web UI and RESTful API services
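To make "Normalize XML format across many publishers" concrete, here is a minimal sketch that maps two publisher-specific layouts onto one common layout. Both input formats, the element paths, and the target schema are invented for illustration; they are not the actual DirectPath schema or any real publisher's DTD.

```python
import xml.etree.ElementTree as ET

# Two hypothetical publisher formats describing the same kind of record.
PUBLISHER_A = (
    "<article><front><article-title>Kinase inhibitors</article-title></front>"
    "<body><p>First paragraph.</p><p>Second paragraph.</p></body></article>"
)
PUBLISHER_B = (
    "<doc><head><title>Gene expression</title></head>"
    "<text><para>Only paragraph.</para></text></doc>"
)

# Assumed per-publisher element paths for the fields we want to keep.
MAPPINGS = {
    "publisher_a": {"title": "front/article-title", "paragraph": "body/p"},
    "publisher_b": {"title": "head/title", "paragraph": "text/para"},
}


def normalize(xml_text, publisher):
    """Convert one publisher-specific document into the common layout."""
    mapping = MAPPINGS[publisher]
    source = ET.fromstring(xml_text)

    article = ET.Element("article", {"source": publisher})
    title = ET.SubElement(article, "title")
    title.text = source.findtext(mapping["title"], default="")

    body = ET.SubElement(article, "body")
    for para in source.findall(mapping["paragraph"]):
        p = ET.SubElement(body, "p")
        # itertext() flattens any inline markup inside the paragraph.
        p.text = "".join(para.itertext())
    return article


normalized = [
    normalize(PUBLISHER_A, "publisher_a"),
    normalize(PUBLISHER_B, "publisher_b"),
]
for article in normalized:
    print(ET.tostring(article, encoding="unicode"))
```

After normalization, downstream TDM software only needs to understand one layout, regardless of which publisher supplied the article.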
14. [Diagram: content flow from Publishers to Researchers]
1. Publishers provide content and rights (XML)
2. Researchers create content sets by using search or other discovery criteria
3. Researchers slice and dice results and identify an appropriate corpus for their project (XML article corpus)
4. XML corpus can be imported into various TDM tools (TDM Software)
24. Unique Features
• Custom analysis/indexing for each Project
– Custom stop-word lists; synonyms/dictionaries
– Custom analyzers
– The finest granularity at the analysis and indexing level
• Built by design with multilingual support in mind
– Based on Lucene
• Search beyond TF-IDF (e.g. document ranking by citation)
• Retrieval beyond Search (e.g. nearest neighbors)
• Cost and Quality Optimization (roadmap/patent pending)
• Integration with text mining tools like Linguamatics I2E
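Three of the features above (custom stop-word lists, ranking by citation, and nearest-neighbor retrieval) can be sketched in plain Python. This is not the product's Lucene-based implementation. The stop-word list, documents, and citation counts are made up, and multiplying similarity by log(1 + citations) is just one assumed way to bring citations into the ranking.

```python
import math
from collections import Counter

# Hypothetical project-specific stop-word list.
STOP_WORDS = {"the", "of", "in", "a", "and", "for"}

# Made-up documents and citation counts, for illustration only.
DOCS = {
    "d1": "inhibition of the kinase pathway in tumor cells",
    "d2": "kinase inhibitors for tumor growth",
    "d3": "survey of the licensing of journal content",
}
CITATIONS = {"d1": 40, "d2": 5, "d3": 12}


def tokenize(text):
    """Lowercase, split on whitespace, and drop project stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for each document."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    return {
        doc_id: {t: c * math.log(1 + n / df[t]) for t, c in Counter(toks).items()}
        for doc_id, toks in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbors(doc_id, vectors, k=1):
    """Retrieval beyond search: the k documents most similar to doc_id."""
    scored = [
        (cosine(vectors[doc_id], vec), other)
        for other, vec in vectors.items()
        if other != doc_id
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


def search(query, vectors, citations=None):
    """Rank matching documents; optionally weight similarity by citations."""
    query_vec = Counter(tokenize(query))
    scored = []
    for doc_id, vec in vectors.items():
        score = cosine(query_vec, vec)
        if score == 0.0:
            continue  # no shared terms with the query
        if citations is not None:
            score *= math.log(1 + citations[doc_id])  # assumed weighting
        scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


vectors = tfidf_vectors(DOCS)
print(search("kinase tumor", vectors))             # text similarity only
print(search("kinase tumor", vectors, CITATIONS))  # citation-weighted
print(nearest_neighbors("d1", vectors))
```

In this toy data the shorter document d2 ranks first on text similarity alone, while the citation-weighted ranking puts the more heavily cited d1 first.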
25. TDM Product Roadmap
• Augment and Enrich the Inventory
• Workflow Integrations with 3rd Party Support
• Expand and enhance Metadata Normalization
• Introduce Content Metrics for Retrieval
• Cost Optimization
• Information Content Optimization