Unstructured Text in Big Data: the Elephant in the Room
David Milward (Linguamatics, UK)
A traditional approach has been to extract structured data from the unstructured text. When it comes to big data, manual or semi-automated methods just don't scale. Even with fully automated methods, it is infeasible to extract all the facts and relationships buried in the text to produce a 'knowledge base of everything' that answers every possible end-user question. In recent years, there has therefore been a trend towards using text mining over unstructured data to directly answer questions, effectively creating novel databases on the fly. However, this has widened the gap between information professionals and end users.
In this talk I will explain how multiple approaches are being adopted to exploit the skills of information professionals to improve decision support for end users. These include specialised real-time querying, alerting based on semantic queries, semantic enrichment of documents, and population of semantic stores or linked data.
3. From Unstructured to Structured Data
Identify concepts and relationships
Structured, relational data
Normalized concept identifiers
Traditional text mining works in batch mode
Unstructured or Semi-Structured Content Sources
Linguamatics Confidential
Requires each extraction to be programmed in, or learnt from a large amount of annotated material
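To make the batch approach concrete, here is a minimal sketch of turning sentences into relational rows of normalized concept identifiers. It is an illustration only, not how I2E or any particular product works: the vocabulary, the identifiers such as `DRUG:0001`, and the helper names `find_concepts` and `extract_rows` are all made up, and "relationship" is approximated by two concepts occurring in the same sentence.

```python
import re
from itertools import combinations

# Hypothetical mini-vocabulary: surface form -> normalized concept identifier.
# The identifiers are invented for this illustration.
VOCAB = {
    "aspirin": "DRUG:0001",
    "acetylsalicylic acid": "DRUG:0001",
    "cox-1": "GENE:0001",
    "headache": "DISEASE:0001",
}


def find_concepts(sentence):
    """Return the sorted list of concept identifiers mentioned in a sentence."""
    lowered = sentence.lower()
    found = set()
    for term, concept_id in VOCAB.items():
        # Require non-word characters on both sides so terms do not match
        # inside longer tokens.
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered):
            found.add(concept_id)
    return sorted(found)


def extract_rows(doc_id, text):
    """Emit one (doc, sentence number, concept, concept) row per co-occurring pair."""
    rows = []
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for number, sentence in enumerate(sentences, start=1):
        for left, right in combinations(find_concepts(sentence), 2):
            rows.append((doc_id, number, left, right))
    return rows


rows = extract_rows(
    "doc1", "Acetylsalicylic acid inhibits COX-1. Aspirin relieves headache."
)
for row in rows:
    print(row)
```

Both synonyms map to the same identifier, which is what lets the rows be loaded into a relational table and joined. Every new kind of extraction needs another vocabulary or rule, which is the scaling problem the slide points to.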
4. I2E: Agile Text Mining
Querying
Results
Information Consumers
5. Increasing Range of Applications
In the beginning...
Protein-Protein Interactions
SAR Extraction
Competitive Intelligence
Target Identification & Prioritization
Safety/Tox
Biomarker Discovery
Systems Biology
Drug Repositioning
Plant Metabolic Pathways
Patent Analysis
Patent FTO
Chemical Search
Extracting Numerical and Experimental Data
In-licensing Opportunities
Key Opinion Leader Identification
Vocabulary Development
Mining Conference Abstracts
Mining FDA Drug Labels
Mining Electronic Medical Records
Sentiment Analysis in Social Media
12. Workflow Automation
• Successful queries converted into regular workflows using Pipeline Pilot or KNIME
• Real-time workflows for up-to-date dashboards, alerts
• Further analysis, visualization, integration with other data
Pipeline Pilot
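As a rough sketch of the alerting idea, a saved query can be re-run on a schedule and only the results not seen in the previous run reported. This is plain Python rather than a Pipeline Pilot or KNIME workflow, and `SavedQueryAlert`, `new_hits` and the `run_query` callable are hypothetical names standing in for whatever actually executes the query.

```python
def new_hits(previous, current):
    """Return results in the current run that were absent from the previous run."""
    seen = set(previous)
    return [hit for hit in current if hit not in seen]


class SavedQueryAlert:
    """Wraps a saved query and reports only results that are new since the last run."""

    def __init__(self, run_query):
        # run_query is an assumed callable returning a list of hashable results.
        self.run_query = run_query
        self.last_results = []

    def refresh(self):
        current = self.run_query()
        alerts = new_hits(self.last_results, current)
        self.last_results = current
        return alerts


# Stand-in for three scheduled runs of the same query.
batches = iter([["doc-A"], ["doc-A", "doc-B"], ["doc-A", "doc-B"]])
alert = SavedQueryAlert(lambda: next(batches))

first = alert.refresh()
second = alert.refresh()
third = alert.refresh()
print(first, second, third)
```

In a real workflow the scheduler, the query execution and the delivery of alerts (email, dashboard) would be handled by the workflow tool; the sketch covers only the comparison step.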
14. Embedding I2E within Web Apps
10/21/2013
15. Form-Based Querying
• Create web interfaces connecting to I2E server
• Information professionals develop sophisticated queries
• Queries are parameterized to allow end users to customize
• JavaScript allows portability to mobile devices such as iPads
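The split between query author and end user can be sketched as a template with a small set of exposed parameters. The query syntax in the template string is invented for illustration and is not I2E's query language; `ParameterizedQuery` and its fields are likewise hypothetical.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class ParameterizedQuery:
    """A query written by an information professional, with slots for end users."""

    template: str                                  # query text with $placeholders
    defaults: dict = field(default_factory=dict)   # values used when none are given
    allowed: dict = field(default_factory=dict)    # parameter -> permitted values

    def render(self, **params):
        values = {**self.defaults, **params}
        for name, value in values.items():
            choices = self.allowed.get(name)
            if choices is not None and value not in choices:
                raise ValueError(f"{value!r} is not allowed for {name!r}")
        # substitute() raises KeyError if a placeholder has no value.
        return Template(self.template).substitute(values)


# Made-up query syntax; only the parameter handling matters here.
query = ParameterizedQuery(
    template="gene:$gene AND disease:$disease IN $source",
    defaults={"source": "MEDLINE"},
    allowed={"source": {"MEDLINE", "Patents"}},
)

rendered = query.render(gene="BRAF", disease="melanoma")
print(rendered)
```

A web form would collect `gene`, `disease` and `source` from the end user and pass them to `render`; restricting `source` to a fixed set keeps the expert-written query valid whatever the user selects.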
19. Dictionary Matching - Companies
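A minimal sketch of dictionary matching, assuming a small hand-made list of company name variants. The dictionary contents and the helper names `build_matcher` and `match_companies` are illustrative only; a production dictionary would be far larger and curated.

```python
import re

# Hypothetical dictionary: preferred company name -> known name variants.
COMPANY_SYNONYMS = {
    "Pfizer": ["Pfizer", "Pfizer Inc"],
    "GlaxoSmithKline": ["GlaxoSmithKline", "GSK", "Glaxo Smith Kline"],
}


def build_matcher(synonyms):
    """Compile one regex over all variants and a lookup back to the preferred name."""
    lookup = {}
    for preferred, variants in synonyms.items():
        for variant in variants:
            lookup[variant.lower()] = preferred
    # Longest variants first so "Pfizer Inc" is preferred over "Pfizer".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def match_companies(text, pattern, lookup):
    """Return (matched text, preferred name) pairs in order of appearance."""
    return [(m.group(0), lookup[m.group(0).lower()]) for m in pattern.finditer(text)]


pattern, lookup = build_matcher(COMPANY_SYNONYMS)
matches = match_companies("GSK and Pfizer Inc announced a collaboration.", pattern, lookup)
print(matches)
```

The dictionary finds only names it already contains, in whatever variants were listed, and maps each variant to one preferred form.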
20. Pattern Matching - Institutions
• Not a fixed list: can find previously unknown institutions
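The difference from a dictionary can be sketched with a pattern that describes what institution names look like rather than listing them. The regular expression below is a deliberately simple assumption (capitalized words around a few head nouns), not the pattern used in the talk.

```python
import re

# One or more capitalized words, e.g. "Cambridge" or "New South Wales".
NAME = r"[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*"

INSTITUTION_PATTERN = re.compile(
    rf"\b(?:(?:University|Institute|College)\s+of\s+{NAME}"      # "University of X"
    rf"|{NAME}\s+(?:University|Institute|Hospital|College))\b"   # "X Institute"
)


def find_institutions(text):
    """Return institution-like names found in the text, in order of appearance."""
    return [m.group(0) for m in INSTITUTION_PATTERN.finditer(text)]


found = find_institutions(
    "Researchers at the University of Cambridge and the Karolinska Institute collaborated."
)
print(found)
```

Neither name had to be listed in advance. The cost is over-generation: a capitalized word directly before a real name (for example at the start of a sentence) would be swallowed into the match, so real patterns need more linguistic context than this sketch uses.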
21. Pattern Matching - Mutations
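Point mutations are a natural fit for patterns because the notation is regular: wild-type residue, position, mutant residue. The sketch below assumes only the common one-letter and three-letter amino acid forms with an optional `p.` prefix; it is an illustration, not the pattern set used in the talk.

```python
import re

AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

# Wild-type residue, 1-based position, mutant residue, e.g. V600E or p.Val600Glu.
MUTATION_PATTERN = re.compile(
    rf"\b(?:p\.)?(?P<wild>{AA3}|[{AA1}])(?P<pos>[1-9]\d*)(?P<mutant>{AA3}|[{AA1}])\b"
)


def find_mutations(text):
    """Return (wild-type, position, mutant) tuples for each mutation-like token."""
    return [
        (m.group("wild"), int(m.group("pos")), m.group("mutant"))
        for m in MUTATION_PATTERN.finditer(text)
    ]


mutations = find_mutations("Samples carried V600E (p.Val600Glu) and T790M.")
print(mutations)
```

A pattern this loose also matches unrelated tokens of the same shape (for example a product code such as `A1C`), so in practice it would be combined with context, such as a nearby gene name.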
22. Pattern Matching - Chemicals
• Assign structure to novel chemicals
• Distinguish exemplified compounds
• Distinguish compounds given properties
23. Whatever the Content...
• ... I2E can mine and extract with precision
Scientific literature
Twitter
24. ... and Wherever
Internal Content
MEDLINE
Patents
Drug Labels
Clinical Trials
Content Provider
Publisher
I2E OnDemand
I2E Enterprise
I2E Platform
External Content
I2E Platform
• I2E 4.1 Linked Servers
• Users can query local information or information on the cloud
• Proprietary content can reside within Enterprise
• Standard sources, e.g. patents, can be hosted on the cloud and shared by all users
• Data from content providers can reside on their sites
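The linked-server idea can be sketched as one query fanned out to several independent sources, with each result tagged by where it came from. This shows the general pattern only and makes no claim about how I2E 4.1 implements it: `federated_search`, `make_source` and the in-memory document sets are hypothetical stand-ins for real servers.

```python
from concurrent.futures import ThreadPoolExecutor


def make_source(docs):
    """Build a stand-in search function over an in-memory {doc_id: text} mapping."""
    def search(query):
        return [(doc_id, text) for doc_id, text in docs.items()
                if query.lower() in text.lower()]
    return search


def federated_search(sources, query):
    """Send the query to every source in parallel and tag results with their origin."""
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(search, query) for name, search in sources.items()}
        # Collect in the order the sources were declared, so output is deterministic.
        for name, future in futures.items():
            for doc_id, snippet in future.result():
                results.append({"source": name, "doc_id": doc_id, "snippet": snippet})
    return results


sources = {
    "enterprise": make_source({"int-1": "Internal report on kinase inhibitors"}),
    "cloud": make_source({
        "pat-1": "Patent claiming kinase inhibitors",
        "pat-2": "Patent on packaging",
    }),
}

hits = federated_search(sources, "kinase")
for hit in hits:
    print(hit["source"], hit["doc_id"])
```

The documents themselves never leave their source: only the query goes out and only the matching results come back, which is what allows proprietary content to stay inside the enterprise while shared sources live elsewhere.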