Stelios Piperidis talks about new opportunities for improved analytics that arise from Open Science and the broad use of interoperable text and data mining resources, tools and services on homogeneously accessible research literature.
Training title: TDM unlocking a goldmine of information
Training overview:
Text and Data Mining (TDM) is a natural ‘next step’ in open science. It can lead to new and unexpected discoveries and increase the impact of publications and repositories. This workshop showcases examples of successful TDM and infrastructural solutions for researchers. We will also discuss what is needed to make the most of these infrastructures and how publishers and repositories can open up their content.
DAY 2 - PARALLEL SESSION 4 & 5
2. The global research community generates ~2.5 million new scholarly articles
per year (English only)
… one paper published every 12 seconds
… 70,000 papers published on a single protein, the tumor suppressor p53
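The publication rate above can be checked with a short calculation. This is a minimal sketch that assumes a 365-day year and takes the 2.5 million figure from the slide at face value; the function name is only illustrative.

```python
# Rough check of the "one paper every 12 seconds" figure.
# Assumption: a 365-day year (leap years ignored).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000
ARTICLES_PER_YEAR = 2_500_000          # figure quoted on the slide


def seconds_between_articles(articles_per_year: int) -> float:
    """Average number of seconds between two newly published articles."""
    return SECONDS_PER_YEAR / articles_per_year


interval = seconds_between_articles(ARTICLES_PER_YEAR)
print(f"one article every {interval:.1f} seconds")
```

The result is a little over 12 seconds, which matches the rounded figure on the slide.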
3. ● 1.8 billion websites & 3.46 billion internet users, on 25 September 2016.
● 24 million wireless sensors and actuators worldwide (553% up, between
2011 and 2016)
● 16 zettabytes of useful data (16 Trillion GB) by 2020
● YouTube reports that 24 hours of video are uploaded every minute, making the site a
hugely significant data aggregator.
● Every second, on average, around 6,000 tweets are tweeted on Twitter,
which corresponds to over 350,000 tweets sent per minute, >500 million
tweets per day and around 200 billion tweets per year.
● 74,200,000 pages existed on Facebook, with 7 million apps and websites
integrated with Facebook on 30/5/2016
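The tweet volumes quoted above follow from the per-second rate. A minimal sketch of the arithmetic, assuming a constant rate of 6,000 tweets per second and a 365-day year:

```python
# Scale the quoted per-second tweet rate up to larger time windows.
# Assumptions: constant rate, 365-day year.
TWEETS_PER_SECOND = 6_000

per_minute = TWEETS_PER_SECOND * 60  # 360,000
per_day = per_minute * 60 * 24       # 518,400,000
per_year = per_day * 365             # 189,216,000,000

print(f"{per_minute:,} per minute, {per_day:,} per day, {per_year:,} per year")
```

These values are consistent with the rounded figures on the slide (over 350,000 per minute, more than 500 million per day, around 200 billion per year).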
4. ● Process textual sources, organise and classify them along various dimensions, and extract the main (indexical) information items.
● Identify and extract entities and relations between entities, facilitating the transformation of unstructured textual sources into structured data.
● Enable the multidimensional analysis of structured data to extract meaningful insights and improve the ability to predict.
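As an illustration of turning unstructured text into structured data, the following toy sketch extracts entities and one relation type from a sentence. The gazetteer, the relation pattern and the example sentence are assumptions made for this example; real TDM pipelines rely on trained models and curated resources rather than a two-entry dictionary.

```python
import re
from dataclasses import dataclass

# Hypothetical gazetteer: entity surface form -> entity type.
GAZETTEER = {"p53": "PROTEIN", "MDM2": "PROTEIN"}

# Hypothetical relation pattern: "<word> interacts with <word>".
RELATION = re.compile(r"\b(\w+) interacts with (\w+)\b")


@dataclass(frozen=True)
class Entity:
    text: str
    type: str
    start: int  # character offset in the sentence


def extract_entities(sentence: str) -> list:
    """Find gazetteer entries in the sentence, ordered by position."""
    found = []
    for name, etype in GAZETTEER.items():
        for match in re.finditer(rf"\b{re.escape(name)}\b", sentence):
            found.append(Entity(name, etype, match.start()))
    return sorted(found, key=lambda e: e.start)


def extract_relations(sentence: str) -> list:
    """Return structured records for relations between known entities."""
    return [
        {"subject": a, "predicate": "interacts_with", "object": b}
        for a, b in RELATION.findall(sentence)
        if a in GAZETTEER and b in GAZETTEER
    ]


sentence = "The tumor suppressor p53 interacts with MDM2."
entities = extract_entities(sentence)
relations = extract_relations(sentence)
print(entities)
print(relations)
```

The records in `relations` are the kind of structured output that can then be loaded into a database and analysed along several dimensions.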
5. Text Types
Newswire
Scientific Literature
Tweets/blogs
Patents
Clinical/medical records
Textbooks, monographs
Online forums
….
Languages
English
French
German
Spanish
Portuguese
Italian
Polish
….
Tasks
Translation
Information Extraction
Semantic Search
Question Answering
Sentiment Analysis
Summarization
Knowledge Discovery
….
Domains
Finance/Business
Health
Biology
Social Sciences
Humanities
….
8. Establish an open and sustainable Text and Data Mining
(TDM) platform and infrastructure where researchers
can discover, collaboratively create, share and re-use
knowledge from a wide range of text based scientific
and scholarly related sources.
9. OpenMinTeD sets out to create an
open, service-oriented e-Infrastructure
for Text and Data Mining (TDM) of
scientific and scholarly content.
[Diagram labels: Content/Corpora; Services/tools; Annotated corpora; and CORE]
12. Scientific pubs = Research data
[Diagram labels: annotated dataset; derived knowledge; 1st layer; 2nd layer; at the level of licensing conditions; scientific data; processing tools/services]
14. Production-level IaaS cloud
Open source, OpenStack-compliant software stack
Pithos+: object storage
Cyclades: virtual machine management
Builds on GRNET data centers: 3 locations (Athens, Crete, Epirus), 1000+ servers, 16 PB raw storage, x10G interconnects
Member of EGI Federated Cloud
15. • Model for describing content, language/knowledge
resources, tools/services (aka components)
• OMTD-SHARE schema
• Allows search and browse of publications
• Maps local schemata to OMTD-SHARE schema
• Provides access to full text
• In cooperation with respective projects (OpenAIRE, CORE)
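Mapping local schemata to a common schema can be sketched as a per-source field mapping. The repository names and every field name below are hypothetical and are not the actual OMTD-SHARE element names; the sketch only shows the general idea of normalising heterogeneous metadata records.

```python
# Hypothetical field mappings: local field name -> common field name.
# These are NOT the real OMTD-SHARE element names.
FIELD_MAPS = {
    "repo_a": {"dc:title": "title", "dc:creator": "authors", "dc:rights": "licence"},
    "repo_b": {"name": "title", "author_list": "authors", "license": "licence"},
}


def to_common_schema(source: str, record: dict) -> dict:
    """Rename the fields of a local record to the common schema.

    Fields without a mapping are dropped; the source is kept for provenance.
    """
    mapping = FIELD_MAPS[source]
    common = {
        target: record[local]
        for local, target in mapping.items()
        if local in record
    }
    common["source"] = source
    return common


record_a = {"dc:title": "Mining p53 literature", "dc:creator": ["A. Author"]}
record_b = {"name": "Food safety alerts", "license": "CC-BY", "internal_id": 7}

mapped_a = to_common_schema("repo_a", record_a)
mapped_b = to_common_schema("repo_b", record_b)
print(mapped_a)
print(mapped_b)
```

Once records share one set of field names, a single search and browse interface can operate over all sources.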
16. • Provides access to external sources for metadata of TDM-related resources:
  • Maven for tools (e.g. GATE, UIMA, uimaFIT) or machine learning models.
  • LR repositories (e.g. META-SHARE) for metadata of language resources.
  • Docker for dockerized tools and services.
18. • Provides content to OpenMinTeD
• Search on publication sources (OpenAIRE, CORE)
• Builds corpora of publications
• Stores archives of content
• Different storage backends
  • Pithos+
  • Local filesystem
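Supporting different storage backends usually means hiding them behind one interface. The sketch below is an assumption about how such an abstraction might look, not the actual store service code: the class and function names are invented, and the in-memory class merely stands in for an object store without calling any real Pithos+ API.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StoreBackend(ABC):
    """Hypothetical common interface for content storage."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalFilesystemBackend(StoreBackend):
    """Stores each object as a file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryObjectBackend(StoreBackend):
    """Stand-in for an object store; makes no real Pithos+ calls."""

    def __init__(self):
        self.objects = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        return self.objects[key]


def archive_corpus(backend: StoreBackend, corpus_id: str, documents: dict) -> None:
    """Store every document of a corpus under the corpus identifier."""
    for name, text in documents.items():
        backend.put(f"{corpus_id}/{name}", text.encode("utf-8"))


docs = {"paper1.txt": "Text and data mining.", "paper2.txt": "Open science."}

memory = InMemoryObjectBackend()
archive_corpus(memory, "demo-corpus", docs)

with tempfile.TemporaryDirectory() as tmp:
    local = LocalFilesystemBackend(tmp)
    archive_corpus(local, "demo-corpus", docs)
    restored = local.get("demo-corpus/paper1.txt").decode("utf-8")

print(restored)
```

Because `archive_corpus` depends only on the interface, the same corpus-building code can run against either backend.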
20. • Create/modify workflows of TDM components
• Execute workflows with user supplied content
• Provide friendly UI
• Used in biomedical research
• Cooperation with 4 international projects
• Ingest TDM tool descriptions from the registry
• Start/stop/monitor workflows
• Integrate with OMTD Store Service to supply content and store
results
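A workflow of TDM components can be pictured as a chain of processing steps with a status that can be monitored. The following minimal sketch uses invented component and class names and plain Python functions; it does not reflect the API of the actual workflow service.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Component = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    """Toy component: split the text on whitespace."""
    return {**doc, "tokens": str(doc["text"]).split()}


def count_tokens(doc: Document) -> Document:
    """Toy component: record how many tokens were produced."""
    return {**doc, "token_count": len(doc["tokens"])}


class Workflow:
    """Runs components in order and tracks a simple status string."""

    def __init__(self, components: List[Component]):
        self.components = list(components)
        self.status = "created"

    def run(self, doc: Document) -> Document:
        self.status = "running"
        try:
            for component in self.components:
                doc = component(doc)
        except Exception:
            self.status = "failed"
            raise
        self.status = "finished"
        return doc


workflow = Workflow([tokenize, count_tokens])
result = workflow.run({"text": "Text and data mining unlocks knowledge"})
print(workflow.status, result["token_count"])
```

Each component takes a document and returns an enriched copy, so components can be reordered or swapped without changing the workflow runner.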
22. Rather novice users who want to find end-to-end services that meet their needs off the shelf.
Users who understand the basic usage of NLP and TDM services, but not the details. They know how to connect components and which content they must work on to get the required results. They need to develop end-to-end applications.
Users who are agnostic to the internal specifics of TDM, but need to integrate and operate TDM services in their daily workflows.
23. Publishers and repository managers (research libraries).
Language technology experts who use specific technologies and frameworks to develop and enhance their services.
Developers who are not NLP experts and create TDM modules based on off-the-shelf libraries and tools (e.g. Python, Jupyter). They are not familiar with NLP frameworks and terminology but are eager to publish their services.
52. Feature extraction
Data citation
Research analytics
Curation of databases and lexica in Chembolomics & neuroinformatics
Extracting information from tables for food safety alerts
Data citation
From the very beginning…
Requirements, content, barriers, expected outcomes.
… to the very end
Create applications, validate and evaluate the results.