OpenMinTeD aims to build an open text and data mining infrastructure for the research community. It will provide accessible scientific and scholarly content, discoverable text mining services, and efficient processing capabilities. This will help different research communities extract meaningful insights from textual sources and improve predictive abilities. The infrastructure seeks to establish standards and address legal issues to enable linking text mining tools, services, and resources to create new workflows that advance science.
OpenMinTeD Project - building a TDM infrastructure
1. OpenMinTeD
Building an Open
Text and Data Mining
Infrastructure
• Stelios Piperidis
• spip@ilsp.gr
• Institute for Language & Speech Processing
• Athena Research & Innovation Centre
2. ● > 1.08 billion websites and 3.46 billion internet users, on 25 September 2016.
● > 24 million wireless sensors and actuators worldwide (553% up, between 2011 and
2016).
● > 16 zettabytes of useful data (16 Trillion GB) by 2020.
● YouTube claims that 24 hours of video are uploaded every minute, making the site a hugely
significant data aggregator.
● “Every second, on average, around 6,000 tweets are tweeted on Twitter, which
corresponds to over 350,000 tweets sent per minute, >500 million tweets per day
and around 200 billion tweets per year”.
● 74,200,000 pages existed on Facebook, with 7 million apps and websites integrated
with Facebook on 30/5/2016.
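As a quick sanity check on the quoted figures, the per-second tweet rate and the zettabyte conversion can be worked out directly. This is a minimal sketch: only the 6,000 tweets per second and the 16 zettabytes come from the slide; the function name `scale_rate`, the 365-day year and the decimal unit definitions are assumptions made for illustration.

```python
# Scale the quoted average of 6,000 tweets per second to longer periods.
TWEETS_PER_SECOND = 6_000  # figure quoted on the slide


def scale_rate(per_second: int) -> dict:
    """Return per-minute, per-day and per-year totals for a per-second rate."""
    per_minute = per_second * 60
    per_day = per_minute * 60 * 24
    per_year = per_day * 365  # assumption: a 365-day year
    return {"minute": per_minute, "day": per_day, "year": per_year}


rates = scale_rate(TWEETS_PER_SECOND)
print(rates["minute"])  # 360000
print(rates["day"])     # 518400000
print(rates["year"])    # 189216000000

# 16 zettabytes in gigabytes, assuming decimal units
# (1 ZB = 10**21 bytes, 1 GB = 10**9 bytes).
zettabytes_in_gb = 16 * 10**21 // 10**9
print(zettabytes_in_gb)  # 16000000000000, i.e. 16 trillion GB
```

The results (360,000 per minute, about 518 million per day, about 189 billion per year) are consistent with the rounded figures in the quotation.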
3. The global research community generates over 1.5 million new scholarly
articles per annum.
The STM report (2009)
… some 90% of papers … are never cited.
… 50% of papers are never read by anyone other than their authors,
referees and journal editors
… one paper published every 30 seconds
… 70,000 papers published on a single protein, the tumor suppressor p53
The STM report (2009)
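The two publication rates on this slide can be compared with a little arithmetic. This is a rough sketch, assuming a 365-day year; the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year

# "One paper published every 30 seconds" expressed as papers per year.
papers_per_year_at_30s = SECONDS_PER_YEAR // 30

# "Over 1.5 million articles per annum" expressed as seconds per paper.
seconds_per_paper_at_1_5m = SECONDS_PER_YEAR / 1_500_000

print(papers_per_year_at_30s)               # 1051200
print(round(seconds_per_paper_at_1_5m, 1))  # 21.0
```

Both figures are of the same order of magnitude: one paper every 30 seconds is roughly 1.05 million papers a year, while 1.5 million a year is roughly one paper every 21 seconds.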
4. process textual sources, organise and classify in various dimensions, extract
main (indexical) information items
identify and extract entities and relations between entities, facilitate the
transformation of unstructured textual sources into structured data
enable the multidimensional analysis of structured data to extract meaningful
insights and improve the ability to predict
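The step from unstructured text to structured data can be illustrated with a toy example. This is only a sketch under stated assumptions: the lexicon `ENTITIES`, the pattern `PATTERN`, the `Relation` type and the sample sentence are all hypothetical, and a real text mining service would rely on trained models and curated resources rather than a hand-written dictionary and a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str


# Hypothetical toy lexicon mapping entity mentions to types.
ENTITIES = {"p53": "PROTEIN", "MDM2": "PROTEIN", "apoptosis": "PROCESS"}

# Hypothetical pattern: "<word> <verb> <word>" for a few relation verbs.
PATTERN = re.compile(r"(\w+) (inhibits|induces|binds) (\w+)")


def extract_entities(text):
    """Return (mention, type) pairs for every known entity in the text."""
    return [(tok, ENTITIES[tok]) for tok in re.findall(r"\w+", text) if tok in ENTITIES]


def extract_relations(text):
    """Return relations whose subject and object are both known entities."""
    relations = []
    for subj, pred, obj in PATTERN.findall(text):
        if subj in ENTITIES and obj in ENTITIES:
            relations.append(Relation(subj, pred, obj))
    return relations


text = "MDM2 inhibits p53. In stressed cells p53 induces apoptosis."
entities = extract_entities(text)
relations = extract_relations(text)
print(entities)
print(relations)
```

The output is a list of typed entity mentions and a list of `Relation` records, which is the kind of structured data that can then be stored, queried and analysed along several dimensions.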
5. Text Types
Newswire
Scientific Literature
Tweets/blogs
Patents
Clinical/medical records
Textbooks, monographs
Online forums
….
Languages
English
French
German
Spanish
Portuguese
Italian
Polish
Tasks
Translation
Information Extraction
Semantic Search
Question Answering
Sentiment Analysis
Summarization
Knowledge Discovery
Domains
Finance/Business
Health
Biology
Social Sciences
Humanities
….
7. Establish an open and sustainable Text and Data
Mining (TDM) platform and infrastructure where
researchers can collaboratively create, discover, share
and re-use knowledge from a wide range of text-based
scientific and scholarly sources.
10. ACCESSIBLE CONTENT
Via standardised programmatic interfaces and access rules
DISCOVERABLE SERVICES
Well-documented, easily discoverable text mining services
and workflows which process, analyse and annotate text
EFFICIENT PROCESSING
Operate on public e-Infrastructures via standardised APIs
RESEARCH COMMUNITIES
Different scientific communities have different challenges
VALUE ADDED APPS
Community-driven applications to illustrate the value of the
infrastructure. Engage with industry.
11. From the very beginning…
Requirements, content, barriers, expected outcomes.
… to the very end
Create applications, validate and evaluate the results.
12. • Document literature content, language/knowledge resources, data categories taxonomies,
provenance information
• Document language processing/text mining services and workflows
• Generic and domain-specific metadata descriptions
• Combine services into workflows
• Combine content and language resources with services and workflows
• Combine automatic and manual/crowdsourcing annotation services
• Study IPR restrictions for reuse of sources as well as possible exceptions
• Promote clarity and standardisation of legal rights and obligations
• Translate the legal & policy aspects into specifications for lawful user-to-service and
service-to-service interactions
13. • documenting, depositing, managing, publishing and sharing scientific content and
data, text and data mining software tools, services and workflows, language and
knowledge resources
• enabling, both technically and legally, the linking and pipelining of text mining
tools, services and workflows, as well as language and knowledge resources
• automatic analysis, annotation and extraction of important information out of
scientific content
• composing, scheduling and orchestrating new processing workflows by combining
existing text mining services and language/knowledge resources
• services for advising on lawful use and combination of content, language resources
and text mining services
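The idea of composing new workflows from existing services can be sketched as simple function composition. This is a hypothetical illustration, not the OpenMinTeD platform's interface: the three services (`tokenise`, `count_tokens`, `tag_capitalised`), the `compose` helper and the dictionary-based document format are all assumptions.

```python
from functools import reduce
from typing import Callable, Dict

Document = Dict[str, object]
Service = Callable[[Document], Document]


def tokenise(doc: Document) -> Document:
    """Hypothetical service: split the text on whitespace."""
    return {**doc, "tokens": str(doc["text"]).split()}


def count_tokens(doc: Document) -> Document:
    """Hypothetical service: record the number of tokens."""
    return {**doc, "n_tokens": len(doc["tokens"])}


def tag_capitalised(doc: Document) -> Document:
    """Hypothetical service: collect tokens that start with a capital letter."""
    return {**doc, "capitalised": [t for t in doc["tokens"] if t[:1].isupper()]}


def compose(*services: Service) -> Service:
    """Chain services so that each one receives the previous one's output."""
    def workflow(doc: Document) -> Document:
        return reduce(lambda d, service: service(d), services, doc)
    return workflow


workflow = compose(tokenise, count_tokens, tag_capitalised)
result = workflow({"text": "OpenMinTeD builds an open TDM infrastructure"})
print(result["n_tokens"])     # 6
print(result["capitalised"])  # ['OpenMinTeD', 'TDM']
```

Because every service takes and returns the same document shape, services can be reordered or swapped without changing the others, which is the property that shared interfaces and metadata descriptions are meant to provide at infrastructure scale.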
14. 1. End users
- Researchers, database curators, …
- Novice: use services to advance their science
- Advanced: combine TDM services into complex workflows
2. Content and service providers
- Publishers, libraries, scientific database centres, …
- TDM researchers
- SMEs