Presented at the ITOC workshop in Philadelphia, 20 February 2016.
Uses and Benefits for the Social Sciences research community.
By GESIS - Leibniz Institute for the Social Sciences
2. Goal of Text Mining
Implementation of transformational processes that …
- uncover knowledge in unstructured text:
  - salient content items
  - hidden relationships between content items
… to assist researchers and scientific data curators in making sense of the textual data
3. The phases of text mining
[Figure: four stages (Stage 1 to Stage 4) covering Information Retrieval, NLP Analysis, Entity Recognition, Information Extraction, Data Mining, and Knowledge Discovery]
Taken from ICT2015 presentation (N. Manola), @openminted_eu
OPENMINTED - The Open Mining Infrastructure for Text and Data
4. Challenges
- Text Mining (TM) remains a fragmented set of tools
- TM requires particular technological and analytical skills as well as domain knowledge
- no shared knowledge of how to apply TM
- lack of a central infrastructure (may rule out use of TM for small research groups)
- high entry costs: need to share infrastructure costs
5. Putting it all together
OpenMinTeD: establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific and scholarly sources.
6. OpenMinTeD – working on many fronts
- ACCESSIBLE CONTENT: via standardised programmatic interfaces and access rules
- DISCOVERABLE SERVICES: well-documented, easily discoverable text mining services and workflows which process, analyse and annotate text
- EFFICIENT PROCESSING: operate on public e-Infrastructures via standardised APIs
- TDM COMMUNITIES: different scientific communities have different challenges
- VALUE ADDED APPS: community-driven applications to illustrate the value of the infrastructure. Engage with industry.
Taken from ICT2015 presentation (N. Manola)
7. Bridging the gap between different communities
8. The project
Starts: June 2015
Duration: 3 years
16 Partners:
- 6 mining research groups
- 3 content providers
- 1 data center
- 1 library association
- 2 legal experts
- 6 community related partners
- 2 SMEs
PARTNERS
- Athena RIC
- Univ. of Manchester (NaCTeM)
- Univ. of Darmstadt
- INRA
- EMBL-EBI
- Agro-Know
- LIBER
- Univ. of Amsterdam
- Open University UK
- EPFL
- CNIO
- Univ. of Sheffield (GATE)
- GESIS
- GRNET
- Frontiers
- Univ. of Stirling
Taken from ICT2015 presentation (N. Manola)
9. OpenMinTeD users
- TM consumers, to advance their science
- Service providers, to enhance their tools
- TM researchers, to share their algorithms
- Content providers, to enrich their content
10. Infrastructural approach
- OpenMinTeD does not build new services, but adopts and adapts existing services for new communities
- Focuses on interoperability across text mining services and content providers
- Creates an open & collaborative space for researchers to use the best-fitting text mining services available
11. The architecture
[Figure: layered architecture diagram]
- Users: researchers, curators, text-miners and new services developers
- Platform services: Registry, Workflow Management, Auth2 & Policy management, Annotator, Accounting
- Layer 1: Interoperability of text mining services (platforms or components): mining platforms, proprietary architectures
- Layer 2: Interoperability of language resources & corpora: language resources and corpora registry service; language resources; publisher text corpus, OpenAIRE/CORE text corpus, PMC text corpus, other text corpora
- Layer 3: Interoperability to shared storage and computing resources: data centres (including data centre in public cloud)
Taken from ICT2015 presentation (N. Manola)
12. Bottom-up approach
OpenMinTeD works with 4 use cases, which give their requirements and evaluate the results:
- RESEARCH ANALYTICS
- SOCIAL SCIENCES
- AGRICULTURE
- LIFE SCIENCES
Taken from ICT2015 presentation (N. Manola)
19. Social Science Use case
Develop and evaluate methods for automatic detection and linking of named entities in Social Science publications in order to advance reliable and context-sensitive retrieval and linking of relevant entities
20. Enhancing Search in Text and Data
- classical named entity recognition and disambiguation of relevant entities (names, places, organizations, terms) to enhance automatic indexing
- recognition of vague variable mentions to enhance linking of data and publications
- enrich data with context information from text to enhance retrievability of data sets
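The first item above, recognising relevant entities so they can be indexed, can be illustrated with a minimal dictionary (gazetteer) lookup in Python. This is only a sketch and not the method used in the project: the `GAZETTEER` entries, the `annotate` function and the tuple layout are assumptions made for illustration, and real disambiguation would also need to take the surrounding context into account.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> (entity type, canonical name).
GAZETTEER = {
    "gesis": ("organization", "GESIS - Leibniz Institute for the Social Sciences"),
    "issp": ("term", "International Social Survey Programme"),
    "philadelphia": ("place", "Philadelphia"),
}

def annotate(text, gazetteer=GAZETTEER):
    """Return (start, end, surface, type, canonical) tuples for gazetteer hits."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        entry = gazetteer.get(match.group().lower())
        if entry is not None:
            entity_type, canonical = entry
            annotations.append(
                (match.start(), match.end(), match.group(), entity_type, canonical)
            )
    return annotations

if __name__ == "__main__":
    for annotation in annotate("GESIS archives the ISSP data."):
        print(annotation)
```

The character offsets make it possible to store the annotations separately from the text, which is one common way to feed recognised entities into an index.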
21. Identifying references to survey variables
OLGA NEŠPOROVÁ, ZDENĚK R. NEŠPOR (2009). “Religion: An Unsolved Problem for the Modern Czech Nation”
ISSP 2008
Link Database
v39: Believe in life after death
v40: Believe in Heaven
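The linking idea on this slide can be sketched in Python as a simple token-overlap match between a sentence from a publication and the variable labels. This is a rough stand-in under stated assumptions, not the project's method: the `link_variables` function, the 0.75 threshold and the example sentences are hypothetical, and only the two variable labels come from the slide. Exact token overlap does not handle truly vague mentions (for example "belief" versus "believe").

```python
import re

# Variable labels as shown on the slide (ISSP 2008).
VARIABLES = {
    "v39": "Believe in life after death",
    "v40": "Believe in Heaven",
}

def tokens(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def link_variables(sentence, variables=VARIABLES, threshold=0.75):
    """Return (variable_id, score) pairs, best first.

    The score is the share of label tokens that also occur in the sentence.
    """
    sentence_tokens = tokens(sentence)
    hits = []
    for variable_id, label in variables.items():
        label_tokens = tokens(label)
        score = len(label_tokens & sentence_tokens) / len(label_tokens)
        if score >= threshold:
            hits.append((variable_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical sentence, not quoted from the cited article.
    print(link_variables("Few respondents said they believe in life after death."))
    # prints [('v39', 1.0)]
```

Each returned pair could be stored as one entry in a link database connecting the publication to the survey variable.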
22. Benefits from user perspective
- semantic search: understanding the contextual meaning of (search) terms
- fuzzy phrase search: search for attitudes, survey questions in texts (under vagueness)
- link retrieval: search and retrieval of links between text and data
- dataset retrieval: facilitating search for research data in data catalogues at the level of items and variables