1. Text and Data Mining in Europe: Defining the Challenges and Actions @ The Hague
Infrastructure crossroads
…and the way we walked them in DKPro
Richard Eckart de Castilho
UKP Lab, Technische Universität Darmstadt
2. Presenter: Dr. Richard Eckart de Castilho
• Interoperability WP lead @ OpenMinTeD
• Technical lead @ UKP
• Java developer
• Open-source guy
• NLP software infrastructure researcher
• Apache UIMA developer
• DKPro person
@i_am_rec
https://github.com/reckart
Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt
3. Ubiquitous Knowledge Processing Lab
• Argumentation Mining
• Language Technology for Digital Humanities
• Lexical-Semantic Resources & Algorithms
• Text Mining & Analytics
• Writing Assistance and Language Learning
@UKPLab
http://www.ukp.tu-darmstadt.de
Prof. Dr. Iryna Gurevych, Technische Universität Darmstadt
4. DKPro – reuse, not reinvent
• What?
• Collection of open-source projects related to NLP
• Community of communities
• Interoperability between projects
• Target group: programmers, researchers, application developers
• Why?
• Flexibility and control – liberal licensing and redistributable software
• Sustainability – open community not bound to specific grants
• Replicability – portable software distributed through repositories
• Usability – takes the edge out of installation
• Projects
• DKPro Core – linguistic preprocessing, interoperable third-party tools
• DKPro TC – text classification experimentation suite
• UBY – unified semantic resource
• CSniper – integrated search and annotation
• … https://dkpro.github.io
5. … but why like this?
… how else could it be done?
6. The analytics stack
• Analytics layer
• Analytics tools (tagger, parser, etc.)
• Interoperability layer
• Input/output conversion
• Tool wrappers
• Pivot data model
• Workflow layer
• Workflow descriptions
• Workflow engines
• UI layer
• Workflow editors
• Annotation editors
• Exploration / visualization
A complete solution!
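The "pivot data model" in the interoperability layer is the idea that all wrapped tools read and write one shared annotation representation instead of each other's native formats (in DKPro's case this role is played by the UIMA CAS). A minimal sketch of the idea in Java, with entirely hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pivot data model: a document plus typed, offset-based
// annotations that every tool wrapper converts to and from.
class Annotation {
    final String type;      // e.g. "Token", "Sentence", "POS"
    final int begin, end;   // character offsets into the document text
    final String value;     // e.g. a POS tag, or null
    Annotation(String type, int begin, int end, String value) {
        this.type = type; this.begin = begin; this.end = end; this.value = value;
    }
}

class Document {
    final String text;
    final List<Annotation> annotations = new ArrayList<>();
    Document(String text) { this.text = text; }

    // Select all annotations of a given type, like JCasUtil.select in uimaFIT.
    List<Annotation> select(String type) {
        List<Annotation> out = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a.type.equals(type)) out.add(a);
        }
        return out;
    }
}
```

With such a shared model, a tagger wrapper only needs to know how to turn its tool's output into `Annotation` objects; every downstream component can then consume them without format-specific conversion.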
7. Automatic text analysis
• Pragmatic
• Gain insight about a particular field of interest
• Investigate data
• Use the latest data available
• Results relevant for the moment
• No need for reproducibility
• Principled
• Interest in reproducibility
• Investigate methods
• Use a fixed data set
• Results should be reproducible
8. Manual text analysis
• Pragmatic
• Collaborative analysis
• Get as much done as quickly as possible
• All see/edit the same data/annotations
• No means of measuring quality / no single truth
• Principled
• Training data for supervised machine learning
• Evaluation of automatic methods
• Distributed analysis
• Guideline-driven process
• Multiple independent analyses/annotations
• Inter-annotator agreement as quality indicator
• Human in the loop
• Analytics make suggestions / guide the human
• Human input guides the analytics
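Inter-annotator agreement over multiple independent annotations is commonly quantified with Cohen's kappa, which corrects observed agreement for agreement expected by chance. A self-contained sketch (the class name and example data are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Cohen's kappa for two annotators who each assigned one label per item.
class Kappa {
    static double cohenKappa(String[] a, String[] b) {
        int n = a.length;
        Map<String, Integer> countsA = new HashMap<>();
        Map<String, Integer> countsB = new HashMap<>();
        int agree = 0;
        for (int i = 0; i < n; i++) {
            if (a[i].equals(b[i])) agree++;
            countsA.merge(a[i], 1, Integer::sum);
            countsB.merge(b[i], 1, Integer::sum);
        }
        double po = (double) agree / n;  // observed agreement
        double pe = 0.0;                 // chance agreement
        for (Map.Entry<String, Integer> e : countsA.entrySet()) {
            Integer other = countsB.get(e.getKey());
            if (other != null) {
                pe += (e.getValue() / (double) n) * (other / (double) n);
            }
        }
        return (po - pe) / (1.0 - pe);
    }
}
```

For example, annotators labeling four items as `A,A,B,B` and `A,A,B,A` agree on 3 of 4 items (0.75 observed), with 0.5 chance agreement, giving kappa = 0.5.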
9. Deployment
• Distributed / static
• Service-oriented
• High network traffic
• Running cost
• Risk of decay / limited availability of older versions
• More control to providers
• Localized / dynamic
• Cloud computing
• Reduced cost
• Data locality
• Scalability
• Large freedom in choosing a version
• More control to users
• Gateways
• Make a dynamic setup appear static
• Handle input/output and workflow management
• Walled garden vs. convenience
10. “Openness”
• Open
• Liberal licensing
• Freedom to choose the deployment
• Integrate custom resources/analytics
• Control to the user
• Not open / closed
• Copyleft/proprietary licensing
• Prescribed deployment
• Difficult for the user to customize
• Control to the provider
11. A peek at the landscape
Service-based
• ARGO*
• Pipeline builder, annotation editor
• Online platform accessible through a gateway
• Internally dynamic deployment (AFAIK)
• Closed source
• WebLicht / Alveo / LAPPS
• Pipeline builder
• Online platform accessible through a gateway
• Many services distributed over multiple locations/stakeholders
• Some offer access to non-public content/analytics
• Some are partially open source
Software-based
• DKPro Core* / ClearTK
• Component collection
• Pipeline scripting / programming
• Repository-based
• Easy to deploy/embed anywhere
• Open source
• GATE workbench*
• Pipeline builder, annotation editor, and more
• Desktop application
• GATE Cloud
• Open source
• …
12. DKPro Core – Runnable example

#!/usr/bin/env groovy
// Fetches all required dependencies – no manual installation!
@Grab(group='de.tudarmstadt.ukp.dkpro.core',
      module='de.tudarmstadt.ukp.dkpro.core.opennlp-asl',
      version='1.5.0')
import de.tudarmstadt.ukp.dkpro.core.opennlp.*;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.*;
import static org.apache.uima.fit.util.JCasUtil.*;
import static org.apache.uima.fit.factory.AnalysisEngineFactory.*;

// Input
def jcas = JCasFactory.createJCas();
jcas.documentText = "This is a test";
jcas.documentLanguage = "en";

// Analytics pipeline – language-specific resources are fetched automatically
SimplePipeline.runPipeline(jcas,
    createEngineDescription(OpenNlpSegmenter),
    createEngineDescription(OpenNlpPosTagger),
    createEngineDescription(OpenNlpParser,
        OpenNlpParser.PARAM_WRITE_PENN_TREE, true));

// Output
select(jcas, Token).each { println "${it.coveredText} ${it.pos.posValue}" }
select(jcas, PennTree).each { println it.pennTree }
13. DKPro Core – Runnable example (cont.)
Why is this cool?
• This is an actual running example
• Requires only a JVM + Groovy (+ an internet connection)
• Easy to parallelize / scale
• Trivial to embed in applications
• Trivial to wrap as a service
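To illustrate "trivial to wrap as a service": a pipeline like the one above can sit behind a tiny HTTP endpoint using only the JDK's built-in server. In this sketch, `analyze()` is a stand-in for invoking the actual pipeline; the class name, port, and path are all made up:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class AnalysisService {
    // Stand-in for running the real analysis pipeline on the posted text;
    // here it just reports a whitespace token count.
    static String analyze(String text) {
        return "tokens=" + text.trim().split("\\s+").length;
    }

    public static void main(String[] args) throws Exception {
        // POSTing text to /analyze returns the analysis result as plain text.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/analyze", exchange -> {
            String body = new String(exchange.getRequestBody().readAllBytes(),
                    StandardCharsets.UTF_8);
            byte[] response = analyze(body).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, response.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(response);
            }
        });
        server.start();
    }
}
```

Because the pipeline is ordinary library code rather than a hosted platform, the same logic can equally be embedded in a desktop application or scaled out over a cluster.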
14. Conclusion / Challenges
• Data is growing / analytics are getting more complex
• Need more powerful systems to process it
• Human in the loop
• Human interaction influences analytics, and vice versa
• Need to move data and analytics around
• Often conflicts with the interest in protecting investments
• Need interoperability
• To discover data, resources, and analytics
• To access data and resources
• To deploy analytics
• To retrieve and further use results
15. What comes next?
16. Tomorrow @ The Hague: interoperability
[Figure: interoperability workflow diagram. Data flows from a data source through data conversion into analysis steps (automatic analysis components, nested workflows, human annotation/correction) and back out through conversion to a data sink. Inputs carry an ID/version; results receive a new ID/version, with provenance tracked throughout. Resource and software repositories supply auxiliary data and components via APIs; analysis may also run as a service behind an API. Working groups WG1–WG4 map onto the stages. Deployment targets – desktop/server, cloud resources, cluster – address portability, scalability, and sustainability. Rights and restrictions are aggregated along the way.]
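The ID/version and provenance boxes in the figure can be sketched as a workflow that consumes data under one identifier and emits each result under a new one, recording the link between them. All names and the ID scheme below are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// One provenance entry: which component turned which input into which output.
class ProvenanceRecord {
    final String inputId, component, outputId;
    ProvenanceRecord(String inputId, String component, String outputId) {
        this.inputId = inputId;
        this.component = component;
        this.outputId = outputId;
    }
}

// A workflow that mints a new ID/version for every processing step and
// keeps the provenance chain alongside the data.
class Workflow {
    final List<ProvenanceRecord> provenance = new ArrayList<>();
    private int counter = 0;

    String apply(String inputId, String component) {
        String outputId = inputId + "->" + component + "#v" + (++counter);
        provenance.add(new ProvenanceRecord(inputId, component, outputId));
        return outputId;
    }
}
```

Keeping such a chain is what lets results be traced back to the exact data version and component versions that produced them, which in turn supports the rights-and-restrictions aggregation shown in the figure.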
17. Thanks
18. References
• Alveo: http://alveo.edu.au/
• Argo: http://argo.nactem.ac.uk
• ClearTK: http://cleartk.github.io/cleartk/
• DKPro: https://dkpro.github.io
• GATE: https://gate.ac.uk
• LAPPS: http://www.lappsgrid.org
• UIMA: http://uima.apache.org
• WebLicht: https://weblicht.sfs.uni-tuebingen.de/