The Perfect Swell:
Workshop on Text and Data Mining
for Data Driven Innovation
The research infrastructure perspective
Dieter Van Uytvanck
Max Planck Institute for Psycholinguistics
Dieter.VanUytvanck@mpi.nl
TDM workshop, London
2013-09-27
CLARIN?
§  Common Language Resources and Technology
Infrastructure
§  aims at providing easy and sustainable access for scholars
in the humanities and social sciences
§  to digital language data (in written, spoken, video or
multimodal form)
§  to advanced tools to discover, explore, exploit, annotate,
analyse or combine them
§  independent of where they are located: a shared
distributed infrastructure
§  More information: www.clarin.eu
TDM workshop
London
2013-09-27
www.clarin.eu
Language resources: rich variety
§  Modality: written, spoken, signed
§  Additional channels: eye movements, gestures, neuroimaging
data (EEG, fMRI, …), etc.
[Figure labels: "Annotations"; "Data: the basis for research"]
Language resources: rich variety
§  Location:
§  data from all over the world (including
some very remote corners)
§  … and from the world wide web,
smartphones, …
§  Time:
§  old historic collections (hieroglyphs,
manuscripts, rock carvings, …), often
OCR’ed, digitised and annotated
§  up to real-time data gathered from
social networks
§  Origin:
§  elicited (experiments)
§  natural language use (“in the wild”)
Data mining in CLARIN
§  very important paradigm in language resource processing
§  major shift from rule-based to data-driven systems
§  not only text, also multimedia
§  importance of
§  access to primary data for fellow researchers: TDM requires access to
whole works, not only to snippets and sentences
§  replicating experiments is extremely important
§  technical support: virtual collections make it possible to refer to large online
data sets
§  a safe legal setting for researchers (license signing does not scale
to 500,000 texts that are automatically collected from thousands
of websites)
Data mining in CLARIN
§  some examples to demonstrate the variation and nature of
data mining based on language resources
Some examples (1)
§  Mass text analysis (Petersen et al., 2012):
doi:10.1038/srep00313
Some examples (2)
§  AUVIS face/hand tracking analysis: http://tla.mpi.nl/projects_info/auvis/
Head/Hands Tracking
Some examples (3)
§  Stylometry and plagiarism detection
http://www.clips.ua.ac.be/category/projects/stylometry
§  e.g. Mike Kestemont, http://www.mike-kestemont.org/?p=362
Some examples (4)
§  Language evolution analysis with phylogenetic trees (Bouckaert
et al., 2012) – doi:10.1126/science.1219669
At the other extreme, we fit a “sailor” model with no reluctance to move into water and rapid movement across water. Consistent with the findings based on the RRW model, each of the landscape-based models supports the Anatolian farming theory of Indo-European origin (Table 1).

Our results strongly support an Anatolian homeland for the Indo-European language family. The inferred location (Fig. 1) and timing [95% highest posterior density (HPD) interval, 7116 to 10,410 years ago] of Indo-European origin is congruent with the proposal that the family began to diverge with the spread of agriculture from …

Fig. 2. Map and maximum clade credibility tree showing the diversification of the major Indo-European subfamilies. The tree shows the timing of the emergence of the major branches and their subsequent diversification. The inferred location at the root of each subfamily is shown on the map, colored to match the corresponding branches on the tree. Albanian, Armenian, and Greek subfamilies are shown separately for clarity (inset). Contours represent the 95% (largest), 75%, and 50% HPD regions, based on kernel density estimates (15).

Phylogeographic analysis | Bayes factor: Anatolian vs. steppe I | Bayes factor: Anatolian vs. steppe II
RRW: All languages | 175.0 | 159.3
RRW: Ancient languages only | 1404.2 | 1582.6
RRW: Contemporary languages only | 12.0 | 11.4
Landscape aware: Diffusion | 298.2 | 141.9
Landscape aware: Migration from land into water less likely than from land to land by a factor of 10 | 197.7 | 92.3
Landscape aware: Migration from land into water less likely than from land to land by a factor of 100 | 337.3 | 161.0
Landscape aware: Sailor | 236.0 | 111.7
The research infrastructure role
§  Data sets:
§  Long-term preservation (archiving)
§  Making them citable (persistent identifiers) and findable
(metadata)
§  Making access easier with federated login
§  Lowering the threshold to use advanced software
§  offer web front-ends, web service chains
§  cooperation with computing centres for heavy tasks
§  Know-how building & support
§  about the nature of the resources and tools
§  technical matters
§  legal issues
Legal perspective on resources
§  Rough classification of language resources
available via the CLARIN centres:
§  Public
§  full access, no restrictions at all
§  e.g. parallel corpora from the EU Parliament
§  Academic
§  available for all academic users
§  e.g. Spoken Dutch Corpus (radio recordings, …)
§  Restricted
§  everything more restricted than Academic:
personalised access rules
§  e.g. video from doctor-patient interaction
Legal perspective on resources
§  CLARIN recommends CC licenses for new resources as
this is the least problematic for all in the long run. Such
resources can be made publicly available.
§  For older material, we try to distribute it as freely as
can be negotiated. For this material we offer two categories:
§  resources free for researchers
§  resources requiring individual permission by the owner.
§  Note that not everything is about copyright.
§  We also have to deal with personal data, which can only be
provided for a limited time to individual researchers unless
it is anonymised.
§  Ethical perspectives should also be taken into account (e.g.
asking participants at the time of recording whether they are
OK with data mining/processing).
Technical Perspective (1)
§  The above restrictions can be realized by requiring:
§  PUB - no identification of the user and no individual
permission, i.e. the resources are free for all and publicly
available.
§  ACA - identification of the user, but no individual
permission, e.g. CLARIN-distributed resources for academic
use.
§  RES - identification of the user and individual usage
permission, i.e. the resources are available only to
individual researchers, e.g. resources containing personal
data.
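The three classes above amount to a small decision rule, which can be sketched roughly as follows. This is only an illustration, not CLARIN's implementation: the `User` record, the `is_academic` flag and the `permitted_user_ids` set are hypothetical names introduced here, and treating ACA as "identified academic user" is an assumption drawn from the slide's example of resources for academic use.

```python
from dataclasses import dataclass
from enum import Enum


class AccessClass(Enum):
    PUB = "public"
    ACA = "academic"
    RES = "restricted"


@dataclass(frozen=True)
class User:
    # Hypothetical user record; an anonymous visitor is represented by None.
    user_id: str
    is_academic: bool


def may_access(access_class, user=None, permitted_user_ids=frozenset()):
    """Return True if the user may access a resource of the given class."""
    if access_class is AccessClass.PUB:
        # PUB: no identification and no individual permission required.
        return True
    if user is None:
        # ACA and RES both require an identified user.
        return False
    if access_class is AccessClass.ACA:
        # ACA: identification only (assumed here to mean an academic user).
        return user.is_academic
    # RES: identification plus an individual usage permission.
    return user.user_id in permitted_user_ids
```

For example, `may_access(AccessClass.RES, User("alice", True))` is false until `"alice"` is added to the permitted set for that resource, which mirrors the individual permission step.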
Technical Perspective (2)
§  Federated Identity Management (“Shibboleth”)
§  allows access to resources at a remote server
§  with institutional credentials
§  makes it relatively straightforward to recognize academic
users and grant them access to restricted resources
§  details: http://clarin.eu/node/3788
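As a rough sketch of how a service might recognize academic users after a federated login, the function below inspects the attributes released by the user's home institution. It assumes the institution releases the `eduPersonScopedAffiliation` attribute from the eduPerson schema; the `attributes` mapping and the set of affiliation values counted as academic are assumptions made for this example, not CLARIN policy.

```python
# Affiliation values from the eduPerson vocabulary that this sketch treats as
# "academic"; the selection is an assumption, not CLARIN policy.
ACADEMIC_AFFILIATIONS = {"faculty", "student", "staff", "member", "employee"}


def is_academic_user(attributes):
    """Decide from federated login attributes whether the user is academic.

    `attributes` is a hypothetical mapping of attribute names to lists of
    values, as a service provider might receive after a federated login.
    """
    for scoped in attributes.get("eduPersonScopedAffiliation", []):
        # Scoped values have the form "affiliation@scope".
        affiliation, _, scope = scoped.partition("@")
        if scope and affiliation.lower() in ACADEMIC_AFFILIATIONS:
            return True
    return False
```

A user whose institution releases `staff@example.org` would be recognized as academic, while `alum@example.org` or a missing attribute would not.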
Future perspective for legal
exception framework
§  As we in CLARIN are capable of
§  identifying researchers and
§  protecting the resources from other users,
§  CLARIN already has all the technical prerequisites needed
for implementing and supervising a broad research
exception in the EU such as the one already in effect in the
Netherlands.
Conclusion
§  Data mining plays an increasingly important role in
(language resource-based) research
§  Research infrastructures try to help academics make
efficient use of the existing resources and tools
§  Many technical issues have already been addressed
(e.g. authentication of researchers)
§  We hope the remaining legal (copyright) issues can be
addressed by a research exception (or similarly by a concept
of fair use)
Acknowledgement
§  Thanks to Krister Lindén and Erik Ketzan from the
CLARIN legal issues committee for their valuable
input!
§  Thank you for your attention!
TDM workshop
London
2013-09-27
www.clarin.eu

Contenu connexe

Similaire à The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureRoss Mounce
 
Disciplinary RDM
Disciplinary RDMDisciplinary RDM
Disciplinary RDMSarah Jones
 
VREs and Research Tools - supporting collaborative research
VREs and Research Tools - supporting collaborative researchVREs and Research Tools - supporting collaborative research
VREs and Research Tools - supporting collaborative researchChristopher Brown
 
Research data spring: a consortial approach to RDM within SaS
Research data spring: a consortial approach to RDM within SaSResearch data spring: a consortial approach to RDM within SaS
Research data spring: a consortial approach to RDM within SaSJisc RDM
 
Sarah Jones RDM from a disciplinary perspective
Sarah Jones RDM from a disciplinary perspectiveSarah Jones RDM from a disciplinary perspective
Sarah Jones RDM from a disciplinary perspectiveJisc
 
Open science and its advocacy
Open science and its advocacyOpen science and its advocacy
Open science and its advocacySarah Jones
 
Scholarship in a connected world: New ways to know, new ways to show
Scholarship in a connected world: New ways to know, new ways to showScholarship in a connected world: New ways to know, new ways to show
Scholarship in a connected world: New ways to know, new ways to showDerek Keats
 
ICL09 - iClould Paper 'A fish called Guido'
ICL09 - iClould Paper 'A fish called Guido'ICL09 - iClould Paper 'A fish called Guido'
ICL09 - iClould Paper 'A fish called Guido'Leo Gaggl
 
New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...Derek Keats
 
Agile resources on the open web …. a global digital library
Agile resources on the open web …. a global digital libraryAgile resources on the open web …. a global digital library
Agile resources on the open web …. a global digital libraryJisc
 
Benefits and practice of open science
Benefits and practice of open scienceBenefits and practice of open science
Benefits and practice of open scienceSarah Jones
 
Research Data Management: a gentle introduction
Research Data Management: a gentle introductionResearch Data Management: a gentle introduction
Research Data Management: a gentle introductionMartin Donnelly
 
Managing and sharing data
Managing and sharing dataManaging and sharing data
Managing and sharing dataSarah Jones
 
Clipper jisc rdn cambridge 2016
Clipper jisc rdn cambridge 2016Clipper jisc rdn cambridge 2016
Clipper jisc rdn cambridge 2016John Casey
 
Clipper, research data network
Clipper, research data networkClipper, research data network
Clipper, research data networkJisc RDM
 
Semantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesSemantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesMathieu d'Aquin
 
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...CIGScotland
 

Similaire à The research infrastructure perspective, Dieter Van Uytvanck, CLARIN (20)

Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
Disciplinary RDM
Disciplinary RDMDisciplinary RDM
Disciplinary RDM
 
VREs and Research Tools - supporting collaborative research
VREs and Research Tools - supporting collaborative researchVREs and Research Tools - supporting collaborative research
VREs and Research Tools - supporting collaborative research
 
Research data spring: a consortial approach to RDM within SaS
Research data spring: a consortial approach to RDM within SaSResearch data spring: a consortial approach to RDM within SaS
Research data spring: a consortial approach to RDM within SaS
 
Sarah Jones RDM from a disciplinary perspective
Sarah Jones RDM from a disciplinary perspectiveSarah Jones RDM from a disciplinary perspective
Sarah Jones RDM from a disciplinary perspective
 
Open Science
Open ScienceOpen Science
Open Science
 
Open science and its advocacy
Open science and its advocacyOpen science and its advocacy
Open science and its advocacy
 
ld4dh demo lecture
ld4dh demo lectureld4dh demo lecture
ld4dh demo lecture
 
Scholarship in a connected world: New ways to know, new ways to show
Scholarship in a connected world: New ways to know, new ways to showScholarship in a connected world: New ways to know, new ways to show
Scholarship in a connected world: New ways to know, new ways to show
 
ICL09 - iClould Paper 'A fish called Guido'
ICL09 - iClould Paper 'A fish called Guido'ICL09 - iClould Paper 'A fish called Guido'
ICL09 - iClould Paper 'A fish called Guido'
 
New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...
 
Agile resources on the open web …. a global digital library
Agile resources on the open web …. a global digital libraryAgile resources on the open web …. a global digital library
Agile resources on the open web …. a global digital library
 
Benefits and practice of open science
Benefits and practice of open scienceBenefits and practice of open science
Benefits and practice of open science
 
Research Data Management: a gentle introduction
Research Data Management: a gentle introductionResearch Data Management: a gentle introduction
Research Data Management: a gentle introduction
 
Managing and sharing data
Managing and sharing dataManaging and sharing data
Managing and sharing data
 
Clipper jisc rdn cambridge 2016
Clipper jisc rdn cambridge 2016Clipper jisc rdn cambridge 2016
Clipper jisc rdn cambridge 2016
 
Clipper, research data network
Clipper, research data networkClipper, research data network
Clipper, research data network
 
Videolectures for ocwc2010
Videolectures for ocwc2010Videolectures for ocwc2010
Videolectures for ocwc2010
 
Semantic Web / Linked Data Technologies
Semantic Web / Linked Data TechnologiesSemantic Web / Linked Data Technologies
Semantic Web / Linked Data Technologies
 
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
 

Plus de LIBER Europe

LIBER Europe Covid-19 Research Libraries Survey - December 2020
LIBER Europe Covid-19 Research Libraries Survey - December 2020LIBER Europe Covid-19 Research Libraries Survey - December 2020
LIBER Europe Covid-19 Research Libraries Survey - December 2020LIBER Europe
 
LIBER Webinar: Turning FAIR Data Into Reality
LIBER Webinar: Turning FAIR Data Into RealityLIBER Webinar: Turning FAIR Data Into Reality
LIBER Webinar: Turning FAIR Data Into RealityLIBER Europe
 
Copyright Reform: EU Legislative Process & LIBER Advocacy
Copyright Reform: EU Legislative Process & LIBER AdvocacyCopyright Reform: EU Legislative Process & LIBER Advocacy
Copyright Reform: EU Legislative Process & LIBER AdvocacyLIBER Europe
 
LIBER Webinar: Supporting Data Literacy
LIBER Webinar: Supporting Data LiteracyLIBER Webinar: Supporting Data Literacy
LIBER Webinar: Supporting Data LiteracyLIBER Europe
 
Applying Bourdieu's Field Theory to MLS Curricula Development. Charlotte Nord...
Applying Bourdieu's Field Theory to MLS Curricula Development. Charlotte Nord...Applying Bourdieu's Field Theory to MLS Curricula Development. Charlotte Nord...
Applying Bourdieu's Field Theory to MLS Curricula Development. Charlotte Nord...LIBER Europe
 
Growing a Culture for Change at The University of Manchester Library. Penny H...
Growing a Culture for Change at The University of Manchester Library. Penny H...Growing a Culture for Change at The University of Manchester Library. Penny H...
Growing a Culture for Change at The University of Manchester Library. Penny H...LIBER Europe
 
Knowledge Exchange Consensus: Monitoring of Open Access Publications and Cost...
Knowledge Exchange Consensus: Monitoring of Open Access Publications and Cost...Knowledge Exchange Consensus: Monitoring of Open Access Publications and Cost...
Knowledge Exchange Consensus: Monitoring of Open Access Publications and Cost...LIBER Europe
 
The GND initiative 2017-2021: Developing a Backbone for the Web of Cultural a...
The GND initiative 2017-2021: Developing a Backbone for the Web of Cultural a...The GND initiative 2017-2021: Developing a Backbone for the Web of Cultural a...
The GND initiative 2017-2021: Developing a Backbone for the Web of Cultural a...LIBER Europe
 
The Role of Libraries in the Adoption of Research Data Management. Ingeborg V...
The Role of Libraries in the Adoption of Research Data Management. Ingeborg V...The Role of Libraries in the Adoption of Research Data Management. Ingeborg V...
The Role of Libraries in the Adoption of Research Data Management. Ingeborg V...LIBER Europe
 
LibChain – Open, Verifiable and Anonymous Access Management. Juan Cabello, P...
 LibChain – Open, Verifiable and Anonymous Access Management. Juan Cabello, P... LibChain – Open, Verifiable and Anonymous Access Management. Juan Cabello, P...
LibChain – Open, Verifiable and Anonymous Access Management. Juan Cabello, P...LIBER Europe
 
From Open Access to Open Data: Collaborative Work in the University Libraries...
From Open Access to Open Data: Collaborative Work in the University Libraries...From Open Access to Open Data: Collaborative Work in the University Libraries...
From Open Access to Open Data: Collaborative Work in the University Libraries...LIBER Europe
 
The Perks and Challenges of Drawing Maps and Walking at the Same Time
The Perks and Challenges of Drawing Maps and Walking at the Same TimeThe Perks and Challenges of Drawing Maps and Walking at the Same Time
The Perks and Challenges of Drawing Maps and Walking at the Same TimeLIBER Europe
 
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...LIBER Europe
 
Text and Data Mining : Making the Most of a Copyright Exception. Julien Roche...
Text and Data Mining : Making the Most of a Copyright Exception. Julien Roche...Text and Data Mining : Making the Most of a Copyright Exception. Julien Roche...
Text and Data Mining : Making the Most of a Copyright Exception. Julien Roche...LIBER Europe
 
Adoption and Integration of Persistent Identifiers in European Research Infor...
Adoption and Integration of Persistent Identifiers in European Research Infor...Adoption and Integration of Persistent Identifiers in European Research Infor...
Adoption and Integration of Persistent Identifiers in European Research Infor...LIBER Europe
 
Digital Humanities Clinics – Leading Dutch Librarians into DH. Lotte Wilms, N...
Digital Humanities Clinics – Leading Dutch Librarians into DH. Lotte Wilms, N...Digital Humanities Clinics – Leading Dutch Librarians into DH. Lotte Wilms, N...
Digital Humanities Clinics – Leading Dutch Librarians into DH. Lotte Wilms, N...LIBER Europe
 
COUNTER Standards for Open Access: The Value of Measuring/The Measuring of Va...
COUNTER Standards for Open Access: The Value of Measuring/The Measuring of Va...COUNTER Standards for Open Access: The Value of Measuring/The Measuring of Va...
COUNTER Standards for Open Access: The Value of Measuring/The Measuring of Va...LIBER Europe
 
Enabling the Exchange and use of Data in Agriculture
Enabling the Exchange and use of Data in AgricultureEnabling the Exchange and use of Data in Agriculture
Enabling the Exchange and use of Data in AgricultureLIBER Europe
 
GDPR - Thoughts on the EU Data Protection Regulation, Research and Libraries
GDPR - Thoughts on the EU Data Protection Regulation, Research and LibrariesGDPR - Thoughts on the EU Data Protection Regulation, Research and Libraries
GDPR - Thoughts on the EU Data Protection Regulation, Research and LibrariesLIBER Europe
 
Research Data Services and Data Collections: Library Synergies for Economic R...
Research Data Services and Data Collections: Library Synergies for Economic R...Research Data Services and Data Collections: Library Synergies for Economic R...
Research Data Services and Data Collections: Library Synergies for Economic R...LIBER Europe
 

Plus de LIBER Europe (20)

LIBER Europe Covid-19 Research Libraries Survey - December 2020
LIBER Europe Covid-19 Research Libraries Survey - December 2020LIBER Europe Covid-19 Research Libraries Survey - December 2020
LIBER Europe Covid-19 Research Libraries Survey - December 2020
 
LIBER Webinar: Turning FAIR Data Into Reality
LIBER Webinar: Turning FAIR Data Into RealityLIBER Webinar: Turning FAIR Data Into Reality
LIBER Webinar: Turning FAIR Data Into Reality
 
Copyright Reform: EU Legislative Process & LIBER Advocacy
Copyright Reform: EU Legislative Process & LIBER AdvocacyCopyright Reform: EU Legislative Process & LIBER Advocacy
Copyright Reform: EU Legislative Process & LIBER Advocacy
 
LIBER Webinar: Supporting Data Literacy
LIBER Webinar: Supporting Data LiteracyLIBER Webinar: Supporting Data Literacy
LIBER Webinar: Supporting Data Literacy
 
Applying Bourdieu's Field Theory to MLS Curricula Development. Charlotte Nord...
Applying Bourdieu's Field Theory to MLS Curricula Development. Charlotte Nord...Applying Bourdieu's Field Theory to MLS Curricula Development. Charlotte Nord...
Applying Bourdieu's Field Theory to MLS Curricula Development. Charlotte Nord...
 
Growing a Culture for Change at The University of Manchester Library. Penny H...
Growing a Culture for Change at The University of Manchester Library. Penny H...Growing a Culture for Change at The University of Manchester Library. Penny H...
Growing a Culture for Change at The University of Manchester Library. Penny H...
 
Knowledge Exchange Consensus: Monitoring of Open Access Publications and Cost...
Knowledge Exchange Consensus: Monitoring of Open Access Publications and Cost...Knowledge Exchange Consensus: Monitoring of Open Access Publications and Cost...
Knowledge Exchange Consensus: Monitoring of Open Access Publications and Cost...
 
The GND initiative 2017-2021: Developing a Backbone for the Web of Cultural a...
The GND initiative 2017-2021: Developing a Backbone for the Web of Cultural a...The GND initiative 2017-2021: Developing a Backbone for the Web of Cultural a...
The GND initiative 2017-2021: Developing a Backbone for the Web of Cultural a...
 
The Role of Libraries in the Adoption of Research Data Management. Ingeborg V...
The Role of Libraries in the Adoption of Research Data Management. Ingeborg V...The Role of Libraries in the Adoption of Research Data Management. Ingeborg V...
The Role of Libraries in the Adoption of Research Data Management. Ingeborg V...
 
LibChain – Open, Verifiable and Anonymous Access Management. Juan Cabello, P...
 LibChain – Open, Verifiable and Anonymous Access Management. Juan Cabello, P... LibChain – Open, Verifiable and Anonymous Access Management. Juan Cabello, P...
LibChain – Open, Verifiable and Anonymous Access Management. Juan Cabello, P...
 
From Open Access to Open Data: Collaborative Work in the University Libraries...
From Open Access to Open Data: Collaborative Work in the University Libraries...From Open Access to Open Data: Collaborative Work in the University Libraries...
From Open Access to Open Data: Collaborative Work in the University Libraries...
 
The Perks and Challenges of Drawing Maps and Walking at the Same Time
The Perks and Challenges of Drawing Maps and Walking at the Same TimeThe Perks and Challenges of Drawing Maps and Walking at the Same Time
The Perks and Challenges of Drawing Maps and Walking at the Same Time
 
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
TIB AV-Portal: Semantic Content Mining with Semi-Automatic Metadata Editing. ...
 
Text and Data Mining : Making the Most of a Copyright Exception. Julien Roche...
Text and Data Mining : Making the Most of a Copyright Exception. Julien Roche...Text and Data Mining : Making the Most of a Copyright Exception. Julien Roche...
Text and Data Mining : Making the Most of a Copyright Exception. Julien Roche...
 
Adoption and Integration of Persistent Identifiers in European Research Infor...
Adoption and Integration of Persistent Identifiers in European Research Infor...Adoption and Integration of Persistent Identifiers in European Research Infor...
Adoption and Integration of Persistent Identifiers in European Research Infor...
 
Digital Humanities Clinics – Leading Dutch Librarians into DH. Lotte Wilms, N...
Digital Humanities Clinics – Leading Dutch Librarians into DH. Lotte Wilms, N...Digital Humanities Clinics – Leading Dutch Librarians into DH. Lotte Wilms, N...
Digital Humanities Clinics – Leading Dutch Librarians into DH. Lotte Wilms, N...
 
COUNTER Standards for Open Access: The Value of Measuring/The Measuring of Va...
COUNTER Standards for Open Access: The Value of Measuring/The Measuring of Va...COUNTER Standards for Open Access: The Value of Measuring/The Measuring of Va...
COUNTER Standards for Open Access: The Value of Measuring/The Measuring of Va...
 
Enabling the Exchange and use of Data in Agriculture
Enabling the Exchange and use of Data in AgricultureEnabling the Exchange and use of Data in Agriculture
Enabling the Exchange and use of Data in Agriculture
 
GDPR - Thoughts on the EU Data Protection Regulation, Research and Libraries
GDPR - Thoughts on the EU Data Protection Regulation, Research and LibrariesGDPR - Thoughts on the EU Data Protection Regulation, Research and Libraries
GDPR - Thoughts on the EU Data Protection Regulation, Research and Libraries
 
Research Data Services and Data Collections: Library Synergies for Economic R...
Research Data Services and Data Collections: Library Synergies for Economic R...Research Data Services and Data Collections: Library Synergies for Economic R...
Research Data Services and Data Collections: Library Synergies for Economic R...
 

Dernier

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

  • 1. The Perfect Swell: Workshop on Text and Data Mining for Data Driven Innovation The research infrastructure perspective Dieter Van Uytvanck Max Planck Institute for Psycholinguistics Dieter.VanUytvanck@mpi.nl TDM workshop, London 2013-09-27
  • 2. CLARIN? § Common Language Resources and Technology Infrastructure § aims at providing easy and sustainable access for scholars in the humanities and social sciences § to digital language data (in written, spoken, video or multimodal form) § to advanced tools to discover, explore, exploit, annotate, analyse or combine them § independent of where they are located: a shared distributed infrastructure § More information: www.clarin.eu
  • 3. Language resources: rich variety § Modality: written, spoken, signed § Additional channels: eye movements, gestures, neuroimaging data (EEG, fMRI, …), etc. [Figure labels: Annotations; Data: the basis for research]
  • 4. Language resources: rich variety § Location: § data from all over the world (including some very remote corners) § … and from the world wide web, smartphones, … § Time: § old historic collections (hieroglyphs, manuscripts, rock carvings, …), often OCR’ed, digitised and annotated § up to real-time data gathered from social networks § Origin: § elicited (experiments) § natural language use (“in the wild”) [Figure labels: Annotations; Data: the basis for research]
  • 5. Data mining in CLARIN § very important paradigm in language resource processing § major shift from rule-based to data-driven systems § not only text, also multimedia § importance of § access to primary data for fellow researchers: TDM needs access to whole works, not only to snippets and sentences § replicating experiments, which is extremely important § technical support: virtual collections make it possible to refer to large online data sets § a safe legal setting for researchers (license signing does not scale to 500,000 texts that are automatically collected from thousands of websites)
  • 6. Data mining in CLARIN § some examples to demonstrate the variation and nature of data mining based on language resources
  • 7. Some examples (1) § Mass text analysis (Petersen et al., 2012): doi:10.1038/srep00313
  • 8. Some examples (2) § AUVIS face/hand tracking analysis: http://tla.mpi.nl/projects_info/auvis/ [Figure label: Head/Hands Tracking]
  • 9. Some examples (3) § Stylometry and plagiarism detection: http://www.clips.ua.ac.be/category/projects/stylometry § e.g. Mike Kestemont, http://www.mike-kestemont.org/?p=362
  • 10. Some examples (4) § Language evolution analysis with phylogenetic trees (Bouckaert et al., 2012): doi:10.1126/science.1219669 § Excerpt from the paper: “At the other extreme, we fit a ‘sailor’ model with no reluctance to move into water and rapid movement across water. Consistent with the findings based on the RRW model, each of the landscape-based models supports the Anatolian farming theory of Indo-European origin (Table 1). Our results strongly support an Anatolian homeland for the Indo-European language family. The inferred location (Fig. 1) and timing [95% highest posterior density (HPD) interval, 7116 to 10,410 years ago] of Indo-European origin is congruent with the proposal that the family began to diverge with the spread of agriculture from …” § Fig. 2. Map and maximum clade credibility tree showing the diversification of the major Indo-European subfamilies. The tree shows the timing of the emergence of the major branches and their subsequent diversification. The inferred location at the root of each subfamily is shown on the map, colored to match the corresponding branches on the tree. Albanian, Armenian, and Greek subfamilies are shown separately for clarity (inset). Contours represent the 95% (largest), 75%, and 50% HPD regions, based on kernel density estimates (15). § Bayes factors per phylogeographic analysis (Anatolian vs. steppe I / Anatolian vs. steppe II): RRW, all languages: 175.0 / 159.3; RRW, ancient languages only: 1404.2 / 1582.6; RRW, contemporary languages only: 12.0 / 11.4; Landscape aware, diffusion: 298.2 / 141.9; Landscape aware, migration from land into water less likely than from land to land by a factor of 10: 197.7 / 92.3; Landscape aware, migration from land into water less likely than from land to land by a factor of 100: 337.3 / 161.0; Landscape aware, sailor: 236.0 / 111.7
  • 11. The research infrastructure role § Data sets: § Long-term preservation (archiving) § Making them citable (persistent identifiers) and findable (metadata) § Making access easier with federated login § Lowering the threshold to use advanced software § offer web front-ends, web service chains § cooperation with computing centres for heavy tasks § Know-how building & support § about the nature of the resources and tools § technical matters § legal issues
  • 12. Legal perspective on resources § Rough classification of language resources available via the CLARIN centres: § Public § full access, no restrictions at all § e.g. parallel corpora from the EU Parliament § Academic § available for all academic users § e.g. the Spoken Dutch Corpus (radio recordings, …) § Restricted § everything more restricted than Academic, i.e. subject to personalised access rules § e.g. video from doctor-patient interaction
  • 13. Legal perspective on resources § CLARIN recommends CC licenses for new resources, as this is the least problematic option for everyone in the long run. Such resources can be made publicly available. § For older material, we try to distribute it as freely as can be negotiated. For this material we offer two categories: § resources free for researchers § resources requiring individual permission by the owner. § Note that not everything is about copyright. § We also have to deal with personal data, which can only be provided to individual researchers for a limited time unless it is anonymized. § Ethical perspectives should also be taken into account (e.g. asking participants at the time of recording whether they are OK with data mining/processing).
  • 14. Technical Perspective (1) § The above restrictions can be realized by requiring: § PUB: no identification of the user and no individual permission, i.e. the resources are free for all and publicly available. § ACA: identification of the user, but no individual permission, e.g. CLARIN-distributed resources for academic use. § RES: identification of the user and individual usage permission, i.e. the resources are available only to individual researchers, e.g. resources containing personal data.
  • 15. Technical Perspective (2) § Federated Identity Management (“Shibboleth”) § allows users to access resources at a remote server § with institutional credentials § makes it relatively straightforward to recognize academic users and grant them access to restricted resources § details: http://clarin.eu/node/3788
  • 16. Future perspective for legal exception framework § As we in CLARIN are capable of § identifying researchers and § protecting the resources from other users, § CLARIN already has all the technical prerequisites needed for implementing and supervising a broad research exception in the EU such as the one already in effect in the Netherlands.
  • 17. Conclusion § Data mining plays an increasingly important role in (language resource-based) research § Research infrastructures try to help academics make efficient use of the existing resources and tools § Many technical issues have already been addressed (e.g. authentication of researchers) § We hope the remaining legal (copyright) issues can be addressed by a research exception (or, similarly, a concept of fair use)
  • 18. Acknowledgement § Thanks to Krister Lindén and Erik Ketzan from the CLARIN legal issues committee for their valuable input! § Thank you for your attention!
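
The PUB/ACA/RES scheme from the "Technical Perspective (1)" slide can be read as a small decision rule over two conditions: whether the user has been identified, and whether the owner has granted that user individual permission. The following is only a minimal sketch of that reading, not CLARIN's actual implementation; the names AccessTier, Request and may_access are hypothetical, and the sketch assumes that "identification" already covers recognising the user as an academic (for example through federated login).

```python
from dataclasses import dataclass
from enum import Enum


class AccessTier(Enum):
    """Hypothetical labels for the three categories named on the slide."""
    PUB = "public"
    ACA = "academic"
    RES = "restricted"


@dataclass(frozen=True)
class Request:
    # Assumption: True once the user has been identified,
    # e.g. via institutional credentials.
    user_identified: bool
    # Assumption: True once the resource owner has granted this
    # particular user permission to use the resource.
    individual_permission: bool


def may_access(tier: AccessTier, request: Request) -> bool:
    """Apply the slide's rule: PUB needs nothing, ACA needs identification,
    RES needs identification plus individual permission."""
    if tier is AccessTier.PUB:
        return True
    if tier is AccessTier.ACA:
        return request.user_identified
    return request.user_identified and request.individual_permission
```

For instance, may_access(AccessTier.ACA, Request(user_identified=True, individual_permission=False)) evaluates to True, while the same request against AccessTier.RES evaluates to False, because the individual permission is missing.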