Corpus-écrits General Assembly (AG), 21 November: meeting highlights

Consortium Corpus-écrits
TEI-CMC SIG
ORTOLANG (Open Resources and TOols for LANGuage)
http://comere.org
http://hdl.handle.net/11403/comere
Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham,
Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque,
Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin

http://www.tei-c.org/Activities/SIG/CMC/
http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication

Our subject and goals
Our subject:
 building and annotating corpora of computer-mediated
communication (CMC) – as resources for empirical research on
CMC phenomena in the Humanities (linguistics, communication
science, language technology, …)
This resource must therefore be freely accessible (open
access research data) so that research communities
can reuse it.
We will return to this point later.

Computer-mediated communication (CMC):
All genres of interpersonal communication mediated
through computer networks (the internet) and used
via personal computers and/or mobile devices: chats,
online forums, instant messaging, tweets, comments
on weblogs, discussions in wikis and on “social network”
sites, interactions in multimodal communication
environments such as Skype, MMORPGs or “virtual
worlds” (e.g., Second Life), SMS, WhatsApp, …

Our vision: These corpora shall be …
 interoperable (i) with each other and (ii) with other types of
linguistic corpora (text corpora, speech corpora)
 represented in conformance with established encoding standards in
the field of Digital Humanities
 linguistically annotated in order to allow for sophisticated
queries and language-focused research

The problem / challenge:
 So far, there are no established standards for the
representation of CMC genres
 Established standards for the representation of text genres do
not include models for the representation of the peculiarities of
CMC
 “Off the shelf” NLP tools for automatic linguistic analysis and
annotation (tokenizers, part-of-speech taggers, lemmatizers,
normalizers, parsers) do not perform well on CMC data
(because they have usually been trained on edited text and
therefore can’t handle “non-standard” phenomena and
multimodal elements in CMC discourse)
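
To make the tokenization problem concrete, here is a minimal Python sketch. It is not one of the tools discussed in this talk: the sample message, the pattern CMC_TOKEN and both function names are assumptions made up for illustration. It contrasts a splitter designed for edited text with one that keeps a few CMC units (mentions, hashtags, emoticons, URLs) whole.

```python
import re

# Hypothetical pattern, for illustration only: it keeps a few CMC-specific
# units intact instead of splitting them at every punctuation mark.
CMC_TOKEN = re.compile(
    r"""
    https?://\S+               # URLs
    | [@#]\w+                  # @mentions and #hashtags
    | [:;=][-^']?[()DPp/\\|]   # a few Western-style emoticons such as :-) ;) :P
    | \w+(?:'\w+)*             # words, including simple apostrophe forms
    | [^\w\s]                  # any other single non-space symbol
    """,
    re.VERBOSE,
)


def naive_tokenize(text):
    """Baseline suited to edited text: isolate every punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text)


def cmc_tokenize(text):
    """Tokenize while preserving the CMC units listed in CMC_TOKEN."""
    return CMC_TOKEN.findall(text)


# Invented sample message (not taken from any corpus).
message = "@anna lol see u 2moro :-) #exam http://comere.org"

naive_tokens = naive_tokenize(message)  # breaks ':-)', '#exam' and the URL apart
cmc_tokens = cmc_tokenize(message)      # keeps them as single tokens
```

The baseline scatters the emoticon, the hashtag and the URL over many one-character tokens, which is the kind of noise a downstream part-of-speech tagger trained on edited text then has to cope with. A real CMC tokenizer would need a far richer inventory than this pattern.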

Our goals:
 work on solutions for these desiderata
 develop suggestions for standards for
- packaging and sharing (mono- and multimodal) CMC
corpora,
- modeling these types of “texts” within a framework that
conforms to the encoding framework of the Text
Encoding Initiative (TEI), a widely accepted de facto
standard in the field of Digital Humanities,
- processing and annotating these corpora (part-of-speech,
normalization, ...) with NLP tools.
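
As a rough illustration of what modeling a CMC message in a TEI-style framework could look like, the Python sketch below serializes one message as an XML element in the TEI namespace. The element name `post`, the choice of the `who` and `when` attributes, the helper `make_post` and the sample message are all assumptions for illustration; they are not a quotation of the SIG's schema.

```python
import xml.etree.ElementTree as ET

# The TEI namespace URI; registering it with an empty prefix makes it the
# default namespace in the serialized output.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def make_post(author_id, timestamp, text):
    """Build one hypothetical <post> element holding a single paragraph."""
    post = ET.Element(
        f"{{{TEI_NS}}}post",
        {"who": f"#{author_id}", "when": timestamp},  # assumed attribute choice
    )
    paragraph = ET.SubElement(post, f"{{{TEI_NS}}}p")
    paragraph.text = text
    return post


# Invented sample message (not taken from any corpus).
post = make_post("p01", "2014-11-21T10:15:00", "c u 2moro :)")
xml_string = ET.tostring(post, encoding="unicode")
```

Keeping author and time as attributes on the message element is what would let the same query work across SMS, chat and tweet corpora, which is the interoperability goal stated above.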

Who belongs to our community (so far)?
Our kernel projects and founding members
http://glottoweb.org/web2corpus/
French CMC corpora
Infrastructure for languages
National consortium on corpora
National infrastructure
for Digital Humanities
Scientific network “Empirical research on CMC”
http://www.empirikom.net
Dortmund Chat Corpus
http://www.chatkorpus.tu-dortmund.de
German Reference Corpus of CMC
http://www.tinyurl.com/derik-llc
Wikipedia corpus in DeReKo
(Mannheim)
German CMC corpora
Dutch CMC corpora
SoNaR
(Stevin Nederlandstalig Referentiecorpus)
Italian CMC pilot corpus

Activities and initiatives (past and future)
2013, 2014
- European workshops on CMC corpora (Dortmund)
- special journal issue (JLCL)
Our pathway
- 2013: creation of the TEI-CMC SIG
- End of 2014: publication of the French CMC corpora (CoMeRe) in open access, all in TEI-CMC
- 2015: application to CLARIN-DE; transform existing German corpora into TEI-CMC
- October 2015: international CMC conference, Rennes (Ledegen)
- 2015: submission of the TEI-CMC model
- 2015: launch of a larger CMC-corpora community
- 2016: common system of basic CMC annotations (POS tagging)

Project supported by Ortolang and by the national
consortium Corpus-écrits, a sub-part of Huma-Num
Objective: a kernel corpus assembling existing corpora of different CMC
genres and new corpora built from data extracted from the Internet. These
heterogeneous corpora will be structured and processed in a uniform way
and complemented with metadata. CoMeRe will be released as open data
through the national infrastructure Ortolang, following constraints that will
be reused for the forthcoming “Corpus de Référence du Français”.
Variety + Standards + Open Access
http://comere.org

Individual depositor
Local LRL server
Engineer: Kun Jin
Quality group
Discussion with depositor
NLP tagging group: TEI-v2
TEI-v1
Funding: ORTOLANG > Corpus-écrits > LRL

Reference                        Tokens    Participants    Posts                                              Environment
(Antoniadis, 2014)               449 313   359             22 052                                             SMS
(Falaise, 2014)                  35 M      25 000          3 M                                                textchat
(Ledegen, 2014)                  357 000   850             22 000                                             SMS
(Reffay et al., 2014)            600 000   67 + 4 groups   textchat: 6 790; emails: 2 030; forums: 2 686      LMS
(Yun, Chanier, 2014)             77 605    31 + 2 courses  7 750                                              textchat
(Abendroth-Timmer et al., 2014)  273 546   26 + 4 groups   1 200                                              Blog
(Longhi, Marinica, 2014)         567 851   205             34 273                                             Tweet
Contexts: informal, business, education, politics

Verbal | Verbal & Non-verbal
Mono-mode, mono-modality:
- Textchat
- Forum
- SMS
- Tweets
- Email
- Blogs (image not a means of interaction)
Multi-modality:
LMS:
- email
- forum
- chat
Multi-mode:
Conf system:
- Audiochat
- Textchat
Conference system, 3D environment, etc.:
- Audiochat
- Textchat
- Icons
- Collective production: whiteboard, word processor, semantic maps
- Avatars
- …

Time(s)
Interaction Space
Locations
Course
Session
Channel
Simultaneity
Participants
Environments
Author
Addressee(s)
Group
Network

http://wiki.tei-c.org/index.php/SIG:CMC/Draft:_A_basic_schema_for_representing_CMC_in_TEI
New macro-level elements

1.5 min video
* Paper: (Wigham & Chanier, 2013), CALL journal
* Data: (Wigham, 2013), LETEC corpus
Modality interplay
Computer-Mediated Communication in TEI: What Lies Ahead, TEI-MM 2013 (Rome)

Multimodality: verbal and non-verbal
(Wigham & Chanier, 2013)

Context: Lyceum conference environment, 3 learners (English L2) working in
a word processor: one writing, the others helping
Collaborative word processor
Audio: clarification
Textchat: correction (with error)
Textchat: request for confirmation
Now in TEI-speech

http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication

• the user is authorised to download a copy of the corpus […]
• reuse (reproduction, distribution) of non-substantial parts of the XXX corpus is permitted […]
• reuse is subject to the condition of citing in full, as credits: […]
• reuse (reproduction, distribution) of substantial parts of the XXX corpus is not permitted under
this licence of use.
I agree to these terms of use (required to access the corpus)
Example of a corpus licence displayed on the National Infrastructure for Digital
Humanities and considered to be “open access”
Viewing but not reusing: is that OA?
