The document discusses building and annotating corpora of computer-mediated communication (CMC) according to established standards. It presents the goals of the TEI-CMC SIG consortium, which include developing standards for representing and processing CMC genres within the TEI framework. The challenges include the lack of standards for CMC and tools that can handle its peculiarities. The consortium aims to work on solutions and develop suggestions for standardizing the sharing and encoding of CMC corpora to make them interoperable resources for research.
1. Corpus-écrits General Assembly, 21 November
Consortium Corpus-écrits
SIG
TEI-CMC
Open Resources and
TOols for LANGuage
http://comere.org
http://hdl.handle.net/11403/comere
Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham,
Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque,
Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin
3. Our subject and goals
Our subject:
building and annotating corpora of computer-mediated
communication (CMC) – as resources for empirical research on
CMC phenomena in the Humanities (linguistics, communication
science, language technology, …)
This resource must therefore be freely accessible (open
access research data) so that research communities
can reuse it
We will come back to this point later
4. Our subject and goals
Computer-mediated communication (CMC):
All genres of interpersonal communication mediated
through computer networks (the internet) and used
via personal computers and/or mobile devices: chats,
online forums, instant messaging, tweets, comments
on weblogs, discussions in wikis and on “social network”
sites, interactions in multimodal communication
environments such as Skype, MMORPGs or “virtual
worlds” (e.g., SecondLife), SMS, WhatsApp, ...
5. Our subject and goals
Our subject:
building and annotating corpora of computer-mediated
communication (CMC) – as resources for empirical research on
CMC phenomena in the Humanities (linguistics, communication
science, language technology, …)
Our vision: These corpora shall be …
interoperable (i) with each other and (ii) with other types of
linguistic corpora (text corpora, speech corpora)
represented in conformance with established encoding standards in
the field of Digital Humanities
linguistically annotated in order to allow for sophisticated
queries and language-focused research
6. Our subject and goals
The problem / challenge:
So far, there are no established standards for the
representation of CMC genres
Established standards for the representation of text genres do
not include models for the representation of the peculiarities of
CMC
“Off the shelf” NLP tools for automatic linguistic analysis and
annotation (tokenizers, part-of-speech taggers, lemmatizers,
normalizers, parsers) do not perform well on CMC data
(because they usually have been trained on edited text and
therefore can’t handle “non-standard” phenomena and
multimodal elements in CMC discourse)
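To make the tokenization problem concrete, here is a minimal sketch in Python. It assumes a hand-written regular expression (the pattern `CMC_TOKEN` and both function names are illustrative, not part of any tool mentioned in these slides) and contrasts a naive word/punctuation splitter with one that keeps URLs, @mentions, #hashtags and simple emoticons as single tokens.

```python
import re

# Illustrative pattern, not taken from any existing tool.
# Alternatives are ordered so that CMC-specific units are tried first.
CMC_TOKEN = re.compile(
    r"""
    https?://\S+                        # URLs
  | [@#]\w+                             # @mentions and #hashtags
  | [:;=][-o^']?[)(\]\[DPpOo/\\|*]      # simple Western emoticons such as :-) or ;P
  | \w+(?:'\w+)*                        # words, with internal apostrophes
  | [^\w\s]                             # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize_naive(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_cmc(text):
    """Tokenize while keeping typical CMC units together."""
    return CMC_TOKEN.findall(text)


message = "lol @anna see http://comere.org :-) #cmc"
naive_tokens = tokenize_naive(message)
cmc_tokens = tokenize_cmc(message)
```

On this message the naive splitter breaks `:-)` into three symbols and separates `@` from `anna`, whereas the CMC-aware pattern keeps each unit whole. A real pipeline would need a far richer inventory (emoji, letter repetitions, code-switching), which is why dedicated adaptation of NLP tools is among the goals.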
7. Our subject and goals
Our goals:
work on solutions for these desiderata
develop suggestions for standards for
- packaging and sharing (mono- and multimodal) CMC
corpora,
- modeling these types of “texts” within a framework which is
conformant with the encoding framework of the Text
Encoding Initiative (TEI) and thus with a widely accepted de facto
standard in the field of Digital Humanities,
- processing and annotating these corpora (part-of-speech,
normalization, ...) with NLP tools.
8. Who belongs to our community (so far)?
Our kernel projects
and founding members
http://glottoweb.org/web2corpus/
http://hdl.handle.net/11403/comere
French CMC corpora
Infrastructure for languages
National consortium on corpora
National infrastructure
for Digital Humanities
Scientific network
„Empirical research of CMC“
http://www.empirikom.net
Dortmund Chat Corpus
http://www.chatkorpus.tu-dortmund.de
German Reference Corpus of CMC
http://www.tinyurl.com/derik-llc
Wikipedia corpus in DeReKo
(Mannheim)
German CMC corpora
Dutch CMC corpora
SoNaR
(Stevin Nederlandstalig Referentiecorpus)
Italian CMC pilot corpus
9. Activities and initiatives (past and future)
2013, 2014
- European workshops on CMC corpora (Dortmund)
- special journal issue (JLCL)
Our pathway
2013: creation of the TEI-CMC SIG
End of 2014: publication of French CMC corpora (CoMeRe) in open access, all in TEI-CMC
2015: application to CLARIN-DE; transform existing German corpora into TEI-CMC
October 2015: international CMC conference, Rennes (Ledegen)
2015: submission of TEI-CMC model
2015: launch larger CMC-corpora community
2016: common system of basic CMC annotations (POS tagging)
10. Project supported by the national
consortium Corpus-écrits, sub-part of
Huma-Num, and Ortolang
Consortium Corpus-écrits
Objective: Kernel corpus assembling existing corpora of different CMC
genres and new corpora built on data extracted from the Internet. These
heterogeneous corpora will be structured and processed in a uniform way,
complemented with metadata. CoMeRe will be released as OpenData
through the national infrastructure Ortolang, following constraints which will
be reused for the forthcoming “Corpus de Référence du Français”.
Variety + Standards + Open Access
http://comere.org
http://hdl.handle.net/11403/comere
11.
Individual depositor
LRL local server
Engineer: Kun Jin
Quality group
Discussion with depositor
NLP annotation group: TEI-v2
TEI-V1
Funding: ORTOLANG > Corpus-écrits > LRL
28. 1.5 min video
* Paper: (Wigham & Chanier, 2013) CALL journal
* Data: (Wigham, 2013) LETEC corpus
Modality interplay
Computer-Mediated Communication in TEI: What Lies Ahead, TEI-MM 2013 (Rome)
29. Multimodality: verbal and nonverbal
(Wigham & Chanier, 2013)
30. Context: Lyceum conferencing environment, 3 learners (English L2) working in
a word processor: one writing, the others helping
Collaborative word processor
Audio: clarification
Textchat: correction (with error)
Textchat: request for confirmation
Now in TEI-speech
33. The user is authorised to download a copy of the corpus […]
• reuse (reproduction, distribution) of non-substantial parts of the XXX corpus is authorised […]
• reuse is subject to the condition of citing in full, as credits: […]
• reuse (reproduction, distribution) of substantial parts of the XXX corpus is not permitted under
this licence of use.
I consent to these terms of use (mandatory to gain access to the corpus)
Example of a corpus licence displayed on the National Infrastructure for Digital
Humanities and considered to be "open access"
Viewing but not re-using: is that OA?
I'll do a final edit of this slide in the next few days (in Rome, before the meeting...)
Talk about citations / references
Structured research journal: created by the researcher, not by a documentalist.
CoMeRe repository
Polititweet
OLAC: reduced metadata for CLARIN
Skip a level: Polititweet
Go to the Polititweet details: PDF manual
Then Simuligne: diversity with LMS, participants
In the first one, corrections can be made by hand.
Unfortunately, discussions are organised in very varied ways. Quite often, authors do not follow these guidelines. Figure 3-3 gives an illustration. One person explicitly types the characters "Réponse :" at the beginning of their text and then appears to sign using the indentation mark, only for this signature. Here the signature gives only an IP address and the date. It is hard to tell where the first author's text ends. The person replying seems to intervene twice, without respecting the formats, and seems to end with a signature indication, Curry (though not in the Wikipedia sense). If we examine the link attached to this last word, we find not an author page but a general Wikipedia page (cf. Figure 3-4)! Processing such pages automatically is therefore problematic.
An Interaction Space is an abstract concept, located in time (with a beginning and an ending date expressed in absolute time, hence a time frame), in which interactions between a set of participants occur within an online location. The online location is defined by the properties of the set of environments used by the set of participants.
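The definition above can be sketched as a small data structure. This is only an illustration in Python under stated assumptions: the class names, field names and the sample values are hypothetical and do not reproduce the TEI-CMC model itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Environment:
    """One tool used by the participants (names are illustrative)."""
    name: str      # e.g. "audio conference"
    modality: str  # e.g. "audio", "textchat"


@dataclass
class InteractionSpace:
    begin: datetime          # absolute start of the time frame
    end: datetime            # absolute end of the time frame
    participants: frozenset  # identifiers of the participants
    environments: tuple      # Environment objects defining the online location

    def __post_init__(self):
        # A time frame must not end before it begins.
        if self.end < self.begin:
            raise ValueError("time frame ends before it begins")

    @property
    def time_frame(self):
        """Length of the interaction space as a timedelta."""
        return self.end - self.begin

    def modalities(self):
        """Set of modalities offered by the environments in use."""
        return {env.modality for env in self.environments}

    def is_multimodal(self):
        return len(self.modalities()) > 1


# Made-up sample values, for illustration only.
space = InteractionSpace(
    begin=datetime(2013, 1, 1, 10, 0),
    end=datetime(2013, 1, 1, 10, 30),
    participants=frozenset({"tutor", "learner1", "learner2"}),
    environments=(
        Environment("audio conference", "audio"),
        Environment("chat window", "textchat"),
    ),
)
```

Deriving the online location from the environments, rather than storing it as a separate label, mirrors the wording of the definition: two sessions on the same platform but with different tools switched on count as different locations.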
In one of our papers, which will appear in the CALL journal and whose corresponding data are already online in Mulce, Ciara Wigham discusses the interplay between audio and textchat.
Here is an extract from Archi21. In the left column you have the transcription of the audio of one learner, who describes his feelings about the ongoing progress of his architectural project. He is a native French speaker and speaks English as his L2. In the three other columns on the right, you find textchat turns from the tutor and two other learners belonging to the same architectural project group.
Let me show you a short video.
In this example of conversation doubling, acts in the text chat respond to the voice chat (blue arrows), but acts in the voice chat equally respond to the text chat (orange arrows); text chat acts respond to interaction in both the voice chat and text chat modalities and prompt interaction in both modalities.
http://88milsms.huma-num.fr/corpus.html
There are three main criteria that research data should meet in order to be considered OpenData.
Besides the data obviously being available, the interesting point is that the data can be accessed in order to be reused and mixed with other data, and the licence should state this explicitly.
The second interesting point is that constraints on reuse should be kept to a minimum: the definition stipulates that ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes, are not allowed.
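The points above can be turned into a rough checklist. The following Python sketch assumes that a licence can be summarised by four yes/no properties; the class `LicenceTerms`, its field names and the function `open_data_issues` are hypothetical and only cover the conditions mentioned in these notes, not a full legal analysis.

```python
from dataclasses import dataclass


@dataclass
class LicenceTerms:
    # Hypothetical yes/no summary of a licence, not a real metadata schema.
    data_available: bool
    reuse_and_mixing_allowed: bool
    non_commercial_only: bool
    restricted_to_certain_purposes: bool


def open_data_issues(terms):
    """Return the reasons why the terms fall short of the points listed above."""
    issues = []
    if not terms.data_available:
        issues.append("data are not available")
    if not terms.reuse_and_mixing_allowed:
        issues.append("reuse and mixing with other data are not explicitly allowed")
    if terms.non_commercial_only:
        issues.append("non-commercial restriction prevents commercial use")
    if terms.restricted_to_certain_purposes:
        issues.append("use is restricted to certain purposes")
    return issues


# A "view but do not reuse" licence: downloadable, but reuse is not granted.
view_only = LicenceTerms(
    data_available=True,
    reuse_and_mixing_allowed=False,
    non_commercial_only=False,
    restricted_to_certain_purposes=False,
)
problems = open_data_issues(view_only)
```

An empty list means that none of the listed obstacles was found. A licence that allows downloading but withholds reuse, as in the example shown on the slide, yields at least one issue.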