
Creating the CoMeRe corpus bank: a Corpus-écrits, ORTOLANG and TEI-CMC partnership


Presented at the general assembly (AG) of the Corpus-écrits consortium: progress of the CoMeRe project

  • I'll do a final edit of this slide in the next few days (in Rome, before the meeting...)
  • Talk about citations / references
  • Structured research journal: created by the researcher, not by a documentalist.
    CoMeRe repository
    Polititweet
    OLAC: reduced metadata for CLARIN
    Skip one level: polititweet
    Go to the polititweet details: PDF manual
    Then Simuligne: diversity with LMS, participants

  • In the first one, corrections can be made by hand.
    Unfortunately, discussions are organised in very varied ways, and authors quite often do not follow these guidelines. Figure 3‑3 gives an illustration. One person explicitly types the characters "Réponse :" at the start of their text, then appears to sign using the indentation mark, but only for that signature. Here the signature gives only an IP address and the date. It is hard to tell where the first author's text ends. The person replying seems to intervene twice, without respecting the formats, and seems to end with a signature indication, Curry (though not in the Wikipedia sense). If we follow the link attached to that last word, we find not an author page but a general Wikipedia page (cf. Figure 3‑4)! Processing such pages automatically is therefore problematic.
  • An Interaction Space is an abstract concept located in time (with absolute beginning and ending dates, hence a time frame) in which interactions between a set of participants occur within an online location. The online location is defined by the properties of the set of environments used by the set of participants.
  • In one of our papers, which will appear in the CALL journal and whose data are already online in Mulce, Ciara Wigham discusses the interplay between audio and textchat.
    Here is an extract from Archi21. In the left column you have the transcription of the audio of one learner, who describes his feelings about the ongoing process of his architectural project. He is a native French speaker and speaks English as his L2. In the three columns on the right, you find textchat turns from the tutor and two other learners belonging to the same architectural project group.
    Let me show you a short video.
    In this example of conversation doubling, the acts in the text chat respond to the voice chat (blue arrows), but acts in the voice chat equally respond to the text chat (orange arrows). Text chat acts respond to interaction in both the voice chat and text chat modalities and prompt interaction in both modalities.
  • http://88milsms.huma-num.fr/corpus.html
  • There are 3 main criteria that research data should meet in order to be considered OpenData.

    Besides the data obviously being available, the first interesting point is that the data can be accessed in order to be reused and mixed with other data, and the licence should explicitly say so.
    The second interesting point is that constraints on reuse should be reduced to a minimum: the definition stipulates that "non-commercial" restrictions that would prevent "commercial" use, or restrictions of use for certain purposes, are not allowed.
  • Creating the CoMeRe corpus bank: a Corpus-écrits, ORTOLANG and TEI-CMC partnership

    1. Corpus-écrits general assembly (AG), 21 November
       Consortium Corpus-écrits; SIG TEI-CMC; Open Resources and TOols for LANGuage (ORTOLANG)
       http://comere.org
       http://hdl.handle.net/11403/comere
       Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham, Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque, Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin
    2. http://www.tei-c.org/Activities/SIG/CMC/
       http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
    3. Our subject and goals
       Our subject:
       - building and annotating corpora of computer-mediated communication (CMC), as resources for empirical research on CMC phenomena in the Humanities (linguistics, communication science, language technology, …)
       This resource must therefore be freely accessible (open access research data) so that research communities can reuse it. We will come back to this point later.
    4. Our subject and goals
       Computer-mediated communication (CMC): all genres of interpersonal communication mediated through computer networks (the internet) and used via personal computers and/or mobile devices: chats, online forums, instant messaging, tweets, comments on weblogs, discussions in wikis and on "social network" sites, interactions in multimodal communication environments such as Skype, MMORPGs or "virtual worlds" (e.g., SecondLife), SMS, WhatsApp, ...
    5. Our subject and goals
       Our subject:
       - building and annotating corpora of computer-mediated communication (CMC), as resources for empirical research on CMC phenomena in the Humanities (linguistics, communication science, language technology, …)
       Our vision: these corpora shall be …
       - interoperable (i) with each other and (ii) with other types of linguistic corpora (text corpora, speech corpora)
       - represented in conformance with established encoding standards in the field of Digital Humanities
       - linguistically annotated in order to allow for sophisticated queries and language-focused research
    6. Our subject and goals
       The problem / challenge:
       - So far, there are no established standards for the representation of CMC genres
       - Established standards for the representation of text genres do not include models for the representation of the peculiarities of CMC
       - "Off the shelf" NLP tools for automatic linguistic analysis and annotation (tokenizers, part-of-speech taggers, lemmatizers, normalizers, parsers) do not perform well on CMC data (because they have usually been trained on edited text and therefore cannot handle "non-standard" phenomena and multimodal elements in CMC discourse)
    7. Our subject and goals
       Our goals:
       - work on solutions for these desiderata
       - develop suggestions for standards for
         - packaging and sharing (mono- and multimodal) CMC corpora,
         - modeling these types of "texts" within a framework that conforms to the encoding framework of the Text Encoding Initiative (TEI) and thus to a widely accepted de facto standard in the field of Digital Humanities,
         - processing and annotating these corpora (part-of-speech, normalization, ...) with NLP tools.
    8. Who belongs to our community (so far)? Our kernel projects and founding members
       http://glottoweb.org/web2corpus/
       http://hdl.handle.net/11403/comere
       French CMC corpora: Infrastructure for languages; National consortium on corpora; National infrastructure for Digital Humanities
       Scientific network "Empirical research of CMC": http://www.empirikom.net
       Dortmund Chat Corpus: http://www.chatkorpus.tu-dortmund.de
       German Reference Corpus of CMC: http://www.tinyurl.com/derik-llc
       Wikipedia corpus in DeReKo (Mannheim)
       German CMC corpora
       Dutch CMC corpora: SoNaR (Stevin Nederlandstalig Referentiecorpus)
       Italian CMC pilot corpus
    9. Activities and initiatives (past and future)
       2013, 2014:
       - European workshops on CMC corpora (Dortmund, ...)
       - special journal issue (JLCL)
       Our pathway:
       - 2013: creation of the TEI-CMC SIG
       - End of 2014: publication of French CMC corpora (CoMeRe) in open access, all in TEI-CMC
       - 2015: application to CLARIN-DE; transform existing German corpora into TEI-CMC
       - October 2015: international CMC conference, Rennes (Ledegen)
       - 2015: submission of the TEI-CMC model
       - 2015: launch of a larger CMC-corpora community
       - 2016: common system of basic CMC annotations (POS tagging)
    10. Project supported by the national consortium Corpus-écrits, sub-part of Huma-Num, and Ortolang
        Objective: a kernel corpus assembling existing corpora of different CMC genres and new corpora built from data extracted from the Internet. These heterogeneous corpora will be structured and processed in a uniform way and complemented with metadata. CoMeRe will be released as OpenData through the national infrastructure Ortolang, following constraints which will be reused for the forthcoming "Corpus de Référence du Français".
        Variety + Standards + Open Access
        http://comere.org
        http://hdl.handle.net/11403/comere
    11. Individual depositor; local LRL server; engineer: Kun Jin; quality group; discussion with depositor; NLP tagging group: TEI-v2; TEI-v1
        Funding: ORTOLANG > Corpus-écrits > LRL
    14. Ref                               Tokens    Partici.         Posts                                           Envir.
        (Antoniadis, 2014)                449 313   359              22 052                                          SMS
        (Falaise, 2014)                   35 M      25 000           3 M                                             textchat
        (Ledegen, 2014)                   357 000   850              22 000                                          SMS
        (Reffay et al., 2014)             600 000   67 + 4 groups    textchat: 6 790; emails: 2 030; forums: 2 686   LMS
        (Yun, Chanier, 2014)              77 605    31 + 2 courses   7 750                                           textchat
        (Abendroth-Timmer et al., 2014)   273 546   26 + 4 groups    1 200                                           Blog
        (Longhi, Marinica, 2014)          567 851   205              34 273                                          Tweet
        Context labels: informal, business, informal, informal, education, education, education, politic
    25. Mono (mode, modality): Textchat, Forum, SMS, Tweets, Email, Blogs (image not means of interaction)
        Verbal
        Verbal & Non-verbal
        Multi Modalities: LMS (email, forum, chat)
        Multi Modes: Conf system (Audiochat, Textchat)
        Conference system, 3D environment, etc.: Audiochat, Textchat, Icons, Collec prod (Whiteboard, Word proc., Semantic maps), Avatars, …
    26. Time(s); Interaction Space; Locations; Course; Session; Channel; Simultaneity; Participants; Environments; Author; Address(es); Group; Network
    27. New macro-level elements
        http://wiki.tei-c.org/index.php/SIG:CMC/Draft:_A_basic_schema_for_representing_CMC_in_TEI
    28. Modality interplay (1.5 min video)
        * Paper: (Wigham & Chanier, 2013), CALL journal
        * Data: (Wigham, 2013), LETEC corpus
        Computer-Mediated Communication in TEI: What Lies Ahead, TEI-MM 2013 (Rome)
    29. Multimodality: verbal and non-verbal (Wigham & Chanier, 2013)
    30. Context: Lyceum conf environment, 3 learners (English L2) working in a word processor: one writing, others helping
        Collab word processor; Audio: clarification; Textchat: correction (with error); Textchat: request confirmation
        Now in TEI-speech
    31. http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
    33. the user is authorised to download a copy of the corpus […]
        • reuse (reproduction, distribution) of non-substantial parts of corpus XXX is authorised […]
        • reuse is subject to the condition of citing in full, as credits: […]
        • reuse (reproduction, distribution) of substantial parts of corpus XXX is not permitted under this licence.
        I consent to these terms of use (required in order to access the corpus)
        Example of a corpus licence displayed on the National Infrastructure for Digital Humanities and considered "open access". Viewing but not re-using: is that OA?
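A minimal sketch, in Python, of how the Interaction Space described in the notes and on slide 26 (a time frame, a set of participants, and an online location defined by the environments they use) could be held as a data structure. The class, field and method names (`Environment`, `InteractionSpace`, `online_location`) are illustrative assumptions and are not elements of the TEI-CMC schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Environment:
    """One tool used by the participants (hypothetical fields)."""
    name: str      # e.g. "audiochat", "textchat"
    modality: str  # e.g. "verbal", "non-verbal"


@dataclass
class InteractionSpace:
    """Abstract space, bounded in absolute time, where participants interact."""
    begin: datetime
    end: datetime
    participants: set[str] = field(default_factory=set)
    environments: list[Environment] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A time frame needs a beginning that does not follow its end.
        if self.end < self.begin:
            raise ValueError("end must not precede begin")

    @property
    def duration(self) -> timedelta:
        """Length of the time frame."""
        return self.end - self.begin

    def online_location(self) -> frozenset[tuple[str, str]]:
        """The location, expressed as the properties of the environments used."""
        return frozenset((e.name, e.modality) for e in self.environments)


# Example with made-up values: a 90-minute session using two environments.
session = InteractionSpace(
    begin=datetime(2014, 11, 21, 9, 0),
    end=datetime(2014, 11, 21, 10, 30),
    participants={"tutor", "learner1", "learner2"},
    environments=[
        Environment("audiochat", "verbal"),
        Environment("textchat", "verbal"),
    ],
)
```

Deriving the location from the environments, rather than storing it separately, mirrors the definition in the notes; a real encoding would follow the draft schema on the TEI wiki.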