Creating the CoMeRe corpus bank: a Corpus-écrits – ORTOLANG – TEI-CMC partnership

Presented at the general assembly (AG) of the Corpus-écrits consortium: progress of the CoMeRe project


  1. AG Corpus-écrits (general assembly), 21 November. Consortium Corpus-écrits; SIG TEI-CMC; ORTOLANG (Open Resources and TOols for LANGuage). http://comere.org http://hdl.handle.net/11403/comere
     Thierry Chanier, Céline Poudat, Julien Longhi, Gudrun Ledegen, Ciara Wigham, Linda Hriba, Kun Jin, Georges Antoniadis, Benoit Sagot, Camille Paloque, Natalia Grabar, Cislaru Georgeta, Achille Falaise, Paul Lotin
  2. http://www.tei-c.org/Activities/SIG/CMC/ http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
  3. Our subject and goals. Our subject: building and annotating corpora of computer-mediated communication (CMC) as resources for empirical research on CMC phenomena in the Humanities (linguistics, communication science, language technology, …). This resource must therefore be freely accessible (open access research data) so that research communities can reuse it. We will come back to this point later.
  4. Our subject and goals. Computer-mediated communication (CMC): all genres of interpersonal communication mediated through computer networks (the internet) and used via personal computers and/or mobile devices: chats, online forums, instant messaging, tweets, comments on weblogs, discussions in wikis and on “social network” sites, interactions in multimodal communication environments such as Skype, MMORPGs or “virtual worlds” (e.g., SecondLife), SMS, WhatsApp, …
  5. Our subject and goals. Our subject: building and annotating corpora of computer-mediated communication (CMC) as resources for empirical research on CMC phenomena in the Humanities (linguistics, communication science, language technology, …). Our vision: these corpora shall be
     - interoperable (i) with each other and (ii) with other types of linguistic corpora (text corpora, speech corpora)
     - represented in conformance with established encoding standards in the field of Digital Humanities
     - linguistically annotated in order to allow for sophisticated queries and language-focused research
  6. Our subject and goals. The problem / challenge:
     - So far, there are no established standards for the representation of CMC genres.
     - Established standards for the representation of text genres do not include models for the peculiarities of CMC.
     - “Off the shelf” NLP tools for automatic linguistic analysis and annotation (tokenizers, part-of-speech taggers, lemmatizers, normalizers, parsers) do not perform well on CMC data, because they have usually been trained on edited text and therefore cannot handle “non-standard” phenomena and multimodal elements in CMC discourse.
  7. Our subject and goals. Our goals:
     - work on solutions for these desiderata
     - develop suggestions for standards for
       - packaging and sharing (mono- and multimodal) CMC corpora,
       - modeling these types of “texts” within a framework which is conformant with the encoding framework of the Text Encoding Initiative (TEI) and thus with a widely accepted de facto standard in the field of Digital Humanities,
       - processing and annotating these corpora (part-of-speech, normalization, ...) with NLP tools.
  8. Who belongs to our community (so far)? Our kernel projects and founding members:
     - http://glottoweb.org/web2corpus/
     - French CMC corpora: http://hdl.handle.net/11403/comere (infrastructure for languages; national consortium on corpora; national infrastructure for Digital Humanities)
     - German CMC corpora: scientific network “Empirical research of CMC” (http://www.empirikom.net); Dortmund Chat Corpus (http://www.chatkorpus.tu-dortmund.de); German Reference Corpus of CMC (http://www.tinyurl.com/derik-llc); Wikipedia corpus in DeReKo (Mannheim)
     - Dutch CMC corpora: SoNaR (Stevin Nederlandstalig Referentiecorpus)
     - Italian CMC pilot corpus
  9. Activities and initiatives (past and future)
     - 2013, 2014: European workshops on CMC corpora (Dortmund); special journal issue (JLCL)
     Our pathway:
     - 2013: creation of the TEI-CMC SIG
     - End of 2014: publication of French CMC corpora (CoMeRe) in open access, all in TEI-CMC
     - 2015: application to CLARIN-DE; transform existing German corpora into TEI-CMC
     - October 2015: international CMC conference, Rennes (Ledegen)
     - 2015: submission of the TEI-CMC model
     - 2015: launch of a larger CMC-corpora community
     - 2016: common system of basic CMC annotations (POS tagging)
  10. Project supported by the national consortium Corpus-écrits, a sub-part of Huma-Num, and by Ortolang. Objective: a kernel corpus assembling existing corpora of different CMC genres and new corpora built from data extracted from the Internet. These heterogeneous corpora will be structured and processed in a uniform way and complemented with metadata. CoMeRe will be released as open data through the national infrastructure Ortolang, following constraints which will be reused for the forthcoming “Corpus de Référence du Français”. Variety + Standards + Open Access. http://comere.org http://hdl.handle.net/11403/comere
  11. Individual depositor; local LRL server; engineer: Kun Jin; quality group; discussion with depositor; NLP (TAL) tagging group; TEI-v2; TEI-V1. Funding: ORTOLANG > Corpus-écrits > LRL
  14. Corpora:
      | Ref | Tokens | Participants | Posts | Environment |
      | (Antoniadis, 2014) | 449 313 | 359 | 22 052 | SMS |
      | (Falaise, 2014) | 35 M | 25 000 | 3 M | textchat |
      | (Ledegen, 2014) | 357 000 | 850 | 22 000 | SMS |
      | (Reffay et al., 2014) | 600 000 | 67 + 4 groups | textchat: 6 790; emails: 2 030; forums: 2 686 | LMS |
      | (Yun, Chanier, 2014) | 77 605 | 31 + 2 courses | 7 750 | textchat |
      | (Abendroth-Timmer et al., 2014) | 273 546 | 26 + 4 groups | 1 200 | Blog |
      | (Longhi, Marinica, 2014) | 567 851 | 205 | 34 273 | Tweet |
      Contexts shown on the slide: informal, business, education, politics.
  25. From verbal to verbal & non-verbal:
      - Mono (mode, modality): textchat, forum, SMS, tweets, email, blogs (image not a means of interaction)
      - Multi modalities: LMS (email, forum, chat)
      - Multi modes: conference system (audiochat, textchat)
      - Conference system, 3D environment, etc.: audiochat, textchat, icons, collective production (whiteboard, word processor, semantic maps), avatars, …
  26. Interaction space: time(s), locations, course, session, channel, simultaneity, participants, environments, author, address(es), group, network
  27. New macro-level elements: http://wiki.tei-c.org/index.php/SIG:CMC/Draft:_A_basic_schema_for_representing_CMC_in_TEI
  28. Modality interplay (1.5 min video). Paper: (Wigham & Chanier, 2013), CALL journal. Data: (Wigham, 2013), LETEC corpus. Computer-Mediated Communication in TEI: What Lies Ahead, TEI-MM 2013 (Rome)
  29. Multimodality: verbal and non-verbal (Wigham & Chanier, 2013)
  30. Context: Lyceum conference environment, 3 learners (English L2) working in a word processor: one writing, the others helping. Collaborative word processor; audio: clarification; textchat: correction (with error); textchat: request for confirmation. Now in TEI-speech
  31. http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
  33. Example of a corpus licence displayed on the National Infrastructure for Digital Humanities and considered "open access":
      • the user is authorised to download one copy of the corpus […]
      • reuse (reproduction, dissemination) of non-substantial parts of the corpus XXX is authorised […]
      • reuse is subject to the condition of citing in full, as credits: […]
      • reuse (reproduction, dissemination) of substantial parts of the corpus XXX is not permitted under this licence.
      "I agree to these terms of use" (mandatory to gain access to the corpus).
      Viewing but not re-using: is that OA?
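Slides 27 and 30 refer to representing CMC posts in TEI without showing any markup. As a rough illustration only, the Python sketch below builds a small TEI-namespaced fragment for two textchat turns. The macro-level element is assumed here to be named `<post>`, as in the SIG draft schema; the attributes (`who`, `when`, `type`), the enclosing `<div type="session">`, the identifiers, the timestamps and the message texts are placeholder assumptions and are not taken from the CoMeRe schema or the LETEC data.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def make_post(post_id, who, when, modality, text):
    """Build one <post> element; the attribute set is an assumption."""
    post = ET.Element(
        f"{{{TEI_NS}}}post",
        {
            f"{{{XML_NS}}}id": post_id,  # serialized as xml:id
            "who": f"#{who}",            # pointer to a participant id
            "when": when,                # ISO 8601 timestamp
            "type": modality,            # hypothetical modality label
        },
    )
    paragraph = ET.SubElement(post, f"{{{TEI_NS}}}p")
    paragraph.text = text
    return post


def make_session(posts):
    """Wrap posts in a <div type="session"> (hypothetical grouping)."""
    div = ET.Element(f"{{{TEI_NS}}}div", {"type": "session"})
    div.extend(posts)
    return div


# Placeholder content, invented for illustration only.
session = make_session([
    make_post("post1", "learner1", "2013-01-01T10:00:00", "textchat",
              "placeholder correction"),
    make_post("post2", "learner2", "2013-01-01T10:00:05", "textchat",
              "placeholder confirmation request"),
])

xml_text = ET.tostring(session, encoding="unicode")
print(xml_text)
```

Generating the fragment programmatically rather than by hand keeps namespaces and attribute names uniform across posts, which is the kind of consistency a shared encoding model is meant to provide.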