Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union
Centre Virtuel de la Connaissance sur l’Europe (CVCE), Luxembourg 
Florentina Armaselu (DHLab) - florentina.armaselu@cvce.eu
Verónica Martins (EIS) - veronica.martins@cvce.eu 
Catherine Jones (DHLab) - catherine.jones@cvce.eu 
www.cvce.eu 
Exploring Historical Sources with Language Technology: Results and Perspectives 
Huygens ING, The Hague, December 8-9, 2014
Summary 
1. About the CVCE 
2. Overview of the WEU-DIPLO project 
3. XML-TEI Encoding 
4. Named Entity Recognition (NER) 
5. Corpus Analysis 
6. Future work 
7. References 
CVCE - Centre Virtuel de la Connaissance sur l'Europe
An interdisciplinary centre for e-research and documentation on the European integration process.
Two key areas of activity:
- Interdisciplinary research on the European integration process in the 20th and 21st centuries;
- Research, development and integration of digital tools and methods to support advances in European Integration Studies.
Overview of the WEU-DIPLO project 
1. Goal: XML-TEI encoding, corpus analysis and Web publication of institutional documents of the W.E.U. (Western European Union):
• Topics: armament production, standardization and control in the period from 1954 to 1982;
• Source: Archives nationales de Luxembourg, W.E.U. collection.
2. Format: 
• digitized versions (JPEG) of typewritten materials (one file per page). 
3. Size: 
Category          Documents                          Pages
                  Total   EN    FR    FR proc.*      Total   EN    FR    FR proc.*
Note                89    43    46      34            395    191   204     144
Minutes             30    15    15      15            256    138   118     118
Memorandum           3     1     2       2             16      7     9       9
Study                2     0     2       1             12      0    12       8
Discourse            1     0     1       0              4      0     4       0
Draft protocol       2     1     1       0              4      2     2       0
Total              127    60    67      52            687    338   349     279
*proc. = processed
Overview of the WEU-DIPLO project 
5. Corpus Selection
• Form and content
o Form
  - OCR experiment conditions: need to diversify the form of the documents;
  - Bilingual.
o Content
  - The archives' 30-year rule corresponds to the 30-year time span of the corpus (1954-1982), with documents selected from the 1950s, 1960s, 1970s and 1980s;
  - Case study: armaments production and control within the WEU;
  - Selection based on the research question and more specific topics: French and British positions, the WEU's role/competences, nature of the debates within the Council/Standing Armaments Committee;
  - Need for the documents to cover all the available material categories (minutes, notes, memorandum…).
• Resources
o Limited time and human resources.
Overview of the WEU-DIPLO project: examples ©WEU-UEO 
Memorandum 
Minutes 
Study 
Notes 
Overview of the WEU-DIPLO project: workflow 
XML-TEI Encoding: WEU-DIPLO 
Why TEI encoding? 
• structured and retrievable metadata (title, author, place of origin of the document, availability date, confidentiality status, document reference, etc.);
• clear representation of the document structure (header, footer, divisions: section, subsection, paragraph, line);
• identification of semantic elements (discourse of country representatives; entities: names of organisations, persons, places, functions, dates, etc.).
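The three motivations above (metadata, structure, semantic elements) can be illustrated with a minimal sketch that builds a small TEI document in Python. The element names (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc, div, p, orgName) are standard TEI; the helper names, the sample values and the choice of which metadata to include are assumptions made for illustration, not the project's actual encoding scheme.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def add_paragraph(div, segments):
    """Append a <p> with mixed content; segments are (text, tag or None)."""
    p = ET.SubElement(div, tei("p"))
    last = None
    for text, tag in segments:
        if tag is None:
            if last is None:
                p.text = (p.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(p, tei(tag))
            last.text = text
    return p


def build_document(title, author, reference, paragraphs):
    """Build a TEI tree: metadata in the header, structure and entities in the body."""
    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    ET.SubElement(title_stmt, tei("author")).text = author
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("idno")).text = reference
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "Hypothetical source description"
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    div = ET.SubElement(body, tei("div"), {"type": "section"})
    for segments in paragraphs:
        add_paragraph(div, segments)
    return root


# Hypothetical sample values, not taken from the WEU-DIPLO corpus.
doc = build_document(
    "Sample note",
    "Hypothetical author",
    "REF-0001",
    [[("Le ", None), ("Conseil", "orgName"), (" se réunit.", None)]],
)
print(ET.tostring(doc, encoding="unicode"))
```

Keeping entities as inline elements inside the paragraph is what later allows an index or a partition to be built by entity type or by speaker.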
XML-TEI Encoding: WEU-DIPLO metadata, structure 
XML-TEI Encoding: WEU-DIPLO semantics 
Named Entity Recognition (NER): GATE - https://gate.ac.uk/ 
Named Entity Recognition (NER/GATE): WEU-DIPLO 
Named Entity Recognition (NER/XML-TEI): WEU-DIPLO 
Corpus Analysis: TXM – http://textometrie.ens-lyon.fr/ 
Corpus Analysis - TXM: WEU-DIPLO 
Corpus WEU-DIPLO: 52 documents, French; 6905 items for 101965 occurrences (content and metadata); 6417 items for 74287 occurrences (content only).
Lexicon (functional/lexical forms; lemmatised, POS-tagged, lower case)
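As a rough illustration of what the "items" and "occurrences" figures measure, the sketch below counts distinct lower-cased forms and total tokens in a short text. It assumes that "items" means distinct forms, uses a simple letter-based tokeniser, and does no lemmatisation or POS tagging, so it is not how TXM builds its lexicon; the sample sentence is invented.

```python
import re
from collections import Counter


def lexicon(text):
    """Map each lower-cased word form to its number of occurrences."""
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


# Invented sample, not taken from the corpus.
sample = "Le Conseil procédera à un examen attentif. Le Conseil sera informé."
lex = lexicon(sample)
items = len(lex)                 # number of distinct forms
occurrences = sum(lex.values())  # number of tokens
print(items, occurrences)
```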
Corpus Analysis - TXM: WEU-DIPLO 
• WEU-DIPLO: content – Index (by type of entity) 
Corpus Analysis: WEU-DIPLO 
• Participants description 
Corpus Analysis – TXM: WEU-DIPLO 
Partition: representatives’ discourse by country/organisation 
Corpus Analysis - TXM: WEU-DIPLO 
Specificities 
• Specificity score (log10):
o overuse (+)/deficit (-) of a form in a part/subcorpus as compared with the parent corpus and a threshold.
• Statistical model (Lafon, 1980), based on the hypergeometric distribution:
$$\mathrm{Prob}(X = k) = \frac{\binom{f}{k}\binom{T-f}{t-k}}{\binom{T}{t}}$$
Where: T = number of occurrences in the parent corpus;
t = number of occurrences in a part/subcorpus;
f = frequency of a form F in the parent corpus;
X = random variable taking the values 0, 1, 2, …, k, …, f;
Prob(X = k) = probability that F occurs k times in the part/subcorpus of size t.
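A minimal sketch of how such a score could be computed, assuming the hypergeometric model generally associated with Lafon's specificities and using the symbols T, t, f and k defined above. The sign convention (positive when the observed count is at or above the expected count f·t/T, negative otherwise) and the use of cumulative tail probabilities are assumptions; TXM's own implementation and thresholds may differ.

```python
from math import comb, log10


def hypergeom_pmf(k, f, t, T):
    """Prob(X = k): a form with f occurrences in a corpus of size T
    occurs exactly k times in a part of size t."""
    if k < 0 or k > f or t - k < 0 or t - k > T - f:
        return 0.0
    return comb(f, k) * comb(T - f, t - k) / comb(T, t)


def specificity(k, f, t, T):
    """Signed log10 score: positive for overuse, negative for deficit."""
    lo, hi = max(0, t - (T - f)), min(f, t)
    if not lo <= k <= hi:
        raise ValueError("k is outside the possible range for these sizes")
    total = comb(T, t)
    if k * T >= f * t:
        # Observed count at or above the expected value: upper tail Prob(X >= k).
        tail = sum(comb(f, i) * comb(T - f, t - i) for i in range(k, hi + 1))
        return log10(total) - log10(tail)
    # Observed count below the expected value: lower tail Prob(X <= k).
    tail = sum(comb(f, i) * comb(T - f, t - i) for i in range(lo, k + 1))
    return log10(tail) - log10(total)


# Toy numbers, not taken from the corpus: T=10, t=5, f=4.
print(specificity(4, 4, 5, 10))  # all 4 occurrences fall in the part
print(specificity(0, 4, 5, 10))  # none of them do
```

The tails are summed as exact integers and only then converted with log10, which avoids floating-point underflow when the probabilities become very small on a corpus of realistic size.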
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: by part of speech 
Corpus Analysis - TXM: WEU-DIPLO
Specificities: by part of speech (Verb) 
Corpus Analysis - TXM: WEU-DIPLO 
Specificities (Verb) by representatives and mode/tense (Grevisse, 1993).

Representative: France
Mode / Tense: CONDITIONAL: attenuation (wish, advice, necessity, certainty)
Forms: serait (37); aurait (19); seraient (17); pourrait (16); devrait (13); voudrais (11); …
Examples: le gouvernement français serait partisan d'accélérer …; cette réunion se déroulerait selon la formule …; qu'il ne faudrait pas trop ralentir l'opération envisagée …

Representative: UK delegation
Mode / Tense: PAST PARTICIPLE: passive/past perfect, adjectives
Forms: été (20); donné (6); destinés (5); placées (5); établi (4); révisé (4); chargé (3); …
Examples: le produit final devrait être mis à la disposition de …; les accords auxquels elles ont abouti n'ont pas encore donné de résultats suffisamment …; projectiles nucléaires destinés à ces armes …

Representative: C.P.A. (Comité permanent des armements)
Mode / Tense: SIMPLE PAST: narration, succession of past actions
Forms: exposa (1); fut (1); intervint (1); posa (1); prirent (1); soutinrent (1); …
Examples: une première proposition (belge) tendit à la réunion des hautes autorités …; luxembourg et france soutinrent, sans insistance, ce point de vue …; les pays-bas prirent la même attitude …

Representative: A.C.A. (Agence pour le contrôle des armements)
Mode / Tense: IMPERFECT: description, explanation
Forms: était (11); avait (4); étaient (3); présidait (2); affectait (1); ajoutait (1); dépasseraient (1); …
Examples: le retrait des forces françaises de l’organisation intégrée de l’o.t.a.n. n'affectait nullement l'exécution des tâches …; il est bien évident que, s’il était adopté, il cesserait d’être inexact …; il résultait de cette étude que " le problème du stockage des armes nucléaires …

Representative: Conseil de l'U.E.O.
Mode / Tense: FUTURE: actions/goals to be accomplished
Forms: sera (11); seront (7); pourra (5); devront (3); pourront (3); auront (2); donnera (2); …
Examples: les principes généraux ci-après devront gouverner nos travaux …; cela nous fournira la transition entre les sections a et b de notre mandat …; le conseil procédera à un examen attentif de la …
Corpus Analysis - TXM: WEU-DIPLO 
Concordances: use of conditional, French representatives/name/document 
Corpus Analysis - TXM: WEU-DIPLO 
Context: conditional forms (French representative/Beaumarchais), vo-CR-73-10_FR 
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: by lemma, representatives partition (selection), groupe (contrôle) 
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: by lemma, representatives (selection), groupe (contrôle) - Discussion 
• Predictable results:
o A.C.A. (Agence pour le contrôle des armements) discourse, positive specificity (overuse):
  - contrôle/contrôler/contrôlable, inspection, vérification/vérifier;
  - limitation/limite/limiter, restriction/restreindre/restrictif.
  (A.C.A.'s role)
o UK representatives'/delegation's discourse, negative specificity (scarcity):
  - arme/armement nucléaire/abc/atomique.
  (interested in the topic but not mainly concerned)
• Less predictable results:
o UK and France representatives' discourse, negative specificity:
  - contrôle/contrôler/contrôlable, inspection, vérification/vérifier;
  - A.C.A. - agence pour le contrôle des armements.
  (possible cause: selection of documents in the sample?)
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: by lemma, representatives partition (selection), groupe (standardisation) 
Corpus Analysis - TXM: WEU-DIPLO 
Cooccurrences: for ‘standard*’ sorted by co-frequency 
Corpus Analysis - TXM: WEU-DIPLO 
Concordances: ‘standard*’ – ‘armements’ 
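A concordance of this kind lists each match of a pattern together with its left and right context (keyword in context). The sketch below shows the idea on an invented sentence; the function name, the whitespace tokenisation and the regular expression standing in for the 'standard*' wildcard are assumptions, and TXM's own query language is not reproduced here.

```python
import re


def concordance(tokens, pattern, width=3):
    """Return (left context, keyword, right context) for each matching token."""
    rx = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample sentence, not taken from the corpus.
tokens = "la standardisation des armements reste un objectif commun".split()
hits = concordance(tokens, r"standard.*", width=2)
for left, keyword, right in hits:
    print(f"{left:>20} | {keyword} | {right}")
```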
Corpus Analysis: WEU-DIPLO 
Partition: representatives’ discourse (by name) 
Corpus Analysis - TXM: WEU-DIPLO 
Lexical profile (Guyard, 1981): positive specificities (>2.0), lemmas, names partition

Chauvel (FR)
• Noun: commun; arme; accord d’exécution; recensement; mise; choix; point; centre; opération; déclaration; système d’armes
• Proper noun: -
• Adjective: commun; équitable; secret; suivant
• Verb: procéder
• Adverb: -

Lloyd (UK)
• Noun: pays; discussion; arrangement; coopération; gouvernement britannique; partenaire; estime
• Proper noun: -
• Adjective: bilatéral; déterminé; multilatéral; analogue; final; européen
• Verb: engager; associer; offrir; devoir
• Adverb: -

Destremau (FR)
• Noun: ministre belge; avis; gouvernement français; idée; opération; désir
• Proper noun: -
• Adjective: autonome; américain; industriel
• Verb: falloir; mériter; envisager
• Adverb: trop; pas; ne

Callaghan (UK)
• Noun: gouvernement britannique; doctrine; industrie
• Proper noun: Eurogroupe; M. Van Elslande
• Adjective: -
• Verb: exister
• Adverb: -
Corpus Analysis - TXM: WEU-DIPLO 
Lexical profile (Guyard, 1981): positive specificities, lemmas, names partition - Discussion 
• Chauvel (FR) / Lloyd (UK):
o commun (rank 1) / bilatéral (rank 1)
  - production en commun; programme (régional), intérêt, défense, fonds commun(e)(s)
  - base, discussion, arrangements, comités directeurs bilatéra(l)(le)(ux)
• Destremau (FR) / Callaghan (UK):
o C.P.A. - Comité permanent des armements (specificity score 1.44) / Eurogroupe (rank 1)
(French attempts to revive the C.P.A. / UK's Atlanticist preference: creation of the Eurogroup in 1968, which did not include France).
• Why is standard(isation)(iser) not specific to any individual representative's discourse (by name), although it shows high specificity for the French representatives' discourse as a whole?
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: standard(isation)(iser) lemmas, names partition 
Corpus Analysis - TXM: WEU-DIPLO 
Specificities: standard(isation)(iser) lemmas, documents subtypes partition 
Future work 
1. Corpus analysis and interpretation (in progress). 
2. Choice and adaptation of Web publication platform (in progress):
• EVT (Edition Visualization Technology): http://sourceforge.net/projects/evt-project/
• KILN: http://kiln.readthedocs.org/en/latest/#
• PhiloLOGIC: https://sites.google.com/site/philologic3/home
• XTF: http://xtf.cdlib.org/about/
• TEIBoilerplate: http://dcl.ils.indiana.edu/teibp/
References 
• GATE: https://gate.ac.uk/ 
• Grevisse, Le bon usage. Grammaire française, Duculot, Paris, 1993. 
• Guyard, Marie-Renée. Spécificités d'auteurs dans Le Surréalisme au service de la Révolution. In: Mots, mars 1981, N°2. Qu'est-ce que le vocabulaire spécifique d'un texte politique? pp. 95-122.
• Lafon, Pierre. Sur la variabilité de la fréquence des formes dans un corpus. In: Mots, octobre 1980, N°1. Saussure, Zipf, Lagado, des méthodes, des calculs, des doutes et le vocabulaire de quelques textes politiques. pp. 127-165.
• TEI: http://www.tei-c.org 
• TXM: http://textometrie.ens-lyon.fr/ 

Contenu connexe

En vedette

Esa 2013 presentation (final)
Esa 2013 presentation (final)Esa 2013 presentation (final)
Esa 2013 presentation (final)
Warwick Allen
 
Humanist machine interaction for the digital humanities
Humanist machine interaction for the digital humanitiesHumanist machine interaction for the digital humanities
Humanist machine interaction for the digital humanities
dhlab
 
Algebra de baldor by. aimb
Algebra de baldor by. aimbAlgebra de baldor by. aimb
Algebra de baldor by. aimb
Alex Mindiola
 
Change management - leading people
Change management - leading peopleChange management - leading people
Change management - leading people
Clarkson Alliance
 
Termetet dhe energjia e valeve sizmike
Termetet dhe energjia e valeve sizmikeTermetet dhe energjia e valeve sizmike
Termetet dhe energjia e valeve sizmike
Mirsad
 

En vedette (18)

Advertising / Off-line
Advertising / Off-lineAdvertising / Off-line
Advertising / Off-line
 
Institutional and product videos
Institutional and product videosInstitutional and product videos
Institutional and product videos
 
Analitica Latin America Exhibition 2013
Analitica Latin America Exhibition 2013Analitica Latin America Exhibition 2013
Analitica Latin America Exhibition 2013
 
Esa 2013 presentation (final)
Esa 2013 presentation (final)Esa 2013 presentation (final)
Esa 2013 presentation (final)
 
TS Quick Start Guide2013
TS Quick Start Guide2013TS Quick Start Guide2013
TS Quick Start Guide2013
 
TEI Conference - CVCE
TEI Conference - CVCETEI Conference - CVCE
TEI Conference - CVCE
 
CUbRIK Summer School RHodes histoGraph
CUbRIK Summer School RHodes histoGraphCUbRIK Summer School RHodes histoGraph
CUbRIK Summer School RHodes histoGraph
 
Silviu
SilviuSilviu
Silviu
 
Silviu
SilviuSilviu
Silviu
 
History of Europe demo at IEEE MMSP 2013
History of Europe demo at IEEE MMSP 2013History of Europe demo at IEEE MMSP 2013
History of Europe demo at IEEE MMSP 2013
 
Humanist machine interaction for the digital humanities
Humanist machine interaction for the digital humanitiesHumanist machine interaction for the digital humanities
Humanist machine interaction for the digital humanities
 
Algebra de baldor by. aimb
Algebra de baldor by. aimbAlgebra de baldor by. aimb
Algebra de baldor by. aimb
 
HistoGraph presentation Insa de Lyon
HistoGraph presentation Insa de LyonHistoGraph presentation Insa de Lyon
HistoGraph presentation Insa de Lyon
 
Change management - leading people
Change management - leading peopleChange management - leading people
Change management - leading people
 
Google scholar
Google scholarGoogle scholar
Google scholar
 
Creating Blog
Creating BlogCreating Blog
Creating Blog
 
Termetet dhe energjia e valeve sizmike
Termetet dhe energjia e valeve sizmikeTermetet dhe energjia e valeve sizmike
Termetet dhe energjia e valeve sizmike
 
Europe’s Beginnings through the Looking Glass: Publishing Historical Document...
Europe’s Beginnings through the Looking Glass: Publishing Historical Document...Europe’s Beginnings through the Looking Glass: Publishing Historical Document...
Europe’s Beginnings through the Looking Glass: Publishing Historical Document...
 

Similaire à Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union

Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Christophe Tricot
 
The High Frequency Receiver Function
The High Frequency Receiver FunctionThe High Frequency Receiver Function
The High Frequency Receiver Function
Tracy Huang
 

Similaire à Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union (15)

Embedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PTEmbedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PT
 
The CIDOC CRM Family and LOD
The CIDOC CRM Family and LODThe CIDOC CRM Family and LOD
The CIDOC CRM Family and LOD
 
PRESSoo: A formal ontology for continuing resources
PRESSoo: A formal ontology for continuing resourcesPRESSoo: A formal ontology for continuing resources
PRESSoo: A formal ontology for continuing resources
 
grammer genration
grammer genration grammer genration
grammer genration
 
Modeling and Querying Greek Legislation using Semantic Web Technologies
Modeling and Querying Greek Legislation using Semantic Web TechnologiesModeling and Querying Greek Legislation using Semantic Web Technologies
Modeling and Querying Greek Legislation using Semantic Web Technologies
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
 
The Role of Ontology in the Era of Big Military Data
The Role of Ontology in the Era of Big Military DataThe Role of Ontology in the Era of Big Military Data
The Role of Ontology in the Era of Big Military Data
 
Academic and professional written genres - By Giovanni Parodi
Academic and professional written genres - By Giovanni ParodiAcademic and professional written genres - By Giovanni Parodi
Academic and professional written genres - By Giovanni Parodi
 
Fantoni Urgo - Cirp Dictionary
Fantoni Urgo - Cirp DictionaryFantoni Urgo - Cirp Dictionary
Fantoni Urgo - Cirp Dictionary
 
Clustering over the cultural heritage linked open dataset xlendi shipwreck
Clustering over the cultural heritage linked open dataset xlendi shipwreckClustering over the cultural heritage linked open dataset xlendi shipwreck
Clustering over the cultural heritage linked open dataset xlendi shipwreck
 
Language tools bne-5-10-2011
Language tools bne-5-10-2011Language tools bne-5-10-2011
Language tools bne-5-10-2011
 
The High Frequency Receiver Function
The High Frequency Receiver FunctionThe High Frequency Receiver Function
The High Frequency Receiver Function
 
Barbiers iclave-fr
Barbiers iclave-frBarbiers iclave-fr
Barbiers iclave-fr
 
EAA2013 Archaeological Recording Methods - How Many Archaeologists does it t...
 EAA2013 Archaeological Recording Methods - How Many Archaeologists does it t... EAA2013 Archaeological Recording Methods - How Many Archaeologists does it t...
EAA2013 Archaeological Recording Methods - How Many Archaeologists does it t...
 
Automatic Transcription Of English Connected Speech Phenomena
Automatic Transcription Of English Connected Speech PhenomenaAutomatic Transcription Of English Connected Speech Phenomena
Automatic Transcription Of English Connected Speech Phenomena
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union

  • 1. Text Encoding and Enrichment for Linguistic Analysis: Archives on the policy of Armaments within Western European Union Centre Virtuel de la Connaissance sur l’Europe (CVCE), Luxembourg Florentina Armaselu (DHLab) -florentina.armaselu@cvce.eu Verónica Martins (EIS) - veronica.martins@cvce.eu Catherine Jones (DHLab) - catherine.jones@cvce.eu 1 www.cvce.eu Exploring Historical Sources with Language Technology: Results and Perspectives Huygens ING , The Hague, December 8, 9, 2014
  • 2. Summary 1. About the CVCE 2. Overview of the WEU-DIPLO project 3. XML-TEI Encoding 4. Named Entity Recognition (NER) 5. Corpus Analysis 6. Future work 7. References Summary 2
  • 3. CVCE - Centre Virtuel de la Connaissance sur l'Europe An interdisciplinary centre of e-research and documentation on the European Integration Process. two key areas of activity: - Interdisciplinary research on the European integration process in the XX and XXI centuries; - Research, development and integration of digital tools and methods to support advancement in European Integration Studies. About the CVCE 3
  • 4. Overview of the WEU-DIPLO project 1. Goal: XML-TEI encoding, corpus analysis and Web publication of institutional documents of the W.E.U. (Western European Union): • Topics: armament production, standardization, control in the period from 1954 to 1982; • Source: Archives nationales de Luxembourg, W.E.U collection. 2. Format: • digitized versions (JPEG) of typewritten materials (one file per page). 3. Size: Category Number of documents Note 89 43 46 34 395 191 204 144 Minutes 30 15 15 15 256 138 118 118 Memorandum 3 1 2 2 16 7 9 9 Study 2 0 2 1 12 0 12 8 Discourse 1 0 1 0 4 0 4 0 Draft protocol 2 1 1 0 4 2 2 0 Total 127 60 67 52 687 338 349 279 *proc. = processed Number of documents per language Number of pages Number of pages per language EN FR FR proc.* EN FR FR proc.* Overview WEU-DIPLO 4
  • 5. Overview of the WEU-DIPLO project 5. Corpus Selection • Form and content  Form  OCR experiment conditions - need to diversify the form of the documents;  Bilingual.  Content  Archives’ 30 years rule corresponds with 30 years time period for the corpus (1954-1982)-selection of documents from the 1950’s, 1960’s, 1970’s and 1980’s;  Case study: Armaments production and control within WEU  Selection based on research question and more specific topics: French and British positions, WEU’s role/competences, nature of the debates within the Council/Standing Armament Committee;  Need for the documents to cover all the available material categories (minutes, notes, memorandum…). • Resources  limited time and human resources. Overview WEU-DIPLO 5
  • 6. Overview of the WEU-DIPLO project: examples ©WEU-UEO Memorandum Minutes Study Notes Overview WEU-DIPLO 6
  • 7. Overview of the WEU-DIPLO project: workflow Overview WEU-DIPLO 7
  • 8. XML-TEI Encoding: WEU-DIPLO Why TEI encoding? • structured and retrievable metadata (title, author, origin place of document, availability date, confidentiality status, document reference, etc.); • clear representation of the document structure (header, footer, divisions – section, subsection, paragraph, line); • identification of semantic elements (discourse of countries representatives, entities: names of organisations, persons, places, functions, dates, etc.). XML-TEI: WEU-DIPLO 8
  • 9. XML-TEI Encoding: WEU-DIPLO metadata, structure XML-TEI: WEU-DIPLO 9
  • 10. XML-TEI Encoding: WEU-DIPLO semantics XML-TEI: WEU-DIPLO 10
  • 11. Named Entity Recognition (NER): GATE - https://gate.ac.uk/ XML-TEI: WEU-DIPLO 11
  • 12. Named Entity Recognition (NER/GATE): WEU-DIPLO NER/GATE: WEU-DIPLO 12
  • 13. Named Entity Recognition (NER/XML-TEI): WEU-DIPLO NER/XML-TEI: WEU-DIPLO 13
  • 14. Corpus Analysis: TXM – http://textometrie.ens-lyon.fr/ Corpus Analysis: TXM - Textométrie 14
  • 15. Corpus Analysis - TXM: WEU-DIPLO Corpus WEU-DIPLO: 52 documents, French, 6905 items for 101965 occurrences (content and metadata); 6417 items for 74287 occurrences (content) – Lexicon (functional /lexical forms) (lemmatised, POS tagged, lower case) Corpus Analysis - TXM: WEU-DIPLO 15
  • 16. Corpus Analysis - TXM: WEU-DIPLO • WEU-DIPLO: content – Index (by type of entity) Corpus Analysis - TXM: WEU-DIPLO 16
  • 17. Corpus Analysis: WEU-DIPLO • Participants description Corpus Analysis: WEU-DIPLO 17
  • 18. Corpus Analysis – TXM: WEU-DIPLO Partition: representatives’ discourse by country/organisation Corpus Analysis - TXM: WEU-DIPLO 18
  • 19. Corpus Analysis - TXM: WEU-DIPLO Specificities • Specificity score (log10 ): o overuse (+)/deficit (-) of a form in a part/subcorpus as compared with the parent corpus and a threshold. • Statistical model (Lafon, 1980): Where: T = number of occurrences in the parent corpus; t = number of occurrences in a part/subcorpus; f = frequency of a form F in the parent corpus; X = variable of value 0, 1, 2, …, k, …, f; Prob (X=K) = probability that F occurs k times in the part/subcorpus of size t. Corpus Analysis - TXM: WEU-DIPLO 19
  • 20. Corpus Analysis - TXM: WEU-DIPLO Specificities: by part of speech Corpus Analysis - TXM: WEU-DIPLO 20
  • 21. Corpus Analysis - TXM: WEU-DIPLO – Specificities: by part of speech (Verb) Corpus Analysis - TXM: WEU-DIPLO 21
  • 22. Corpus Analysis - TXM: WEU-DIPLO Specificities (Verb) by representatives and mode/tense (Grevisse, 1993). Representative Mode / Tense France CONDITIONAL: attenuation (wish, advice, necessity, certainty) Forms: serait (37); aurait (19); seraient (17); pourrait (16); devrait (13), voudrais (11); … Exemples: le gouvernement français serait partisan d'accélérer …; cette réunion se déroulerait selon la formule …; qu'il ne faudrait pas trop ralentir l'opération envisagée … UK delegation PAST PARTICIPLE: passive/past perfect, adjectives Forms: été (20); donné (6); destinés (5); placées (5); établi (4); révisé (4); chargé (3); … Exemples: le produit final devrait être mis à la disposition de …; les accords auxquels elles ont abouti n'ont pas encore donné de résultats suffisamment …; projectiles nucléaires destinés à ces armes … C.P.A. (Comité permanent des armements) SIMPLE PAST: narration, succession of past actions Forms: exposa (1); fut (1); intervint (1); posa (1); prirent (1); soutinrent (1); … Exemples: une première proposition (belge) tendit à la réunion des hautes autorités …; luxembourg et france soutinrent, sans insistance, ce point de vue …; les pays-bas prirent la même attitude … A.C.A. (Agence pour le contrôle des armements) IMPERFECT: description, explanation Forms: était (11); avait (4); étaient (3); présidait (2); affectait (1); ajoutait (1); dépasseraient (1); … Exemples: le retrait des forces françaises de l’organisation intégrée de l’o.t.a.n. n'affectait nullement l'exécution des tâches …; il est bien évident que, s’il était adopté, il cesserait d’être inexact …; il résultait de cette étude que " le problème du stockage des armes nucléaires … Conseil de l'U.E.O. 
FUTURE: actions/goals to be accomplished Forms: sera (11); seront (7); pourra (5); devront (3); pourront (3); auront (2); donnera (2), … Exemples: les principes généraux ci-après devront gouverner nos travaux …; cela nous fournira la transition entre les sections a et b de notre mandat …; le conseil procédera à un examen attentif de la … Corpus Analysis - TXM: WEU-DIPLO 22
  • 23. Corpus Analysis - TXM: WEU-DIPLO. Concordances: use of conditional, French representatives/name/document.
  • 24. Corpus Analysis - TXM: WEU-DIPLO. Context: conditional forms (French representative/Beaumarchais), vo-CR-73-10_FR.
  • 25. Corpus Analysis - TXM: WEU-DIPLO. Specificities: by lemma, representatives partition (selection), group (contrôle).
  • 26. Corpus Analysis - TXM: WEU-DIPLO. Specificities: by lemma, representatives (selection), group (contrôle). Discussion
      • Predictable results:
         o The discourse of the A.C.A. (Agence pour le contrôle des armements) shows positive specificity (overuse) for:
            - contrôle/contrôler/contrôlable; inspection; vérification/vérifier;
            - limitation/limite/limiter; restriction/restreindre/restrictif.
           (consistent with the A.C.A.'s role)
         o The discourse of the UK representatives/delegation shows negative specificity (scarcity) for:
            - arme/armement nucléaire/abc/atomique.
           (interested in the topic but not mainly concerned)
      • Less predictable results:
         o The discourse of the UK and France representatives shows negative specificity for:
            - contrôle/contrôler/contrôlable; inspection; vérification/vérifier;
            - A.C.A. (agence pour le contrôle des armements).
           (possible cause: selection of documents in the sample?)
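The positive and negative specificities discussed on this slide can be illustrated with a minimal sketch of a hypergeometric specificity score in the spirit of Lafon (1980). This is an illustration under stated assumptions, not the TXM implementation: the function name `specificity`, the use of the expected frequency to choose the tail, and the counts in the usage example are all hypothetical, and none of the figures come from the WEU-DIPLO corpus.

```python
from math import comb, log10


def specificity(f_part, f_total, part_size, corpus_size):
    """Signed log10 specificity of a word in one part of a partition.

    f_part      -- occurrences of the word in the part
    f_total     -- occurrences of the word in the whole corpus
    part_size   -- number of tokens in the part
    corpus_size -- number of tokens in the whole corpus

    Positive result: the word is over-represented in the part.
    Negative result: the word is under-represented in the part.
    """
    # Range of counts that are possible under the hypergeometric model.
    k_min = max(0, part_size - (corpus_size - f_total))
    k_max = min(f_total, part_size)

    def weight(k):
        # Number of ways to draw a part containing exactly k occurrences.
        return comb(f_total, k) * comb(corpus_size - f_total, part_size - k)

    total = comb(corpus_size, part_size)
    # Assumption: the tail is chosen by comparing with the expected count.
    expected = f_total * part_size / corpus_size

    if f_part > expected:
        # Overuse: probability of observing f_part or more occurrences.
        tail = sum(weight(k) for k in range(f_part, k_max + 1))
        return -(log10(tail) - log10(total))
    # Scarcity: probability of observing f_part or fewer occurrences.
    tail = sum(weight(k) for k in range(k_min, f_part + 1))
    return log10(tail) - log10(total)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(specificity(f_part=10, f_total=10, part_size=100, corpus_size=1000))
    print(specificity(f_part=0, f_total=10, part_size=500, corpus_size=1000))
```

Under this convention, a score above 2.0 means that the observed overuse would occur with probability below 0.01 if the word were distributed evenly across the parts. Working with integer binomial coefficients and taking logarithms only at the end avoids floating-point underflow for very small probabilities.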
  • 27. Corpus Analysis - TXM: WEU-DIPLO. Specificities: by lemma, representatives partition (selection), group (standardisation).
  • 28. Corpus Analysis - TXM: WEU-DIPLO. Cooccurrences: for ‘standard*’ sorted by co-frequency.
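As a rough illustration of what a co-occurrence list sorted by co-frequency contains, the sketch below counts the words found within a fixed window around every token matching a pivot pattern. It is an assumption-laden toy, not TXM's cooccurrence command: the function name `cooccurrences`, the window size, the regular expression `standard.*` (standing in for the wildcard query ‘standard*’) and the sample sentence are all hypothetical.

```python
import re
from collections import Counter


def cooccurrences(tokens, pattern, window=5):
    """Return (word, co-frequency) pairs, most frequent first.

    tokens  -- list of word forms (or lemmas) in corpus order
    pattern -- regular expression that a pivot token must fully match
    window  -- number of tokens considered on each side of the pivot
    """
    pivot = re.compile(pattern)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if pivot.fullmatch(tok):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            # Count context words, skipping other pivot matches.
            counts.update(w for w in left + right if not pivot.fullmatch(w))
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the corpus.
    sample = ("la standardisation des armements et la production "
              "des armements standardisés").split()
    for word, freq in cooccurrences(sample, r"standard.*", window=3):
        print(word, freq)
```

Sorting by raw co-frequency tends to put function words at the top of the list, which is one reason co-occurrence tools usually also report a statistical score alongside the count.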
  • 29. Corpus Analysis - TXM: WEU-DIPLO. Concordances: ‘standard*’ and ‘armements’.
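A concordance of this kind can be approximated with a small keyword-in-context (KWIC) routine. The sketch below is only meant to show the idea: it lists each token matching a pivot pattern with its left and right context, and optionally keeps only the lines whose context contains a second word. The function name `kwic`, its parameters and the sample sentence are hypothetical and do not reproduce TXM's query syntax.

```python
import re


def kwic(tokens, pattern, width=4, near=None):
    """Return (left context, pivot, right context) triples.

    tokens  -- list of word forms in corpus order
    pattern -- regular expression that a pivot token must fully match
    width   -- number of context tokens kept on each side
    near    -- if given, keep only lines whose context contains this word
    """
    pivot = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if not pivot.fullmatch(tok):
            continue
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        if near is not None and near not in left + right:
            continue
        lines.append((" ".join(left), tok, " ".join(right)))
    return lines


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the corpus.
    sample = ("la standardisation des armements reste un objectif "
              "mais aucun standard commun existe").split()
    for left, key, right in kwic(sample, r"standard.*", near="armements"):
        print(f"{left:>30} | {key} | {right}")
```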
  • 30. Corpus Analysis: WEU-DIPLO. Partition: representatives’ discourse (by name).
  • 31. Corpus Analysis - TXM: WEU-DIPLO. Lexical profile (Guyard, 1981): positive specificities (>2.0), lemmas, names partition. Entries are listed by name and part of speech; "-" marks an empty cell.
      o Chauvel (FR)
         - Noun: commun; arme; accord d’exécution; recensement; mise; choix; point; centre; opération; déclaration; système d’armes
         - Proper noun: -
         - Adjective: commun; équitable; secret; suivant
         - Verb: procéder
         - Adverb: -
      o Lloyd (UK)
         - Noun: pays; discussion; arrangement; coopération; gouvernement britannique; partenaire; estime
         - Proper noun: -
         - Adjective: bilatéral; déterminé; multilatéral; analogue; final; européen
         - Verb: engager; associer; offrir; devoir
         - Adverb: -
      o Destremau (FR)
         - Noun: ministre belge; avis; gouvernement français; idée; opération; désir
         - Proper noun: -
         - Adjective: autonome; américain; industriel
         - Verb: falloir; mériter; envisager
         - Adverb: trop; pas; ne
      o Callaghan (UK)
         - Noun: gouvernement britannique; doctrine; industrie
         - Proper noun: Eurogroupe; M. Van Elslande
         - Adjective: -
         - Verb: exister
         - Adverb: -
  • 32. Corpus Analysis - TXM: WEU-DIPLO. Lexical profile (Guyard, 1981): positive specificities, lemmas, names partition. Discussion
      • Chauvel (FR) / Lloyd (UK):
         o commun (rank 1) / bilatéral (rank 1)
            - production en commun; programme (régional), intérêt, défense, fonds commun(e)(s)
            - base, discussion, arrangements, comités directeurs bilatéra(l)(le)(ux)
      • Destremau (FR) / Callaghan (UK):
         o C.P.A. (Comité permanent des armements; specificity score 1.44) / Eurogroupe (rank 1)
           (French attempts to revive the C.P.A. versus the UK's Atlanticist preference: the Eurogroup, created in 1968, did not include France.)
      • Open question: why is standard(isation)(iser) not specific to the discourse of any individual representative (by name), although it shows high specificity for the French representatives' discourse as a whole?
  • 33. Corpus Analysis - TXM: WEU-DIPLO. Specificities: standard(isation)(iser) lemmas, names partition.
  • 34. Corpus Analysis - TXM: WEU-DIPLO. Specificities: standard(isation)(iser) lemmas, document subtypes partition.
  • 35. Future work
      1. Corpus analysis and interpretation (in progress).
      2. Choice and adaptation of a Web publication platform (in progress). Candidates:
         o EVT (Edition Visualization Technology): http://sourceforge.net/projects/evt-project/
         o KILN: http://kiln.readthedocs.org/en/latest/#
         o PhiloLOGIC: https://sites.google.com/site/philologic3/home
         o XTF: http://xtf.cdlib.org/about/
         o TEIBoilerplate: http://dcl.ils.indiana.edu/teibp/
  • 36. References
      • GATE: https://gate.ac.uk/
      • Grevisse. Le bon usage. Grammaire française. Duculot, Paris, 1993.
      • Guyard, Marie-Renée. Spécificités d'auteurs dans Le Surréalisme au service de la Révolution. In: Mots, N°2, March 1981 (Qu'est-ce que le vocabulaire spécifique d'un texte politique?), pp. 95-122.
      • Lafon, Pierre. Sur la variabilité de la fréquence des formes dans un corpus. In: Mots, N°1, October 1980 (Saussure, Zipf, Lagado, des méthodes, des calculs, des doutes et le vocabulaire de quelques textes politiques), pp. 127-165.
      • TEI: http://www.tei-c.org
      • TXM: http://textometrie.ens-lyon.fr/