CICLing 2013
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Extraction
Béatrice Daille (1), Helena Blancafort (2)
(1) University of Nantes – Lina, (2) Syllabs

Abstract
We present two terminology extraction tools to compare a knowledge-poor and a knowledge-rich
approach. Both tools process single-word terms (SWT) and multi-word terms (MWT) and are designed to
handle multilingualism. We run an evaluation on 6 languages and 2 domains using crawled comparable
corpora and hand-crafted reference term lists (RTL). We discuss the 3 main results achieved for
terminology extraction. The first two evaluation scenarios concern the knowledge-rich framework.
Scenario 1 (S1) compares performance for each language depending on the ranking applied: specificity
score vs. number of occurrences. Scenario 2 (S2) examines whether term variant identification improves
the precision of the ranking for each language. Scenario 3 (S3) compares both tools and demonstrates
that a probabilistic term extraction approach, developed with minimal effort, achieves satisfactory
results when compared to a rule-based method.

Knowledge-Rich Framework (KR)
1. Linguistic processing: tokenization, POS tagging, lemmatization (TreeTagger)
2. Rule-based candidate term (CT) extraction based on POS tags
3. Hand-crafted rules for the grouping of variants
4. Multilingual framework: 6 languages

Knowledge-Poor Approach (KP)
1. Training of a pseudo POS tagger (Clark, 2003) with raw corpora (250 MB to 2.5 GB)
2. CRFs (Lafferty et al., 2001) to train a term candidate extractor (Sha et al., Guégan and Loupy, 2011)
   using small corpora manually annotated with CT (300 to 600 sentences)

Shared Features
Extraction of SWT and MWT, ranking based on specificity score
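
The poster does not define the specificity score. As a minimal sketch, the example below assumes it is the ratio of a term's relative frequency in the domain corpus to its relative frequency in a general-language corpus, and contrasts the resulting ranking with a plain occurrence ranking. The function names, the smoothing constant, and the toy counts are illustrative assumptions, not taken from either tool.

```python
from collections import Counter


def relative_freq(counts, term):
    """Occurrences of `term` divided by the total number of counted items."""
    total = sum(counts.values())
    return counts[term] / total if total else 0.0


def specificity(term, domain_counts, general_counts, smoothing=1e-9):
    """Assumed definition: domain relative frequency / general relative frequency.

    `smoothing` (hypothetical) avoids division by zero for terms that never
    occur in the general corpus.
    """
    return relative_freq(domain_counts, term) / (
        relative_freq(general_counts, term) + smoothing
    )


def rank_by_specificity(domain_counts, general_counts):
    """Candidate terms sorted from most to least domain-specific."""
    return sorted(
        domain_counts,
        key=lambda t: specificity(t, domain_counts, general_counts),
        reverse=True,
    )


def rank_by_occurrences(domain_counts):
    """Candidate terms sorted by raw number of occurrences in the domain corpus."""
    return [term for term, _ in domain_counts.most_common()]


# Toy counts, invented for illustration only.
domain_counts = Counter(
    {"wind turbine": 40, "energy": 60, "system": 50, "rotor blade": 10}
)
general_counts = Counter(
    {"wind turbine": 2, "energy": 300, "system": 900, "rotor blade": 1, "people": 797}
)

print(rank_by_specificity(domain_counts, general_counts))
print(rank_by_occurrences(domain_counts))
```

With these toy counts, frequent but generic words such as "energy" and "system" lead the occurrence ranking, while the specificity ranking moves "wind turbine" and "rotor blade" to the top.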

Resources
• Hand-crafted RTL from 103 to 159 terms with variants
• Monolingual crawled comparable corpora from 220K to 474K tokens

Term Variation
Base term: wind energy
Variants: wind turbine energy, onshore wind energy, energy from wind, small-scale wind energy
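
The hand-crafted variant rules themselves are not shown on the poster. The sketch below is a much-simplified stand-in: it assumes a candidate is a variant of a base term when, after dropping a small hypothetical list of function words, it contains every content word of the base term. This single rule happens to cover the wind energy examples (insertion, expansion, and permutation with a preposition), but real rule sets are language-specific and stricter.

```python
import re

# Hypothetical stop list; a real system would rely on POS tags instead.
FUNCTION_WORDS = {"from", "of", "for", "the", "a", "an", "in", "on"}


def content_words(term):
    """Lower-cased alphabetic tokens of `term`, without function words."""
    tokens = re.findall(r"[a-z]+", term.lower())
    return {tok for tok in tokens if tok not in FUNCTION_WORDS}


def group_variants(base_terms, candidates):
    """Attach each candidate to every base term whose content words it contains."""
    groups = {base: [] for base in base_terms}
    for cand in candidates:
        for base in base_terms:
            if cand != base and content_words(base) <= content_words(cand):
                groups[base].append(cand)
    return groups


candidates = [
    "wind turbine energy",
    "onshore wind energy",
    "energy from wind",
    "small-scale wind energy",
    "solar energy",  # unrelated candidate, added for contrast
]
groups = group_variants(["wind energy"], candidates)
print(groups["wind energy"])
```

Grouping variants this way lets their occurrences be pooled under the base term before ranking, which is the effect scenario S2 evaluates.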

Evaluation (Wind Energy Domain)

[Figure panels: S1: F-measure based on specificity; S1: F-measure based on occurrences;
S2: F-measure based on specificity ranking of CT with and without variants]
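
The poster does not spell out how the F-measure curves are computed. A minimal sketch, assuming the balanced F-measure (F1) of the top-N ranked candidates against the RTL with exact string matching, could look as follows. The function name and the toy lists are illustrative.

```python
def f_measure_at(ranked_candidates, reference_terms, n):
    """Balanced F-measure of the top-`n` candidates against a reference term list.

    Assumptions: exact string match and no duplicate candidates in the ranking.
    """
    top = set(ranked_candidates[:n])
    reference = set(reference_terms)
    hits = len(top & reference)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy ranking and reference list, invented for illustration only.
ranked = ["wind turbine", "rotor blade", "energy", "system"]
rtl = ["wind turbine", "rotor blade", "wind energy"]

for n in (2, 4):
    print(n, round(f_measure_at(ranked, rtl, n), 3))
```

Evaluating the same function on a specificity ranking and on an occurrence ranking, for increasing values of N, gives curves that can be compared in the way scenario S1 describes.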

[Figure panel: S3: F-measure based on specificity, knowledge-rich vs. knowledge-poor]

Conclusions
• S1: Specificity ranking outperforms frequency-of-occurrence ranking.
• S2: Handling term variants improves the ranking of the first candidate terms.
• S3: The knowledge-poor approach provides satisfactory results with minimal effort.
  Results are language and domain dependent:
   - EN gives better results than DE, which shows the limits of the multilingual framework
   - ES gives better results in the mobile domain than in wind energy
   - The KR tool and the RTL handle MWT of 2-3 words, whereas KP extracts longer terms such as
     "small scale domestic wind turbine system"

Future Work
- Evaluate the method using a POS tagger but no hand-crafted rules
 The research leading to these results has received funding from the European Community's FP7/2007-2013 under
grant agreement nº 248005 for the TTC project (Terminology Extraction, Translation Tools and Comparable Corpora)
