Active Annotation of Corpora
Kepa J. Rodriguez
Text Analysis Seminar at the Göttingen Center of Digital Humanities
02.05.2012
Outline



• Goal of the presentation.
• The LUNA corpus.
• Active annotation.
    – Concept
    – Algorithm.
    – Evaluation.
• Potential use of Active Annotation in projects in humanities.
Goal of the presentation



• Introduce concepts of:
    – Active Learning
    – Active Annotation.
• Present its use in the annotation of the LUNA corpus.
• Discuss the utility of Active Annotation for projects in the
  humanities.
The LUNA Corpus (1)
• Corpus consists of:
    – 3000 Human-Human and 8100 WOZ dialogues
    – Multiple annotation levels: POS, entities, coreference, predicate structure, dialogue acts,
        etc.
    – in French, Italian and Polish.
• French subcorpus:
    – Application domains: travel information and reservation, IT help desk, telecom customer
       care and financial information transaction
    – Human-Machine dialogues: 7100
• Italian subcorpus:
    – Application domain: IT helpdesk
    – 2500 Human-Human and 500 WOZ dialogues
• Polish subcorpus:
    – Application domain: public transportation information
    – 500 Human-Human and 500 WOZ dialogues

More information about annotation scheme and levels:
http://www.ist-luna.eu/pdf/schemepresentationPdm.pdf
The LUNA Corpus (2)

[Operator:] allora m'ha detto che [non riusciva]c1 ad [accedere]c2 [al
    computer]c3 e [le manca]c4 [la procedura]c5
so, you have told me that you cannot access the computer, and that you need the
    procedure
          c1 trouble : unable_to
          c2 action : access
          c3 computer-hardware : pc
          c4 trouble : lack_of
          c5 computer-software : procedure
[Caller:] esatto
exactly
[Operator:] allora avrei bisogno [dell' RWS]c6 [del PC]c7
so I need the RWS of the computer
          c6 code-identificationCode : rws
          c7 computer-hardware : pc
[Caller:] si allora [tredici zero ottantasei]c8
yes, 13 0 86
          c8 code-identificationCode-rws : 13086
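
The attribute : value annotation shown above can be held in a small data structure. The following is only a sketch: the class names `Concept` and `Turn` and the helper `labels` are hypothetical and are not taken from the LUNA annotation scheme or its tools.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """One annotated span: surface text plus an attribute : value pair."""
    span_id: str
    text: str
    attribute: str
    value: str


@dataclass
class Turn:
    """A dialogue turn with its annotated concepts (hypothetical layout)."""
    speaker: str
    concepts: list = field(default_factory=list)


# Second operator turn from the example above.
turn = Turn(
    speaker="Operator",
    concepts=[
        Concept("c6", "dell' RWS", "code-identificationCode", "rws"),
        Concept("c7", "del PC", "computer-hardware", "pc"),
    ],
)


def labels(turn):
    """Return the attribute:value labels of a turn, in annotation order."""
    return [f"{c.attribute}:{c.value}" for c in turn.concepts]


print(labels(turn))
```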
Active annotation (1)




Components of the active annotation are:
• Active learning paradigm
   – Selection of examples for annotation.
• Potential error detection
   – Cases in which manual annotation seems to be ambiguous
     or contradictory.
Active annotation (2)

• Active learning paradigm:
   –   Statistical learning based paradigm
   –   A first small set is randomly chosen and manually annotated.
   –   Use this set to train a model and annotate the rest of samples.
   –   Selection of the most informative examples to update the statistical
       model
        • Most informative = lower confidence score


• Use of active learning:
   – Speed-up annotation
   – Support annotators in their work
   – Select examples to be annotated: which examples from a big
     amount of data will be useful for my purposes?
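
The selection rule "most informative = lower confidence score" can be sketched in a few lines. The function name and the confidence values below are made up for illustration; the slides do not say how the LUNA model computes its scores.

```python
def select_least_confident(scored, k):
    """Return the ids of the k examples with the lowest confidence score."""
    ranked = sorted(scored.items(), key=lambda item: item[1])
    return [example_id for example_id, _ in ranked[:k]]


# Hypothetical confidence scores assigned by a model to four dialogues.
confidences = {"d1": 0.91, "d2": 0.35, "d3": 0.78, "d4": 0.12}

print(select_least_confident(confidences, 2))  # ['d4', 'd2']
```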
Active annotation (3)




[Figure: learning curve comparison, active vs. random learning
 (Riccardi and Hakkani-Tür, 2005)]
Active annotation (4)

• Likely error detection:
   – Re-annotate the training data using the statistical model.
   – Extract examples in which manual annotation and automatic
     annotation are different.
   – Send them to human supervision.


• Use of the likely error detection:
   – If manual annotation is correct, example is hard to learn:
       • Analyze which new features can be implemented to enrich the model.
   – If the annotation is erroneous:
       • Correct it.
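
Likely error detection amounts to comparing two label sets over the same training items. A minimal sketch, assuming labels are stored as dictionaries keyed by span id; the automatic labels below are invented for the example.

```python
def likely_errors(manual, automatic):
    """Return ids whose manual and automatic annotations differ."""
    return sorted(
        span_id
        for span_id, label in manual.items()
        if automatic.get(span_id) != label
    )


# Manual labels follow the LUNA example; automatic labels are made up.
manual = {
    "c1": "trouble:unable_to",
    "c3": "computer-hardware:pc",
    "c5": "computer-software:procedure",
}
automatic = {
    "c1": "trouble:unable_to",
    "c3": "computer-software:procedure",
    "c5": "computer-software:procedure",
}

# These ids would be sent to human supervision.
print(likely_errors(manual, automatic))  # ['c3']
```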
Annotation algorithm

1. Randomly select a small number of dialogues and annotate them manually
   from scratch (SL).
2. Train a model M using SL
3. while (labeler/data available)
    a) Use M to automatically annotate the unannotated part of the corpus (Su).
    b) Rank automatically annotated examples of (Su) according to the confidence
       measure given by M
    c) Select a batch of k dialogues with the lowest score (Sk)
    d) Ask for human control/correction on Sk
    e) Use M to automatically annotate SL and produce SaL
    f) Look at the differences between SL and SaL
        i. HARD TO LEARN EXAMPLE: Add new features when training M
        ii. ANNOTATION AMBIGUITIES: Hire human annotators to disambiguate SL
    g) SL = SL + Sk
    h) Train a new model M with SL
    i) Go to 3a
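
The loop above can be sketched end to end with a toy model. Everything here is an assumption made for illustration: the token-count classifier stands in for the statistical model M, the `oracle` function stands in for the human annotator, and the example sentences and their two labels are invented. It is not the implementation used for LUNA.

```python
import random
from collections import Counter, defaultdict


def train(labeled):
    """Toy model M: count how often each token co-occurs with each label."""
    counts = defaultdict(Counter)
    for text, label in labeled.items():
        for token in text.split():
            counts[token][label] += 1
    return counts


def predict(model, text):
    """Return (label, confidence); confidence is the winning vote share."""
    votes = Counter()
    for token in text.split():
        votes.update(model.get(token, {}))
    if not votes:
        return None, 0.0
    label, top = votes.most_common(1)[0]
    return label, top / sum(votes.values())


def active_annotation(pool, oracle, seed_size=2, k=2, rng=None):
    rng = rng or random.Random(0)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    # Step 1: annotate a small random seed from scratch (SL).
    labeled = {t: oracle(t) for t in unlabeled[:seed_size]}
    unlabeled = unlabeled[seed_size:]
    model = train(labeled)  # Step 2
    flagged = []
    while unlabeled:  # Step 3
        # a-c) score Su with M, rank by confidence, take the k lowest (Sk).
        unlabeled.sort(key=lambda t: predict(model, t)[1])
        batch, unlabeled = unlabeled[:k], unlabeled[k:]
        # d) human control: here the oracle simply supplies the label.
        corrected = {t: oracle(t) for t in batch}
        # e-f) re-annotate SL with M and record disagreements for review.
        flagged.append(
            [t for t, y in labeled.items() if predict(model, t)[0] != y]
        )
        labeled.update(corrected)  # g) SL = SL + Sk
        model = train(labeled)  # h) retrain M
    return labeled, model, flagged


# Invented toy data: sentence -> coarse label.
gold = {
    "non riesco ad accedere al computer": "trouble",
    "non riesco a stampare": "trouble",
    "mi manca la procedura": "trouble",
    "avrei bisogno del codice RWS": "code",
    "il codice del PC": "code",
    "tredici zero ottantasei": "code",
}

labeled, model, flagged = active_annotation(gold, gold.__getitem__)
print(len(labeled), "examples annotated in", len(flagged), "rounds")
```

Each entry of `flagged` lists the training items on which M disagreed with the manual annotation in that round, which corresponds to step f.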
Evaluation (2)

•   Annotator point of view:
     – Annotation from scratch: 80-90 minutes/file.
     – Supervision after 3rd active annotation loop: 20-25 min/file.
     – Annotators concentrate more on:
          • Difficult/interesting issues.
          • Giving feedback about the model.


•   Error detection: no statistics.
     – Most of the reported feedback requests were annotation errors.
     – Some of the reported feedback requests were caused by ambiguities and
       helped to add features to enrich the model.
Evaluation (1)
•   Wizard of Oz dialogues
                    Act-turn    Size in turns       Error rate
                        1            200             59.2%
                        2            400             44.4%
                        3            600             39.3%
                        4            800              6.4%
                        5           1200              0.0%
•   Human-human dialogues
                     Act-turn   Size in dialogues    Error rate
                        1              10              71.2%
                        2              20              59.5%
                        3              30              54.0%
                        4              40              51.1%
                        5              60              45.7%
                        6              80              42.4%
Discussion

•   Questions
•   Annotation tasks in the GCDH:
     – Corpus of Coptic Texts.
     – …..
References

•   LUNA project: http://www.ist-luna.eu
•   Raymond, Rodriguez and Riccardi (2008): Active Annotation in the
    LUNA Italian Corpus of Spontaneous Dialogues. In Proceedings of the
    Sixth International Conference on Language Resources and Evaluation
    (LREC 2008). Marrakech, Morocco.
•   Riccardi, G. and Hakkani-Tür, D. (2005): Active learning: theory and
    applications to automatic speech recognition. In IEEE Transactions on
    Speech and Audio Processing.
Thanks!!!




Text Analysis Seminar at the Göttingen
     Center of Digital Humanities
