1. Active Annotation of Corpora
Kepa J. Rodriguez
Text Analysis Seminar at the Göttingen Center of Digital Humanities
02.05.2012
2. Outline
• Goal of the presentation.
• The LUNA corpus.
• Active annotation.
– Concept.
– Algorithm.
– Evaluation.
• Potential use of Active Annotation in humanities projects.
3. Goal of the presentation
• Introduce concepts of:
– Active Learning
– Active Annotation.
• Present their use in the annotation of the LUNA corpus.
• Discuss the utility of Active Annotation in humanities projects.
4. The LUNA Corpus (1)
• Corpus consists of:
– 3000 Human-Human and 8100 WOZ dialogues
– Multiple annotation levels: POS, entities, coreference, predicate structure, dialogue acts,
etc.
– in French, Italian and Polish.
• French subcorpus:
– Application domains: travel information and reservation, IT help desk, telecom customer
care and financial information transactions
– Human-Machine dialogues: 7100
• Italian subcorpus:
– Application domain: IT helpdesk
– 2500 Human-Human and 500 WOZ dialogues
• Polish subcorpus:
– Application domain: public transportation information
– 500 Human-Human and 500 WOZ dialogues
More information about the annotation scheme and levels:
http://www.ist-luna.eu/pdf/schemepresentationPdm.pdf
5. The LUNA Corpus (2)
[Operator:] allora m'ha detto che [non riusciva]c1 ad [accedere]c2 [al
computer]c3 e [le manca]c4 [la procedura]c5
so, you have told me that you cannot access the computer, and that you need the
procedure
c1 trouble : unable_to
c2 action : access
c3 computer-hardware : pc
c4 trouble : lack_of
c5 computer-software : procedure
[Caller:] esatto
exactly
[Operator:] allora avrei bisogno [dell' RWS]c6 [del PC]c7
so I need the RWS of the computer
c6 code-identificationCode : rws
c7 computer-hardware : pc
[Caller:] si allora [tredici zero ottantasei]c8
yes, 13 0 86
c8 code-identificationCode-rws : 13086
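Read as a data structure, each bracketed span in the example above pairs a surface
string with an attribute : value concept label. A minimal Python sketch of how one
such turn could be represented (the field names are illustrative assumptions for
this sketch, not the official LUNA annotation schema):

# Illustrative representation of one annotated turn; the field names are
# assumptions for this sketch, not the official LUNA schema.
turn = {
    "speaker": "Operator",
    "text": "allora avrei bisogno dell' RWS del PC",
    "concepts": [
        {"id": "c6", "span": "dell' RWS",
         "attribute": "code-identificationCode", "value": "rws"},
        {"id": "c7", "span": "del PC",
         "attribute": "computer-hardware", "value": "pc"},
    ],
}

for c in turn["concepts"]:
    print(c["id"], c["attribute"], ":", c["value"])  # e.g. c6 code-identificationCode : rws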
6. Active annotation (1)
The components of active annotation are:
• Active learning paradigm
– Selection of examples for annotation.
• Potential error detection
– Cases in which the manual annotation seems ambiguous
or contradictory.
7. Active annotation (2)
• Active learning paradigm:
– A paradigm based on statistical learning.
– A first small set is randomly chosen and manually annotated.
– This set is used to train a model and annotate the remaining samples.
– The most informative examples are selected to update the statistical
model (see the sketch below).
• Most informative = lowest confidence score.
• Uses of active learning:
– Speed up annotation.
– Support annotators in their work.
– Select the examples to be annotated: which examples from a large
amount of data will be useful for my purposes?
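A minimal sketch of the selection step, assuming a probabilistic classifier that
exposes per-example confidence scores (scikit-learn's LogisticRegression and
predict_proba stand in here for whatever model the annotation pipeline actually trains):

import numpy as np
from sklearn.linear_model import LogisticRegression

def select_most_informative(model, X_unlabeled, k):
    """Return indices of the k examples the model is least confident about."""
    proba = model.predict_proba(X_unlabeled)   # per-class probabilities
    confidence = proba.max(axis=1)             # score of the top-ranked label
    return np.argsort(confidence)[:k]          # lowest confidence = most informative

# Usage: seed the model on a small random sample, then pick the next batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                # toy data standing in for dialogues
y = (X[:, 0] > 0).astype(int)
seed = rng.choice(len(X), size=50, replace=False)
model = LogisticRegression().fit(X[seed], y[seed])
batch = select_most_informative(model, np.delete(X, seed, axis=0), k=10)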
8. Active annotation (3)
Learning curve comparison: active vs. random learning
(Riccardi and Hakkani-Tür, 2005)
9. Active annotation (4)
• Likely error detection:
– Re-annotate the training data using the statistical model.
– Extract the examples in which the manual and the automatic
annotations differ.
– Send them for human supervision.
• Uses of likely error detection (see the sketch below):
– If the manual annotation is correct, the example is hard to learn:
• Analyze which new features can be implemented to enrich the model.
– If the manual annotation is erroneous:
• Correct it.
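A minimal sketch of the comparison step, under the assumption that the manual and
automatic annotations can be lined up example by example as flat label sequences:

def find_disagreements(manual, automatic):
    """Return (index, manual_label, model_label) triples where the model's
    re-annotation contradicts the manual annotation; these are the cases
    sent for human supervision."""
    return [(i, gold, pred)
            for i, (gold, pred) in enumerate(zip(manual, automatic))
            if gold != pred]

# Example: the third item would be flagged for supervision.
suspects = find_disagreements(["trouble", "action", "computer-hardware"],
                              ["trouble", "action", "computer-software"])
print(suspects)   # [(2, 'computer-hardware', 'computer-software')]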
10. Annotation algorithm
1. Randomly select a small set of dialogues and annotate them manually
from scratch (SL).
2. Train a model M using SL.
3. While (labeler/data available):
a) Use M to automatically annotate the unannotated part of the corpus (Su).
b) Rank the automatically annotated examples of Su according to the confidence
measure given by M.
c) Select a batch of k dialogues with the lowest scores (Sk).
d) Ask for human control/correction of Sk.
e) Use M to automatically annotate SL, producing SaL.
f) Look at the differences between SL and SaL:
i. HARD TO LEARN EXAMPLE: Add new features when training M.
ii. ANNOTATION AMBIGUITIES: Hire human annotators to disambiguate SL.
g) SL = SL + Sk
h) Train a new model M with SL.
i) Go to step 3a (a sketch of the whole loop follows).
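A hedged Python sketch of the loop above. Every callable parameter
(manually_annotate, train, annotate, confidence, labeler_available,
review_disagreements) is a placeholder for the real annotation tool and
statistical model, not part of any published implementation:

import random

def active_annotation(corpus, seed_size, batch_size,
                      manually_annotate, train, annotate, confidence,
                      labeler_available, review_disagreements):
    """Sketch of the annotation algorithm; all callables are placeholders."""
    random.shuffle(corpus)
    labeled = [manually_annotate(d) for d in corpus[:seed_size]]        # step 1: SL
    unlabeled = corpus[seed_size:]                                      # Su
    model = train(labeled)                                              # step 2
    while labeler_available() and unlabeled:                            # step 3
        ranked = sorted(unlabeled, key=lambda d: confidence(model, d))  # a, b
        batch, unlabeled = ranked[:batch_size], ranked[batch_size:]     # c: Sk
        batch = [manually_annotate(annotate(model, d)) for d in batch]  # d
        re_annotated = [annotate(model, d) for d in labeled]            # e: SaL
        review_disagreements(labeled, re_annotated)                     # f
        labeled += batch                                                # g
        model = train(labeled)                                          # h, then loop (i)
    return labeled, model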
11. Evaluation (2)
• Annotator point of view:
– Annotation from scratch: 80-90 minutes/file.
– Supervision after the 3rd active annotation loop: 20-25 minutes/file.
– Annotators can concentrate more on:
• Difficult/interesting issues.
• Giving feedback about the model.
• Error detection: no statistics.
– Most of the reported feedback requests were annotation errors.
– Some of the reported feedback requests were caused by ambiguities and
helped to add features to enrich the model.
13. Discussion
• Questions
• Annotation tasks in the GCDH:
– Corpus of Coptic Texts.
– …
14. References
• LUNA project: http://www.ist-luna.eu
• Raymond, C., Rodriguez, K. J. and Riccardi, G. (2008): Active Annotation in the
LUNA Italian Corpus of Spontaneous Dialogues. In Proceedings of the
Sixth International Conference on Language Resources and Evaluation
(LREC 2008). Marrakech, Morocco.
• Riccardi, G. and Hakkani-Tür, D. (2005): Active Learning: Theory and
Applications to Automatic Speech Recognition. In IEEE Transactions on
Speech and Audio Processing.