Proceedings of the Australasian Language Technology Association Workshop 2013 (ALTA 2013), pp132-136, Brisbane, Australia. http://aclweb.org/anthology/U/U13/
1. Overview of the 2013 ALTA Shared Task
Diego Mollá
Australasian Language Technology
Macquarie University
ALTA 2013, Brisbane, Australia
2. The ALTA Shared Tasks
The 2013 Task
Kaggle in Class
Results
Use in UniMelb
Contents
The ALTA Shared Tasks
The 2013 ALTA Shared Task
Kaggle in Class
Results
Use in University of Melbourne (Karin Verspoor)
The ALTA Shared Tasks
Aims
Target university students with programming experience.
No background on text processing required.
Aim to expose potential researchers to NLP-related problems.
Format
All participants attempt to solve the same problem.
The training and test data are common to all.
Any tools and external resources can be used.
The solution must be completely automated.
The 2013 Shared Task
Task: Case and punctuation restoration
Categories: student, open
Prize: $350
Framework: Kaggle in Class
Student Category
All members are university students.
No members are full-time employed.
No members have a PhD.
Open Category
Any other teams.
Case and Punctuation Restoration
Input
. . . stored at the ucla television archives the archived episodes were
telecast march 8 16 and 24 1971 april 1 and . . .
Output
. . . stored at the UCLA Television Archives. The archived episodes
were telecast: March 8, 16, and 24, 1971, April 1 and . . .
Motivation
In some situations, English text does not have information
about capitalisation or punctuation.
Automated text transcriptions.
Quick notes.
Text messages, tweets.
In some applications, a preliminary stage of case and
punctuation restoration improves outcomes.
Machine translation.
Information extraction.
Case and Punctuation Restoration as a Classification Task
Baldwin and Joseph (2009)
Multi-label classification.
Each label indicates the information to restore.
COMMA: Word is followed by a comma.
CAPi: Character i is in uppercase.
ALLCAPS: All characters in uppercase.
NOCHANGE: No special restoration needed.
...
corp/CAP1+FULLSTOP+COMMA
Corp.
Simplification for the ALTA Shared Task
Only Two Labels
Case: The word has at least one character in uppercase.
Punct: The word is followed by at least one punctuation mark.
Punctuation Marks
,.;:?!
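The two labels can be derived automatically from ordinary cased, punctuated text, which is one way to see why training data for this task is cheap to produce. The following is a minimal sketch, not the organisers' actual preprocessing: it assumes whitespace tokenisation, and the names `label_token` and `label_text` are hypothetical.

```python
# Punctuation marks counted by the simplified task.
PUNCTUATION = ",.;:?!"

def label_token(token):
    """Return (lowercased word, Case flag, Punct flag) for one token."""
    core = token.rstrip(PUNCTUATION)              # drop trailing punctuation marks
    has_punct = len(core) < len(token)            # Punct: followed by >= 1 mark
    has_case = any(ch.isupper() for ch in core)   # Case: >= 1 uppercase character
    return core.lower(), has_case, has_punct

def label_text(text):
    """Label every whitespace-separated token, skipping punctuation-only tokens."""
    labelled = []
    for token in text.split():
        word, has_case, has_punct = label_token(token)
        if word:
            labelled.append((word, has_case, has_punct))
    return labelled

if __name__ == "__main__":
    for row in label_text("stored at the UCLA Television Archives. The archived episodes"):
        print(row)
```

Real data would likely need a proper tokeniser (for example, to separate brackets from words), so treat this only as an illustration of the label definitions.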
Training Set
CAPITALIZED PUNCTUATION WORD
True False positive
False False pressure
False False ventilation
False False (
True False ppv
False False )
False False consists
False False of
False False using
False False a
False False fan
False False to
False False create
Test Set
Input
ID WORD
255 stored
256 at
257 the
258 ucla
259 television
260 archives
261 the
262 archived
263 episodes
264 were
Output
Id,documents
Case,258 259 260 261 266 272
Punct,260 265 267 268 270 271
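The output format lists, for each label, the IDs of the words that carry it. A minimal sketch of producing such a file from per-word predictions follows; the function name `format_submission` and the way predictions are passed in are assumptions for illustration, not part of the task definition.

```python
def format_submission(ids, case_flags, punct_flags):
    """Build the three-line submission text from per-word boolean predictions."""
    case_ids = [str(i) for i, flag in zip(ids, case_flags) if flag]
    punct_ids = [str(i) for i, flag in zip(ids, punct_flags) if flag]
    return "\n".join([
        "Id,documents",
        "Case," + " ".join(case_ids),
        "Punct," + " ".join(punct_ids),
    ])

if __name__ == "__main__":
    # Words 255..264 from the example: stored at the ucla television archives
    # the archived episodes were
    ids = list(range(255, 265))
    case = [False, False, False, True, True, True, True, False, False, False]
    punct = [False, False, False, False, False, True, False, False, False, False]
    print(format_submission(ids, case, punct))
```

For the ten words shown, this yields only the IDs up to 264; the remaining IDs in the slide's output belong to words beyond the excerpt.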
Data Sources
Test Set
Data collected by Baldwin & Joseph (2009) from the AP
Newswire (APW) and New York Times (NYT) sections of the
English Gigaword Corpus.
1. Public test set: available for participants during the
competition.
2. Private test set: released at the last minute.
Training Set
A third partition from the data by Baldwin & Joseph (2009).
An extract of Wikipedia.
Data Sizes
Wikipedia Extract for Training
18 files.
306,445 words in total.
Data from Baldwin & Joseph (2009)
Training: 66,371 words.
Public test: 64,072 words.
Private test: 66,371 words.
Kaggle in Class
Kaggle
Kaggle offers a Web-based framework for data-driven
competitions.
A large base of potential participants.
Potentially large prizes for the participants.
Fee-based for the organisers; free for the participants.
Kaggle in Class
Free for organisers and participants.
Limited user support by Kaggle.
Used by course-based competitions.
ALTA Shared Task in Kaggle in Class
Features of Kaggle in Class
Public leaderboard: all participants can submit and compare
with other participants.
Automated evaluation: organisers can choose among several
evaluation metrics.
Public and private partitions: a private partition of the test
data is withheld for the final ranking.
But this feature does not work well with some evaluation
metrics.
Discussion forum: for communication among participants.
Evaluation Metric
Macro-Averaged F1
Output
Id,documents
Case,258 259 260 262 270
Punct,259 260 265 270
Target
Id,documents
Case,258 259 260 261 266 272
Punct,260 265 267 268 270 271
Case: P = 3/5; R = 3/6; F1 ≈ 0.545
Punct: P = 3/4; R = 3/6; F1 = 0.6
Final score: (0.545 + 0.6)/2 ≈ 0.57
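The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of macro-averaged F1 over sets of word IDs, written from the definitions on the slide; it is not Kaggle's own scoring code, and the function names are hypothetical.

```python
def f1(predicted, target):
    """F1 between two collections of word IDs for a single label."""
    predicted, target = set(predicted), set(target)
    true_positives = len(predicted & target)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(target)
    return 2 * precision * recall / (precision + recall)

def macro_f1(output, target):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1(output.get(label, ()), ids) for label, ids in target.items()) / len(target)

if __name__ == "__main__":
    output = {"Case": [258, 259, 260, 262, 270], "Punct": [259, 260, 265, 270]}
    target = {"Case": [258, 259, 260, 261, 266, 272],
              "Punct": [260, 265, 267, 268, 270, 271]}
    print(round(macro_f1(output, target), 2))
```

Because each label contributes equally to the average, a system that does well on only one of the two labels is capped at a score of about 0.5.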
A Baseline
Training data        F1 (public)  F1 (private)
Train data           0.4355       0.2895
Wikipedia 0-5        0.4077       0.2761
Wikipedia 0-10       0.4173       0.2791
Wikipedia 0-1        0.42267      0.2789
Train + Wikipedia    0.4493       0.2876
Single-label task: Each of the 4 combinations of possible
labels forms a single label.
Trained NLTK’s Hidden Markov Model (HMM).
Results improved as we added more training data.
Large difference between “public” and “private” test sets.
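To illustrate the single-label formulation, the sketch below maps each of the four Case/Punct combinations to one tag and turns training rows into (word, tag) pairs. The tag names and helper functions are hypothetical; the input is assumed to be data rows in the CAPITALIZED PUNCTUATION WORD layout with the header line already removed. Pairs in this shape are what a supervised sequence tagger, such as NLTK's `HiddenMarkovModelTrainer.train_supervised`, takes as training sequences.

```python
# One tag per combination of the two binary labels (tag names are illustrative).
TAGS = {
    (False, False): "NONE",
    (True, False): "CASE",
    (False, True): "PUNCT",
    (True, True): "CASE+PUNCT",
}
FLAGS = {tag: flags for flags, tag in TAGS.items()}

def parse_row(line):
    """Turn 'True False positive' into ('positive', 'CASE')."""
    capitalized, punctuation, word = line.split()
    return word, TAGS[(capitalized == "True", punctuation == "True")]

def encode(lines):
    """Convert training rows into one sequence of (word, tag) pairs."""
    return [parse_row(line) for line in lines if line.strip()]

def decode(tag):
    """Recover the (Case, Punct) flags from a predicted tag."""
    return FLAGS[tag]

if __name__ == "__main__":
    rows = ["True False positive", "False False pressure", "False False ventilation"]
    print(encode(rows))
    print(decode("CASE+PUNCT"))
```

Decoding the predicted tags back into the two flags is what allows a single-label tagger to be scored against the two-label target.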
Results
Public Data
Rank  Team           Score
1     Winner         0.73763
2     Second         0.68360
3     ?              0.63232
4     ?              0.63109
5     ?              0.60251
6     ?              0.60147
7     ?              0.59517
8     ?              0.58332
9     ?              0.56832
10    ?              0.56747
11    ?              0.55793
12    ?              0.55606
13    ?              0.55087
14    ?              0.52261
15    ?              0.51954
16    ?              0.51167
17    ?              0.49311
18    ?              0.47622
19    (test system)  0.46667
20    ?              0.46490
21    ?              0.45986
22    ?              0.45291
-     Baseline       0.44930
23    (8 systems)    0.44930
32    ?              0.44914
33    ?              0.42710
34    ?              0.42257
35    ?              0.41692
36    ?              0.40239
37    ?              0.38812
38    ?              0.38113
39    ?              0.32594
40    ?              0.32320
41    ?              0.30988
42    ?              0.29891
43    ?              0.29304
44    ?              0.27642
45    ?              0.23504
46    Team A         0.23108
47    ?              0.21930
48    ?              0.21771
49    ?              0.21291
50    ?              0.20226
51    ?              0.13397
52    ?              0.00000
Private Data
Rank  Team           Score
1     Winner         0.73660
2     Second         0.64934
3     ?              0.30037
4     Team A         0.07656
The ALTA Shared Task in Class at UniMelb
Students in the UniMelb Knowledge Technologies subject
were assigned the shared task as a class project.
Blended learning: augmenting classroom learning with
on-line opportunities.
Some adaptations were made to the class context:
Stage 1: Data pre-processing
Stage 2: Feature and Method Exploration; Report write-up
Stage 3: Peer review
Emphasis on critical analysis of methods and results.
ALTA Kaggle in Class at UniMelb
Students were given the option of participating on-line
through Kaggle in Class.
Participating in the on-line forum gave immediate feedback on
performance.
Open "competition" through the leaderboard stimulated
experimentation.
Anecdotal observation suggested better overall marks for
students who participated on-line.
Conclusions
Larger participation than in past tasks.
Used as an assignment in a Masters unit at the University of
Melbourne.
Many participants did much better than our baseline.
Easy to produce training data.
More training data from other domains (Wikipedia) improves
the results.
Kaggle in Class was useful, though we had to run a second “final”
submission round, which had very few participants.
Questions?