This paper presents three new dependency treebanks for Korean created by converting existing corpora to follow Universal Dependencies guidelines version 2. The treebanks are: 1) A revised version of the Google Universal Dependency Treebank with corrections to align with UDv2. 2) The Penn Korean Treebank containing over 5,000 sentences of newswire converted from phrase structure trees to dependency trees. 3) The KAIST Treebank containing over 27,000 sentences from various domains converted from phrase structure trees to dependency trees. Statistics are provided on the treebanks including part-of-speech tags and dependency labels.
Building Universal Dependency Treebanks in Korean
Jayeol Chun (1), Na-Rae Han (2), Jena D. Hwang (3), Jinho D. Choi (1)
(1) Emory University; (2) University of Pittsburgh; (3) IHMC
{che.yeol.chun, jinho.choi}@emory.edu, naraehan@pitt.edu, jhwang@ihmc.us
Objectives
This paper presents three dependency treebanks in Korean, derived from existing corpora and pseudo-annotated according to the latest UD guidelines, version 2 (UDv2).
• Fix several issues with the Korean portion of Google UD Treebank with respect to UDv2.
• Convert phrase structure trees in Penn Korean Treebank and KAIST Treebank into dependency trees following UDv2.
• Provide corpus analytics that include statistics of the new dependency treebanks and remaining issues with the current annotation.
Google UD Treebank
Google UD Treebank (GKT) includes 6K+ sentences from weblogs and newswire annotated under older UD guidelines. We carry out a systematic correction of GKT, bringing it up to the standards of UDv2.
[Figure panels: Original Tree, Morphological Analysis, Tokenization, Head ID Remapping, Dependency Labeling]
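To illustrate why head IDs need remapping once tokenization changes, here is a minimal Python sketch. It rests on assumptions that are not stated on the poster: that retokenization splits trailing punctuation off a word, that the word keeps the original head and its incoming arcs, and that the split-off punctuation attaches to the word. The names `split_trailing_punct` and `retokenize` are hypothetical.

```python
import re

def split_trailing_punct(form):
    """Assumed splitting rule: detach trailing punctuation from a word."""
    m = re.fullmatch(r"(.*?)([.,!?]+)", form)
    if m and m.group(1):
        return [m.group(1), m.group(2)]
    return [form]

def retokenize(tokens):
    """tokens: dicts with 1-based 'id', 'form' and 'head' (0 marks the root)."""
    new_id_of = {0: 0}          # old token id -> new id of its first piece
    pieces_of = []
    next_id = 1
    for tok in tokens:
        pieces = split_trailing_punct(tok["form"])
        new_id_of[tok["id"]] = next_id
        pieces_of.append(pieces)
        next_id += len(pieces)

    out = []
    for tok, pieces in zip(tokens, pieces_of):
        anchor = new_id_of[tok["id"]]
        # The first piece inherits the original head, remapped to the new IDs.
        out.append({"id": anchor, "form": pieces[0],
                    "head": new_id_of[tok["head"]]})
        # Assumption: split-off pieces attach to the first piece.
        for offset, piece in enumerate(pieces[1:], start=1):
            out.append({"id": anchor + offset, "form": piece, "head": anchor})
    return out

sentence = [
    {"id": 1, "form": "네,", "head": 3},
    {"id": 2, "form": "나는", "head": 3},
    {"id": 3, "form": "간다.", "head": 0},
]
for tok in retokenize(sentence):
    print(tok["id"], tok["form"], tok["head"])
```

Every old head reference goes through the same `new_id_of` table, so arcs stay attached to the same words even though their positions shift.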
Corpus Analytics
• At approximately 26 dependency nodes per sentence, PKT includes on average the longest and most complex sentences among the three corpora, likely reflecting its news domain.
• KTB is by far the largest corpus in this study, with sentence complexity comparable to that of GKT at approximately 12 dependency nodes per sentence.
Part-of-Speech Tags
• NOUN, VERB, ADV and PUNCT rank as the top parts of speech.
• In both PKT and GKT, PROPN (proper noun) is the fifth-highest ranking POS, while it ranks much lower in KTB, where ADJ (adjective) takes that spot instead.
• NUM (number) is prominent in PKT, which likely reflects its news domain.
• The absence of SCONJ in GKT is due to its tokenization, which does not analyze particles as separate tokens.
• Notably, AUX (auxiliary) and PART (particle), lacking in GKT, were partially introduced into the revised GKT as a result of the tokenization of symbols and punctuation marks.
Dependency Labels
• PKT and KTB appear consistent except in compound, nummod, dislocated and
nsubj. As briefly mentioned, compound and nummod are likely domain-specific
particularities.
• GKT’s abundant annotation of flat is a remnant of coarse tokenization that led
to embedded tokens labeled flat as a whole.
Statistics
            GKT     PKT      KTB      Total
Tokens      80,392  132,041  350,090  562,523
Sentences    6,339    5,010   27,363   38,712
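As a quick check, the per-sentence averages quoted under Corpus Analytics can be recomputed from the table. The sketch below assumes one dependency node per token; the variable names are illustrative only.

```python
# (tokens, sentences) per treebank, copied from the Statistics table.
stats = {
    "GKT": (80_392, 6_339),
    "PKT": (132_041, 5_010),
    "KTB": (350_090, 27_363),
}

# Average tokens per sentence, rounded to one decimal place.
averages = {name: round(tokens / sentences, 1)
            for name, (tokens, sentences) in stats.items()}

total_tokens = sum(tokens for tokens, _ in stats.values())
total_sentences = sum(sentences for _, sentences in stats.values())

print(averages)                       # {'GKT': 12.7, 'PKT': 26.4, 'KTB': 12.8}
print(total_tokens, total_sentences)  # 562523 38712
```

The ratios come out near 13 for GKT and KTB and near 26 for PKT, in line with the approximate figures quoted in the analytics, and the column sums match the Total column.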
Official UD Project: http://universaldependencies.org
Korean UD Project: https://github.com/emorynlp/ud-korean
Penn Korean Universal Dependency Treebank will be released officially through LDC.
Language Resources and Evaluation Conference
May 7-12, 2018; Miyazaki, Japan
Penn Korean Treebank & KAIST Treebank
Two Korean phrase structure treebanks are analyzed and converted into dependency trees following the UDv2 guidelines.
• Penn Korean Treebank (PKT): 5K+ sentences from newswire.
• KAIST Treebank (KTB): 27K+ sentences from literature, newswire, and academic manuscripts.
[Figure: conversion examples for Penn and KAIST in four panels: Empty Categories, Coordination, Part-of-Speech Tags, Dependency Relations. One KAIST panel is marked N/A.]
Empty Categories
• Heuristics are used for matching constituency tags at both phrasal and morpheme levels.
• Elided predicates caused by gapping relations are handled as fixed conjuncts, a treatment that needs further investigation.
Coordination Structures
• Coordination structures are detected by heuristics discovered from corpus analytics.
• Each conjunct becomes the head of its left sibling, so that the rightmost conjunct becomes the head of the coordination structure.
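The right-headed attachment of conjuncts described above can be sketched as follows. This is a minimal illustration that assumes the conjuncts have already been detected and are given as token IDs in surface order, and that "left sibling" refers to the neighbouring conjunct on the left. The function name `attach_conjuncts` is hypothetical.

```python
def attach_conjuncts(conjunct_ids):
    """Chain conjuncts so that each one heads the conjunct to its left.

    Returns (arcs, head): arcs maps a dependent conjunct to its head,
    and head is the rightmost conjunct, which represents the whole
    coordination structure.
    """
    if not conjunct_ids:
        raise ValueError("a coordination needs at least one conjunct")
    # Pair every conjunct with its right neighbour: left -> right.
    arcs = {left: right
            for left, right in zip(conjunct_ids, conjunct_ids[1:])}
    return arcs, conjunct_ids[-1]

arcs, head = attach_conjuncts([2, 5, 7])
print(arcs)  # {2: 5, 5: 7}
print(head)  # 7
```

With three conjuncts at positions 2, 5 and 7, token 2 attaches to 5, token 5 attaches to 7, and token 7 is what the rest of the tree sees as the head of the coordination.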
Part-of-Speech Tags
• Part-of-speech tags are mapped to UDv2 via manually analyzed heuristics. With a few exceptions, the mappings are categorical for both PKT and KTB.
• Some postposition markers (josa) and verbal endings (eomi) were identified as encoding conjunction and are mapped to CCONJ or SCONJ. The rest are mapped to adpositions (ADP) and particles (PART), respectively.
Dependency Relations
• Once the empty categories are handled, each constituency node is assigned its head with head-percolation rules established separately for PKT and KTB.
• The dependency relation between the node and its head is inferred by investigating the function tags, phrasal tags and morphemes from the original treebanks.
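Head assignment and label inference of this kind can be sketched as below. The rule tables, the function-tag mapping and the fallbacks are all placeholders chosen for illustration, since the poster does not list the actual rules. The sketch also assumes a head-final default (the rightmost child) and looks only at function tags, whereas the poster says phrasal tags and morphemes are consulted too.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str                                       # phrasal tag, or POS tag on a leaf
    children: list = field(default_factory=list)
    token_id: int = 0                              # meaningful on leaves only
    function: str = ""                             # function tag, e.g. "SBJ"

# Placeholder head-percolation rules, one table per treebank (hypothetical).
HEAD_RULES = {
    "PKT": {"S": ["VP"], "VP": ["VV"], "NP": ["NNC"]},
    "KTB": {"S": ["VP"], "VP": ["VV"], "NP": ["NNC"]},
}
# Placeholder mapping from function tags to UD relations (hypothetical).
FUNCTION_TO_DEPREL = {"SBJ": "nsubj", "OBJ": "obj"}

def head_child(node, rules):
    """Pick the head child by rule priority, scanning right to left."""
    for wanted in rules.get(node.tag, []):
        for child in reversed(node.children):
            if child.tag == wanted:
                return child
    return node.children[-1]                       # assumed head-final default

def lexical_head(node, rules):
    """Follow head children down to a leaf and return its token ID."""
    while node.children:
        node = head_child(node, rules)
    return node.token_id

def convert(root, treebank):
    """Return {dependent_id: (head_id, deprel)} for a constituency tree."""
    rules = HEAD_RULES[treebank]
    arcs = {lexical_head(root, rules): (0, "root")}
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            continue
        head = head_child(node, rules)
        head_id = lexical_head(head, rules)
        for child in node.children:
            if child is not head:
                label = FUNCTION_TO_DEPREL.get(child.function, "dep")
                arcs[lexical_head(child, rules)] = (head_id, label)
            stack.append(child)
    return arcs

tree = Node("S", [
    Node("NP", [Node("NNC", token_id=1)], function="SBJ"),
    Node("VP", [
        Node("NP", [Node("NNC", token_id=2)], function="OBJ"),
        Node("VV", token_id=3),
    ]),
])
print(convert(tree, "PKT"))
```

Keeping one rule table per treebank mirrors the poster's statement that the rules were established separately for PKT and KTB. Any non-head child without a mapped function tag falls back to the generic `dep` relation in this sketch.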