Building Universal Dependency Treebanks in Korean
Jayeol Chun (1), Na-Rae Han (2), Jena D. Hwang (3), Jinho D. Choi (1)
(1) Emory University; (2) University of Pittsburgh; (3) IHMC
{che.yeol.chun, jinho.choi}@emory.edu, naraehan@pitt.edu, jhwang@ihmc.us
Objectives
This paper presents three Korean dependency treebanks derived from existing corpora and pseudo-annotated according to the latest UD guidelines, version 2 (UDv2).
• Fix several issues with the Korean portion of Google UD Treebank with respect to UDv2.
• Convert phrase structure trees in Penn Korean Treebank and KAIST Treebank into dependency trees following UDv2.
• Provide corpus analytics that include statistics of the new dependency treebanks and remaining issues with the current annotation.
Google UD Treebank
The Google UD Treebank (GKT) includes 6K+ sentences from weblogs and newswire annotated under older UD guidelines. We systematically correct GKT, bringing it up to the standards of UDv2.
[Figure: GKT correction steps shown on an example tree: Original Tree, Morphological Analysis, Tokenization, Head ID Remapping, Dependency Labeling]
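The Head ID Remapping step named above is not described in detail here, so the following is only a minimal sketch of one way such a remapping could work. It assumes that retokenization splits each old token into one or more pieces and that one piece per old token (the "anchor") inherits the old token's arcs; the names `new_ids`, `remap_heads`, `split_sizes` and `anchors` are hypothetical and do not come from the poster.

```python
from typing import List


def new_ids(split_sizes: List[int], anchors: List[int]) -> List[int]:
    """Return the new 1-based ID of the anchor piece of every old token.

    split_sizes[i] is the number of pieces old token i+1 is split into.
    anchors[i] is the zero-based offset of the piece that keeps the old
    token's arcs (an assumption of this sketch).
    """
    ids = []
    offset = 0  # number of new tokens emitted so far
    for size, anchor in zip(split_sizes, anchors):
        ids.append(offset + anchor + 1)
        offset += size
    return ids


def remap_heads(old_heads: List[int], split_sizes: List[int],
                anchors: List[int]) -> List[int]:
    """Translate old head IDs into new token IDs; 0 (root) is unchanged."""
    mapping = new_ids(split_sizes, anchors)
    return [0 if head == 0 else mapping[head - 1] for head in old_heads]


if __name__ == "__main__":
    # Toy sentence: three old tokens, the first one is split in two.
    print(new_ids([2, 1, 1], [0, 0, 0]))
    print(remap_heads([2, 0, 2], [2, 1, 1], [0, 0, 0]))
```

The pieces that are newly split off still need a head and a label of their own, which this sketch leaves to the labeling step.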
Corpus Analytics
• At approximately 26 dependency nodes per sentence, PKT has on average the longest and most complex sentences of the three corpora, likely reflecting its news domain.
• KTB is by far the largest corpus in this study, with sentence complexity comparable to that of GKT at approximately 12 dependency nodes per sentence.
Part-of-Speech Tags
• NOUN, VERB, ADV and PUNCT are the top parts of speech.
• In both PKT and GKT, PROPN (proper noun) is the fifth-highest ranking POS, while it ranks much lower in KTB, where ADJ (adjective) takes that spot instead.
• NUM (number) is prominent in PKT, which is likely a reflection of its news domain.
• The absence of SCONJ in GKT is due to its tokenization, which does not analyze particles as separate tokens.
• Notably, AUX (auxiliary) and PART (particle), lacking in GKT, were partially
introduced into the revised GKT as the result of tokenization of symbols and
punctuation marks.
Dependency Labels
• PKT and KTB appear consistent except in compound, nummod, dislocated and
nsubj. As briefly mentioned, compound and nummod are likely domain-specific
particularities.
• GKT’s abundant annotation of flat is a remnant of coarse tokenization that led
to embedded tokens labeled flat as a whole.
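Distributions like the part-of-speech and dependency-label rankings discussed above can be tallied directly from treebank files. The sketch below assumes the treebanks are stored in the CoNLL-U format used by UD releases (ten tab-separated columns, with UPOS in the fourth and DEPREL in the eighth); the function name `count_column` and the toy sentence are illustrative and not taken from the treebanks.

```python
from collections import Counter
from typing import Iterable

# Zero-based column indices in a CoNLL-U token line.
UPOS, DEPREL = 3, 7


def count_column(lines: Iterable[str], column: int) -> Counter:
    """Count the values of one CoNLL-U column over all word lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "3.1").
        if "-" in fields[0] or "." in fields[0]:
            continue
        counts[fields[column]] += 1
    return counts


# Toy input made up for this sketch.
SAMPLE = "\n".join([
    "# text = toy example",
    "1\tA\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tB\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])

if __name__ == "__main__":
    print(count_column(SAMPLE.splitlines(), UPOS).most_common())
    print(count_column(SAMPLE.splitlines(), DEPREL).most_common())
```

Running the same function over each treebank and comparing the `most_common()` rankings would give the kind of per-corpus comparison summarized in the bullets.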
Statistics
              GKT      PKT      KTB    Total
Tokens     80,392  132,041  350,090  562,523
Sentences   6,339    5,010   27,363   38,712
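As a quick sanity check, the totals and the average sentence lengths can be recomputed from the table. This is a small sketch that treats tokens as dependency nodes (an assumption); the dictionary layout and function names are illustrative.

```python
# Token and sentence counts copied from the Statistics table.
STATS = {
    "GKT": {"tokens": 80_392, "sentences": 6_339},
    "PKT": {"tokens": 132_041, "sentences": 5_010},
    "KTB": {"tokens": 350_090, "sentences": 27_363},
}


def average_length(corpus: str) -> float:
    """Average number of tokens per sentence for one corpus."""
    row = STATS[corpus]
    return row["tokens"] / row["sentences"]


def totals() -> dict:
    """Sum tokens and sentences over the three corpora."""
    return {
        key: sum(row[key] for row in STATS.values())
        for key in ("tokens", "sentences")
    }


if __name__ == "__main__":
    for name in STATS:
        print(f"{name}: {average_length(name):.1f} tokens per sentence")
    print(totals())
```

The sums reproduce the Total column (562,523 tokens, 38,712 sentences). PKT comes out at about 26.4 tokens per sentence, while GKT and KTB come out at about 12.7 and 12.8.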
Official UD Project: http://universaldependencies.org
Korean UD Project: https://github.com/emorynlp/ud-korean
Penn Korean Universal Dependency Treebank will be released officially through LDC.
Language Resources and Evaluation Conference
May 7-12, 2018; Miyazaki, Japan
Penn Korean Treebank & KAIST Treebank
Two Korean phrase structure treebanks are analyzed and converted into dependency trees using UDv2.
• Penn Korean Treebank (PKT): 5K+ sentences from newswire.
• KAIST Treebank (KTB): 27K+ sentences from literature, newswire, and academic manuscripts.
[Table: conversion steps by treebank. Columns: Empty Categories, Coordination, Part-of-Speech Tags, Dependency Relations. Rows: Penn, KAIST (one cell marked N/A).]
Empty Categories
• Heuristics are used for matching constituency tags at both phrasal and morpheme levels.
• Elided predicates caused by gapping relations are handled as fixed conjuncts, which needs to be further investigated.
Coordination Structures
• Coordination structures are detected by heuristics discovered from corpus analytics.
• Each conjunct becomes a head of its left sibling such that the rightmost conjunct becomes the head of the coordination structure.
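The attachment rule for conjuncts described above can be sketched in a few lines. This assumes the conjuncts of one coordination have already been detected and are given as token IDs in surface order; the function names are hypothetical.

```python
from typing import Dict, List


def attach_conjuncts(conjuncts: List[int]) -> Dict[int, int]:
    """Map every non-final conjunct to its head, the conjunct to its right.

    The rightmost conjunct is not in the result: it heads the whole
    coordination, so its own head lies outside the structure.
    """
    return {left: right for left, right in zip(conjuncts, conjuncts[1:])}


def coordination_head(conjuncts: List[int]) -> int:
    """The rightmost conjunct is the head of the coordination structure."""
    return conjuncts[-1]


if __name__ == "__main__":
    ids = [2, 5, 8]  # toy token IDs of three conjuncts
    print(attach_conjuncts(ids))
    print(coordination_head(ids))
```

For the toy conjuncts 2, 5 and 8, token 2 attaches to 5, token 5 attaches to 8, and token 8 is left to be attached by the surrounding structure.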
Part-of-Speech Tags
• Part-of-speech tags are mapped to UDv2 via manually analyzed heuristics. With a few exceptions, the mappings are categorical for both PKT and KTB.
• Some post-position markers (josa) and verbal endings (eomi) were identified as encoding conjunction: CCONJ, SCONJ. The rest are mapped to adpositions (ADP) and particles (PART), respectively.
Dependency Relations
• Once the empty categories are handled, each constituency node is assigned its head with head-percolation rules established separately for PKT and KTB.
• The dependency relation between the node and its head is inferred by investigating the function tags, phrasal tags and morphemes from the original treebanks.
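Head percolation as described above can be illustrated with a generic sketch. The rule table, the phrase and POS tags, and the rightward search direction below are placeholders chosen for illustration (an assumption, not the actual PKT or KTB rules), and dependency labels are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    token_id: Optional[int] = None  # set on leaves only


# Hypothetical rules: phrase tag -> (search direction, child tags by priority).
HEAD_RULES: Dict[str, Tuple[str, List[str]]] = {
    "S": ("right", ["VP", "S"]),
    "VP": ("right", ["VV", "VP"]),
    "NP": ("right", ["NN", "NP"]),
}


def head_child(node: Node) -> Node:
    """Pick the head child of a phrase using the rule table."""
    direction, priorities = HEAD_RULES.get(node.tag, ("right", []))
    ordered = node.children[::-1] if direction == "right" else node.children
    for tag in priorities:
        for child in ordered:
            if child.tag == tag:
                return child
    return ordered[0]  # fallback: first child in the search direction


def lexical_head(node: Node) -> Optional[int]:
    """Follow head children down to a leaf and return its token ID."""
    if not node.children:
        return node.token_id
    return lexical_head(head_child(node))


def to_heads(node: Node, heads: Optional[Dict[int, int]] = None) -> Dict[int, int]:
    """Return {dependent token ID: head token ID}; the root's head is 0."""
    if heads is None:
        heads = {lexical_head(node): 0}
    if node.children:
        top = lexical_head(node)
        for child in node.children:
            dep = lexical_head(child)
            if dep != top:
                heads[dep] = top  # non-head children depend on the phrase head
            to_heads(child, heads)
    return heads


if __name__ == "__main__":
    tree = Node("S", [
        Node("NP", [Node("NN", token_id=1)]),
        Node("VP", [Node("NP", [Node("NN", token_id=2)]),
                    Node("VV", token_id=3)]),
    ])
    print(to_heads(tree))
```

In the toy tree, the verb (token 3) percolates up as the head of VP and S, so both nouns end up as its dependents. A label for each arc would then be chosen from the function tags, phrasal tags and morphemes, as the second bullet describes.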

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Building Universal Dependency Treebanks in Korean

Jayeol Chun (1), Na-Rae Han (2), Jena D. Hwang (3), Jinho D. Choi (1)
(1) Emory University; (2) University of Pittsburgh; (3) IHMC
{che.yeol.chun, jinho.choi}@emory.edu, naraehan@pitt.edu, jhwang@ihmc.us

Objectives
This paper presents three dependency treebanks in Korean, derived from existing corpora and pseudo-annotated according to the latest UD guidelines, version 2 (UDv2).
• Fix several issues with the Korean portion of Google UD Treebank with respect to UDv2.
• Convert phrase structure trees in Penn Korean Treebank and KAIST Treebank into dependency trees following UDv2.
• Provide corpus analytics that include statistics of the new dependency treebanks and remaining issues with the current annotation.

Google UD Treebank
Google UD Treebank (GKT) includes 6K+ sentences from weblogs and newswire annotated under old UD guidelines. We carry out a systematic correction of GKT, bringing it up to the standards of UDv2.
[Figure panels: Original Tree, Morphological Analysis, Tokenization, Head ID Remapping, Dependency Labeling]

Corpus Analytics
• At approximately 26 dependency nodes per sentence, PKT has on average the longest and most complex sentences among the three corpora, likely reflecting its news domain.
• KTB is by far the largest corpus in this study, with sentence complexity comparable to that of GKT at approximately 12 dependency nodes per sentence.

Part-of-Speech Tags
• NOUN, VERB, ADV and PUNCT are the top parts of speech.
• In both PKT and GKT, PROPN (proper noun) is the fifth-highest ranking POS, while it ranks much lower in KTB, which instead has ADJ (adjective) in that spot.
• NUM (number) is prominent in PKT, which is likely a reflection of its news domain.
• The absence of SCONJ in GKT is due to tokenization that does not analyze particles as separate tokens.
• Notably, AUX (auxiliary) and PART (particle), lacking in GKT, were partially introduced into the revised GKT as a result of the tokenization of symbols and punctuation marks.

Dependency Labels
• PKT and KTB appear consistent except in compound, nummod, dislocated and nsubj. As briefly mentioned, compound and nummod are likely domain-specific particularities.
• GKT’s abundant annotation of flat is a remnant of coarse tokenization that led to embedded tokens labeled flat as a whole.

Statistics
            GKT      PKT       KTB       Total
Tokens      80,392   132,041   350,090   562,523
Sentences   6,339    5,010     27,363    38,712

Official UD Project: http://universaldependencies.org
Korean UD Project: https://github.com/emorynlp/ud-korean
Penn Korean Universal Dependency Treebank will be released officially through LDC.
Language Resources and Evaluation Conference, May 7-12, 2018; Miyazaki, Japan

Penn Korean Treebank & KAIST Treebank
Two Korean phrase structure treebanks are analyzed and converted into dependency trees using UDv2.
• Penn Korean Treebank (PKT): 5K+ sentences from newswire.
• KAIST Treebank (KTB): 27K+ sentences from literature, newswire, and academic manuscripts.
[Figure: conversion examples for Penn and KAIST under Empty Categories, Coordination, Part-of-Speech Tags, and Dependency Relations; one panel is marked N/A]

Empty Categories
• Heuristics are used for matching constituency tags at both phrasal and morpheme levels.
• Elided predicates caused by gapping relations are handled as fixed conjuncts, which needs to be further investigated.

Coordination Structures
• Coordination structures are detected by heuristics discovered from corpus analytics.
• Each conjunct becomes the head of its left sibling, such that the rightmost conjunct becomes the head of the coordination structure.

Part-of-Speech Tags
• Part-of-speech tags are mapped to UDv2 via manually analyzed heuristics. With a few exceptions, the mappings are categorical for both PKT and KTB.
• Some post-position markers (josa) and verbal endings (eomi) were identified as encoding conjunction: CCONJ, SCONJ. The rest are mapped to adpositions (ADP) and particles (PART), respectively.

Dependency Relations
• Once the empty categories are handled, each constituency node is assigned its head with head-percolation rules established separately for PKT and KTB.
• The dependency relation between the node and its head is inferred by investigating the function tags, phrasal tags and morphemes from the original treebanks.
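
The coordination rule stated on the poster (each conjunct becomes the head of its left sibling, so the rightmost conjunct heads the whole structure) can be illustrated with a minimal sketch. The function names and the use of integer token IDs are assumptions made for this illustration; the sketch does not reproduce the authors' conversion code, does not detect conjuncts, and does not assign dependency labels.

```python
from typing import Dict, List


def attach_conjuncts(conjunct_ids: List[int]) -> Dict[int, int]:
    """Map every non-final conjunct to its head, i.e. the conjunct to its right.

    conjunct_ids holds hypothetical token IDs of the conjuncts in
    left-to-right order, as already identified by some earlier step.
    """
    heads: Dict[int, int] = {}
    for left, right in zip(conjunct_ids, conjunct_ids[1:]):
        heads[left] = right  # the right sibling heads its left sibling
    return heads


def coordination_head(conjunct_ids: List[int]) -> int:
    """Return the head of the whole coordination: the rightmost conjunct."""
    if not conjunct_ids:
        raise ValueError("a coordination needs at least one conjunct")
    return conjunct_ids[-1]


if __name__ == "__main__":
    conjuncts = [2, 5, 8]  # three conjuncts, left to right
    print(attach_conjuncts(conjuncts))   # {2: 5, 5: 8}
    print(coordination_head(conjuncts))  # 8
```

The rightmost conjunct receives no head from this step; under this reading it would be attached to the rest of the sentence by whatever rule applies to the coordination as a whole.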