STANDOFF ANNOTATION FOR THE ANCIENT
GREEK AND LATIN DEPENDENCY TREEBANK
DATECH2019, 9.5.2019
Giuseppe G. A. Celano
DFG PROJECT
1. Revise: correct the errors
2. Standardize: convert the AGLDT to standoff annotation in PAULA XML (and convert it into UD)
   1. standoff for multiple annotations and/or multiple interpretations of the same token
   2. standoff to overcome the problem of conflicting hierarchies
3. Expand: add new annotations
(https://git.informatik.uni-leipzig.de/celano/agldt1)
THE AGLDT
▸ Ancient Greek texts: 557,922 tokens
▸ Latin texts: 79,697 tokens
▸ available on GitHub/GitLab:
▸ https://perseusdl.github.io/treebank_data/
▸ https://git.informatik.uni-leipzig.de/celano/agldt1
LABELED DIRECTED ACYCLIC GRAPHS
THE PERSEUS TREEBANK (LAST RELEASE, 2.1)
▸ 12 texts
composition date    text                         token number
63 BC               Cicero, In Catilinam          6,652
51 BC               Caesar, De Bello Gallico      1,556
post 44 BC          Sallust, Bellum Catilinae    13,191
ca 25 BC            Prop., Elegiae                5,297
29-19 BC            Vergil, Aeneid                2,839
ca 8 AD             Ov., Metamorphoses            5,209
14 AD               Aug., Res Gestae              3,035
15-50 AD            Ph., Fabulae                  6,588
ca 100 AD?          Petr., Satyricon             14,177
ca 100-110 AD       Tac., Historiae               3,531
117-138 AD          Suet., Vita Divi Augusti      8,313
ca 400 AD           Ger., Vulgata                 9,309
TREEBANK PIPELINE
▸ start
▸ choose a TEI/XML text
▸ preliminary automatic annotation
  ▸ tokenize it
  ▸ rule-based POS tagger/parser
▸ manual correction
▸ end
THE PERSEUS TREEBANK: TEI XML TEXT
THE PERSEUS TREEBANK: INLINE ANNOTATION
INLINE ANNOTATION: ADVANTAGES
1. easy to add
2. easy to query
3. well supported by annotation tools
INLINE ANNOTATION: DISADVANTAGES
1. the tokenized text becomes the new base text
2. after text extraction from a TEI text, the link to the original text is virtually lost
(e.g., amabam-que and the content of some editorial markup)
3. it is infeasible to connect such base texts to other annotation layers with
different tokenization schemes. For example:
‣ amabamque: one phonetic word
‣ amabam-que: two syntactic words
‣ am-a-ba-m-que: five morphemes
‣ verse vs. sentence
STANDOFF ANNOTATION
1. each annotation layer is attached separately to the original text
(i.e., the base text).
2. an annotation layer references the original text or another
annotation layer, which in turn references the original text
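A minimal sketch in Python of how standoff layers can coexist over one untouched base text. The layer names, the (start, end) span layout and the `resolve` helper are hypothetical and only illustrate the idea; they are not the AGLDT or PAULA data model.

```python
# The base text is never modified; every layer points into it by character offsets.
base_text = "amabamque"

# Each layer is a list of (start, end) spans, end exclusive (hypothetical layout).
layers = {
    "phonetic_word": [(0, 9)],
    "syntactic_word": [(0, 6), (6, 9)],
    "morpheme": [(0, 2), (2, 3), (3, 5), (5, 6), (6, 9)],
}

# A layer may also reference another layer instead of the base text:
# here each label points at a span index of the "syntactic_word" layer.
pos_layer = {("syntactic_word", 0): "verb", ("syntactic_word", 1): "conjunction"}


def resolve(layer_name):
    """Return the substrings of the base text covered by a layer's spans."""
    return [base_text[start:end] for start, end in layers[layer_name]]


def resolve_pos():
    """Follow each label through its target layer down to the base text."""
    result = []
    for (layer_name, index), label in pos_layer.items():
        start, end = layers[layer_name][index]
        result.append((base_text[start:end], label))
    return result


for name in layers:
    print(name, resolve(name))
print(resolve_pos())
```

The three tokenization schemes never conflict because none of them rewrites the base text.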
STANDOFF ANNOTATION: PAULA XML
1. open format based on the principles of LAF (ISO 24612:2012)
2. already employed in a number of historical language corpora
3. the base text is a bare XML text, which is referenced virtually only
via offsets
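A rough sketch in Python of offset-based referencing. The pointer strings imitate the string-range form (1-based start plus length) found in PAULA token files; the exact syntax, the `//body` node and both function names are assumptions here and should be checked against the PAULA documentation. Whitespace splitting is a placeholder, not the tokenization used in the project.

```python
def string_range_href(start, length, node="//body"):
    """Build a string-range pointer; start is 1-based, length is in characters."""
    return f"#xpointer(string-range({node},'',{start},{length}))"


def token_hrefs(text):
    """Pair each whitespace-separated token with a pointer into the base text."""
    hrefs = []
    position = 0
    for token in text.split():
        start = text.index(token, position)  # 0-based offset of this token
        hrefs.append((token, string_range_href(start + 1, len(token))))
        position = start + len(token)
    return hrefs


for token, href in token_hrefs("Gallia est omnis"):
    print(token, href)
```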
THE CASE STUDY: CAESAR’S DE BELLO CIVILI
1. the base text is a ‘complex’ TEI XML file
‣ reference is made via XPath coinciding with CTS divisions
(https://git.informatik.uni-leipzig.de/celano/latinnlp/tree/master/case-study)
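A minimal sketch in Python of addressing a passage by an XPath that mirrors book/chapter/section divisions, using only the standard library. The toy TEI fragment, its placeholder text, the `n` attribute layout and the `passage_text` helper are assumptions for illustration; the actual case-study files may be structured differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Toy document with placeholder text (hypothetical, not the real edition).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div n="1">
  <div n="1">
    <div n="1"><p>sample text of section one</p></div>
    <div n="2"><p>sample text
       of section two</p></div>
  </div>
</div>
</body></text></TEI>"""


def passage_text(root, book, chapter, section):
    """Return the whitespace-normalized text of book.chapter.section, or None."""
    path = (
        f".//tei:div[@n='{book}']"
        f"/tei:div[@n='{chapter}']"
        f"/tei:div[@n='{section}']"
    )
    node = root.find(path, TEI_NS)
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())


root = ET.fromstring(SAMPLE)
print(passage_text(root, "1", "1", "2"))
```

Because the path follows the citation hierarchy, a reference such as 1.1.2 maps directly onto one element of the unmodified TEI file.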
TOKENIZATION/WORD SEGMENTATION
▸ Latin: rule-based
▸ select the text to annotate from the TEI XML file
▸ identify abbreviations (word list + regular expressions)
▸ Cn. = Gnaeus
▸ list of not-to-tokenize words (e.g., Antigone, aeque)
▸ tokens ending with ne/que/ve
▸ list of to-tokenize words (e.g., nequis, nobiscum)
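The rules above can be sketched in Python as follows. This is an illustration under assumptions, not the project's tokenizer: the word lists contain only the examples from the slide, the split points for nequis (ne + quis) and nobiscum (nobis + cum) are assumed, and all function and variable names are hypothetical.

```python
import re

ABBREVIATIONS = {"Cn.": "Gnaeus"}          # abbreviation -> expansion
NOT_TO_TOKENIZE = {"antigone", "aeque"}    # end in ne/que/ve but stay whole
TO_TOKENIZE = {"nequis": 2, "nobiscum": 5}  # word -> assumed split index
ENCLITICS = ("que", "ne", "ve")


def split_word(word):
    """Apply the exception lists first, then the generic enclitic rule."""
    lower = word.lower()
    if lower in NOT_TO_TOKENIZE:
        return [word]
    if lower in TO_TOKENIZE:
        index = TO_TOKENIZE[lower]
        return [word[:index], word[index:]]
    for enclitic in ENCLITICS:
        if lower.endswith(enclitic) and len(lower) > len(enclitic):
            return [word[: -len(enclitic)], word[-len(enclitic):]]
    return [word]


def tokenize(text):
    """Split on whitespace, keep abbreviations whole, detach punctuation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        match = re.fullmatch(r"(\w+)(\W*)", chunk)
        if match is None:
            tokens.append(chunk)
            continue
        word, punctuation = match.groups()
        tokens.extend(split_word(word))
        tokens.extend(punctuation)  # one token per punctuation character
    return tokens


print(tokenize("Cn. Pompeius senatus populusque nobiscum aeque."))
```

The generic ne/que/ve rule over-splits any word that merely ends in those letters and is missing from the not-to-tokenize list, which is why that list has to be maintained by hand.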
PAULA: TEI BASE TEXT
PAULA: TOKENIZATION
PAULA: SENTENCE SPLIT
CURRENT CHALLENGES
▸ extraction of text from TEI texts may require different scripts
▸ what is the ideal tokenization/word segmentation?
▸ annotation tools do not support standoff annotation
▸ lack of support for XPointer
THANK YOU FOR YOUR ATTENTION!
