SlideShare une entreprise Scribd logo
1  sur  27
dboe@TEI
Remodelling a Database of Dialects into a Rich
LOD Resource
Jack Bowers, Daniel Schopper, Eveline Wandl-Vogt
Presented at TEI Members Meeting
October 2015
Lyon, France
Database of Bavarian
dialects of Austria (DBOE)
• Project collection of vocabulary
from Bavarian dialects of what
was then the Hapsburg
monarchy/Austro-Hungarian
Empire began in 1911…
• Estimated 4 million paper slips
containing lexical, geographical
and other information ….
• Collection of dialectal language
data from locations around the
former Austro-Hungarian
Empire
1911- Collection begins
1963 - dictionary (WBÖ)
1998 database (DBÖ)
2010 dbo@ema online
2012 wboed online
2013 wboe@SKOS
(OpenUp! NHM)1998 Collection ends
Task:
• Convert our legacy dialectal, language data and metadata
into formats that are:
• compatible with one another (SQL and TUSTEP
data);
• ….
• Make LOD compatible;
• Bring formatting in line with standards for best practice
Data:
• TUSTEP Database: 2 299 608 single data entry files; (728 drawers)
• MySQL (dbo@ema): +- 200,000 entries (4 drawers)
Overview of Scenario:
Linguistic and Metadata
• Phonetic representation(s) of dialectal
forms;
• Grammatical info:pos, number, case, etc.
• Details of word formation;
• Etymological information;
• Translation (&/or) definition of meaning;
• Related forms in other dialectal varieties;
• Location data for each record;
• Questionnaires used in elicitation of
dialectal data;
• Bibliographic references;
LOD Enhancement of DBOE
Motivations:
• LOD offers means to base semantics for a given
concept in predefined, inter/multi-lingual knowledge
hubs;
• Desire to have access to dialectal data according to
as many different lexical, semantic, geographic,
temporal, features as possible;
Sources & Services:
• dbpedia; wikidata; geoNames;
• NHM (OpenUp!) Data (plantnames)
• BableNet/Babelfy;
Applicable to:
• Semantics/sense info for language/dialectal data;
• Geographical data;
• Grammatical, lexicographic data (ISOcat)
• Fragebogen; (questionnaires)
Major Challenges in
Converting DBOE to TEI
• Heterogeneous formatting in TUSTEP records;
(1936), c(1906), (1913-23)…
********************
*A* HK 507, k5070426.foe, unkorr.
*HL* (r<ot)kopfecht:2
*NL* :
*VL* :
*QU* Tiersee Juffinger
*QN* {1C.3b03} nUInngeb.:nwNOTir.:öNTir.
*^@ FbB.JUFFINGER· (19xx)
[SFb./EFb./Mtlg.] *O* Thiers. NTir.
===
*NR* 53C5: rot in Komp. (rotfuchset,
Rotblättlein, feuerrot)
*LT1* rEotkopfat []
*BD/LT1* -- *ANMO* von gutaussehenden
Menschen
===
********************
*A* HK 191, t191^#249.52 =
t1910201.pla^#191.660
*HL* tschga:9
*QU* Schüttelkopf, Dt. TierN. in Kä., In:
Carinthia (1906) *S* II, 66 *N* 96
*QN* {wb} mkPubl./MdaWs.:TierN *^@
SCHÜTTELKOPF· (1906) S. [S-5637] *O*
===
*KT1* Tschgå, tschgå *KL* Fg:99 *O* Dtl.
*BD/KT1* Lockruf für Kälber
Dates can appear in a number of formats, with different levels of specificity:
********************
*A* HK 204, d204^#1779.1 = tut1028.res^#1.398
*HL* Tschufit:1
*QU* STir. PflN und TierN (Sammelbel.)
^@^@^@
===
*LT1* Tschuffit *ANM* Duregger *O* Meran
*LT2* Tschufitt *O* Kardaun
*BD/LT1* Zwergohreule; Ephiattes scops Prag
(19xx), (19xx), c(1943-19xx)
…many of which are incomplete:
e.g.
e.g.
Some entry records:
have empty fields;
have same category of data in different places
Major Challenges in
Converting DBOE to TEI
• Invalid (and invisible) non-unicode characters in text out put of
both TUSTEP and SQL;
Plâsepl: me
F: tterpl: me
Major Challenges in
Converting DBOE to TEI
• Labeling @xml:lang for dialectal data;
However,
xml:lang=“bav-AT-7” Tyrol (too vague)
Speech of places in the same region very often are very different, thus we need more
specifics:
xml:lang=“bav-AT-7-zillertal” (not valid)
But… individual towns, place names, or any other geographical identifiers do not have
official status in language identification systems
xml:lang=“bav-AT-7-x-zillertal” (private)
BCP 47 only specifies to the degree of country & administrative region;
(language)-(country)-(region);
(ISO639)-(ISO3166)-(3166-2)
There is only perhaps one case when this is specific enough;
xml:lang=“bav-AT-9” Vienna (ok)
Major Challenges in
Converting DBOE to TEI
• Phonetic transcriptions not in reliably consistent notation:
• Individual data collectors may or may not have followed instructions
to use common phonetic notation (Wienerisch Teuthonista)
Thus, the accuracy of any attempt to convert data to IPA cannot be guaranteed!
• Different conventions were often used by data scientists entering the
handwritten transcriptions into the DB;
• Representation of phonetic transcriptions in DB based on the notation of
the handwritten transcription and not the phonetic forms themselves!
Major Challenges in
Converting DBOE to TEI
• What is the best type of TEI output for data?
Onomasiological or Semisiological
Concept -> term Form -> Sense
TEI Dictionary?
• should we consider other formats?
(TBX-TEI, hybrid?)
en: yellow
de: Gelb
bav-AT-9:
bav-AT-7-x-zillertal:
<entry>
<sense><form>
… ….
….
Potential Output: TEI
Dictionary with Etymology
<entry xml:id="dottergelb-adj-wien-gersthof">
<form type="lemma">
<pron notation="tustep" xml:lang="ba-AT-9">
<seg xml:id="dtrGlbwnGer01">dotE¡r</seg>
<seg xml:id="dtrGlbwnGer02">gö¡lb</seg>
</pron>
<gramGrp>
<pos>adj</pos>
</gramGrp>
</form>
<sense>
<usg type="dom" corresp="http://dbpedia.org/resource/Color">Color</usg>
<usg type="geo">
<placeName type="district" ref="place:W">Wien-Gersthof</placeName>
</usg>
<cit type="translation">
<oRef xml:lang="de">dottergelb</oRef>
</cit>
<etym type="compounding">
<etym type="metonomy">
<cit type=“etymon" corresp="#dotter-Subs-wien">
<pRef corresp="#dtrGlbwnGer01">dotE¡r</pRef>
</cit>
</etym>
<cit type="etymon" corresp="#gelb-adj-wien">
<pRef corresp="#dtrGlbwnGer02">gö¡lb</pRef>
</cit>
</etym>
</sense>
</entry>
Formatting in according to Bowers & Romary (Upcoming)
dialect-form: dotE¡rgö¡lb
de: dottergelb
etymological-form :(dotter)gëlb:2
dialect-place :Gm. Wien
Workflow:
TUSTEP > XML > TEI
TUSTEP entry
BASIC XML
*A* HK 288, F288^#115.1 = F2880826.eck^#72.1
*HL* Fuchs:1
*QU* Schlagen b. Gmunden, Loitlesberger
*QN* {5.3a06} nöSkgt.:swTraunv.:OÖ *^@ Slg.LOITLESBERGER·
(19xx) [Slg."PflN+TierN.Gmunden"+Erg.] *O* Schlagen in Gm.
Gmunden OÖ
===
*NR* 53C6: rot versch. Abstufg. (purpurn, der Purpur)
*LT1* Fuks
*BD/LT1* rotes oder rötliches Pferd
<record n="21">
<A>HK 288, F288^#115.1 = F2880826.eck^#72.1</A>
<HL>Fuchs:1</HL>
<QU>Schlagen b. Gmunden, Loitlesberger</QU>
<QN>
<bibl>{5.3a06} nöSkgt.:swTraunv.:OÖ *^@
Slg.LOITLESBERGER· (19xx)
[Slg."PflN+TierN.Gmunden"+Erg.]</bibl>
<O>Schlagen in Gm. Gmunden OÖ</O>
</QN>
<NR>53C6: rot versch. Abstufg. (purpurn, der Purpur)</NR>
<LT1>Fuks</LT1>
<BD_LT1>rotes oder rötliches Pferd</BD_LT1>
<orig>….</orig>
</record>
<item xml:id="d1e437" n="21">
<ref type="archive">HK 288, F288^#115.1 = F2880826.eck^#72.1</ref>
<cit type="etymon">
<pRef>Fuchs</pRef>
<gramGrp>
<pos sameAs="lemma_wortat-tei-fs.xml#Sub">Sub</pos>
</gramGrp>
</cit>
<ref type="source">Schlagen b. Gmunden, Loitlesberger</ref>
<bibl>
<ref>5.3a06</ref> nöSkgt.:swTraunv.:OÖ
Slg.LOITLESBERGER· (<date>19xx</date>)
[Slg."PflN+TierN.Gmunden"+Erg.]</bibl>
<placeName>Schlagen in Gm. Gmunden OÖ</placeName>
<ref type="questionnaire">53C6: rot versch. Abstufg. (purpurn, der Purpur)</ref>
<cit type="lautung" xml:id="d1e451" n="1">
<pRef>Fuks</pRef>
</cit>
<cit type="translation" corresp="#d1e451">
<oRef xml:lang="de">rotes oder rötliches Pferd</oRef>
</cit>
</item>
TEI <fs>
(grammatical features)
TEI(list)
TEI (listPlace)
• Inventory of all known
places from which we
have language data;
• Establishing inventory of
such references enables
the option of tagging with
attribute or element
values in using data;
• Reusable and extensible
w/in our project and
institution;
• Enhance original
information from LOD
sources like GeoNames..
Key TEI Documents:
listPlace
<listPlace>
<place @xml:id @type @corresp>
<relation @xml:id @type @active
@passive>
<placeName @role @full>
(1..n)
(1..n)
<listPlace>
(1..n)
<location>
(0..n)
<geo>
<note>
(0..n)
(0..n)
Defining Location:
Relational Hierarchies
…
<place xml:id="Öst" type="region" subtype=“administrative”
corresp="http://sws.geonames.org/2782113/about.rdf">
<placeName role="main">
<placeName full="abb" xml:lang="de">Öst.</placeName>
<placeName full="yes" xml:lang="de">Österreich</placeName>
</placeName>
<location>
<geo corresp=“#gis_region_id-862"/>
</location>
</place>
…
<relation name="parentRegion" active="#Öst" passive="#Bgl #Kä #NÖ #OÖ #Sa #St #Tir #W #OÖst #SÖst #NÖSt #WÖSt”/>
<relation name="parentRegion" active="#Tir” passive="#NTir #OTir #TirHocht #sbairTir #TirÜGeb”/>
<place xml:id="Zillert" type=“region" corresp="http://sws.geonames.org/2760567/about.rdf">
<placeName role="main">
<placeName full="abb" xml:lang="de">Zillert.</placeName>
<placeName full="yes" xml:lang="de">Zillertal</placeName>
</placeName>
<location>
<geo corresp=“#gis_region_id-771"/>
</location>
<note>häufige Abk. "Zill.": Tal der Ziller samt einmündender Alpentäler im SO vom —: #mNTir.# --
- vom Sammler Michael Juffinger mit der Ziffer IV bezeichnet ---</note>
</place>
<relation name="parentRegion" active="#mNTir” passive="#Stub #Zillert #Sillgeb #mInntInnsbr"/>
<relation name="parentRegion" active="#NTir" passive="#wNTir #öNTir #mNTir #Innt"/>
Mittel Nord Tirol
Nord Tirol
Tirol
Österreich
Zillertal
(i..n)
**
***
*
de: Zillertal <place xml:id="Zillert" type=“city" corresp="http://sws.geonames.org/2760567/about.rdf">
<placeName role="main">
<placeName full="abb" xml:lang="de">Zillert.</placeName>
<placeName full="yes" xml:lang="de">Zillertal</placeName>
</placeName>
<location>
<geo corresp=“#gis_region_id-771"/>
</location>
<note>häufige Abk. "Zill.": Tal der Ziller samt einmündender Alpentäler im SO vom —: #mNTir.#
--- vom Sammler Michael Juffinger mit der Ziffer IV bezeichnet ---</note>
</place>
Defining Location:
LOD Aspect
TEI <place> geoNames uri
geoNames: linked data, related concepts
geoNames: geo co-ordinates,
demographics,etc. for <place>
Place:
en: Tyrol
de: Tirol
…
<place xml:id="Tir" type="region" subtype=“administrative” corresp="http://sws.geonames.org/2764958/about.rdf">
<placeName role="main">
<placeName full="abb" xml:lang="de">Tir.</placeName>
<placeName full="yes" xml:lang="de">Tirol</placeName>
</placeName>
<location>
<geo corresp=“#gis_region_id-762"/>
</location>
<note>nicht eindeutige und daher eher zu vermeidende Bez.: "Alttirol" für --:
#OTir.#, #NTir.#, #STir.# und Bld. Tirol mit --: #OTir.#, #NTir.# (EKü.)</note>
</place>
…
TEI <place> geoNames uri
geoNames: geo co-
ordinates,
demographics,etc.
for <place>geoNames: linked data, related concepts
Defining Location:
LOD Aspect
• Original (TUSTEP) entries has extensive inventory of grammatical
and lexical features;
• TEI feature structures compiled manually according to
documentation of guidelines;
• Basic part of speech tag (subs, verb, adj, adv, interj, conj,…etc)
inherited from etymological descendant form (Hauptlemma);
• Further detail as to subcategorization, context, inflection form(s), and
collocation specified for each dialectal form (Lautung);
Key TEI Documents:
Feature Structures for Lexical,
Grammatical Categories
Key TEI Documents:
Feature Structures for Lexical,
Grammatical Categories
<fs @xml:id @type @xmlns:dcr @feats>
<f @xml:id @name @xmlns:dcr>
<string @xml:lang>
(1..n)
(0..n)
<fs> used for categorical features, and for complex categories (needing to reference
combine multiple features);
<f> used for most basic categories and for features for which we want to specify
value in one or more strings;
*A* HK 165, d165^#1993.1 = T1650708.sch^#18
*HL* T<ötling:1
*QU* Schüttelkopf Dt. Tiern in Kä. - In: Carinthia II,96 (1906) *S* 66
*QN* {wb} mkPubl./MdaWs.:TierN *^@ SCHÜTTELKOPF· (1906) S.
[S-5637] *O* ob. Mllt.
===
*LT1* Tötling [m]
*BD/LT1* abgespäntes Kalb
<f name="Substantiv" n="1" xml:id="Sub"
xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat="http://www.isocat.org/datcat/DC-3347">
….
</f>
…
<fs type="Gender" xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat=" http://www.isocat.org/datcat/DC-3217">
<f name="MasculineGender" xml:id="Masc"
xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat="http://www.isocat.org/datcat/DC-3312">
<string>m</string>
</f>
……
</fs>
*LT1* [m] = Masculine
e.g.
*HL* :1 = Substative
Grammatical information in
dialectal data
<f name="Substantiv" n="1" xml:id="Sub" xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat="http://www.isocat.org/datcat/DC-3347">
….
<fs type="diminutive" xml:id="SubD" dcr:datcat="http://www.isocat.org/datcat/DC-4922">
<!-- D = Deminutiv/Diminutive = Verkleinerungsform/Koseform (ohne näh.
Unterscheidung);-->
<f name="diminutive-lein" xml:id="SubD1"/>
<!-- D1 = 1. Deminutiv: Häusl, Dearfö (Dörflein), Stüble (lemmatisiert -lein)-->
<f name="diminutive-ellein" xml:id="SubD2"/>
<!-- D2 = 2. Deminutiv: Häuserl, Hansei, Hantale, Fußile (lemmatisiert -ellein) -->
<f name="diminutive-i" xml:id="SubD3"/>
<!-- D3 = Koseform auf -i: Greti, Pferti, Katzi -->
</fs>
….
*A* HK 295, f295^#2793.1 = f2951202.pir^#108.1
*HL* Procken:1
*QU* Tullnitz Mühlhauser
*QN* {8.3c04} Jaispitzt.:nöSMä. *^@ FbB.MÜHLHAUSER· (1922-35)
[SFb.1-6,12-108/EFb.4-9/WrFb.1/Mtlg.] *O* Tullnitz SMä.
===
*NR* 30C23 (F): gr./kl. Brotstück (Reankn, Bröckel,*); Ra.; Komp.
*LT1* Brèkl [D1]
*LT2* Brèka¡rl [D2]
*BD/LT1* kleines und kleinstes Stückchen Brot
*LT1* [D1] = Diminutive (dialectal
cognate to High German ‘-lein’)
*LT2* [D2] = Diminutive (dialectal
cognate to High German ‘-ellein’)
e.g.
*HL* :1 = Substative
Grammatical information in
dialectal data
*A* HK 230, f230^#1382.1 = falh0922.pir^#9.1
*HL* Falb:1
*QU* Linz Wagner
*QN* {5.2a41} Linz:nHausrv.:OÖ *^@ FbB.WAGNER· (19xx)
[SFb./EFb.] *O* Linz OÖ
===
*NR* 53E8: der Falb(e)/Falch(e) als TierN etc.
*LT1* da fåib [m+A]
*LT2* d' fåim [pl+A]
*BD/LT1* der Falb (gelblichgraue Pferde)
<f name="Substantiv" n="1" xml:id="Sub" xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat="http://www.isocat.org/datcat/DC-3347">
….
</f>
<fs type="Gender" xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat=" http://www.isocat.org/datcat/DC-3217”>
<f name="MasculineGender" xml:id="Masc" xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat="http://www.isocat.org/datcat/DC-3312">
<string>m</string>
</f>
</fs>
<fs type="Article" xml:id="Article" xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat="http://www.isocat.org/datcat/DC-1892">
<f name="definiteArticle" xml:id="A-def" xmlns:dcr="http://www.isocat.org/ns/dcr"
dcr:datcat="http://www.isocat.org/datcat/DC-1430">
<string>A</string>
</f>
</fs>
e.g.
*HL* :1 = Substative
*LT1* [m+A] = Masculine + Definite Article
*LT2* [pl+A] = Plural + Definite Article
(Note: gender specified in first *LT* applies to all *LT* for given entry)
Grammatical information in
dialectal data
Key TEI Documents:
Feature Structures for Concepts
• Can contain all and any uri for a given concept in separate <f>’s
under the same <fs>;
• Reusable and extensible w/in our project and institution;
• Usable for both the dialectal data and for certain parts of the original
data such as the field questionnaires that were used to elicit lexical
data from speakers;
• (where possible) concepts will correspond to the sense of a given
lexical item;
• LOD/RDF compatible;
Key TEI Documents:
Feature Structures for Concepts
<fs @xml:id @type @feats>
<f @name=“name”>
<string @xml:lang>
(1..n)
(1..n)
<f @name=“entityType>
(1..n)
<string @xml:lang>
<f @name=“conceptURL” @xml:id @corresp>
(0..n)
(1..n)
….
…
….
<f @name=“conceptURL” @xml:id @corresp>
….
<fs type="concept" xml:id="d1e31">
<f name="name">
<string xml:lang="la">Taraxacum officinale Weber</string>
</f>
<f name="name">
<string xml:lang="de">Wiesen-Löwenzahn</string>
</f>
<f name="entityType">
<string xml:lang="en">plant</string>
</f>
<f name="conceptURL"
xml:id="d1e31_cURL1"
corresp=“http://dbpedia.org/resource/Taraxacum_officinale"/>
<f name="conceptURL"
xml:id="d1e31_cURL2"
corresp=“http://openup.nhm-wien.ac.at/commonNames/references/
scientificName/3926"/>
</fs>
……
<list type="wordforms" corresp="concept:d1e31">
...
<item>
<cit type="example" xml:id="o151503">
<quote>Hasenohrwaschel</quote>
<usg type="geo">
<placeName type="region" ref="place:Zillert">Zillert.</placeName>
</usg>
</cit>
</item>
...
...
<place xml:id="Zillert" type=“region” corresp="http://sws.geonames.org/2760567/about.rdf">
<placeName role="main">
<placeName full="abb" xml:lang="de">Zillert.</placeName>
<placeName full="yes" xml:lang="de">Zillertal</placeName>
</placeName>
<location>
<geo corresp=“#gis_region_id-771"/>
</location>
<note>häufige Abk. "Zill.": Tal der Ziller samt einmündender Alpentäler im SO vom —:
#mNTir.# --- vom Sammler Michael Juffinger mit der Ziffer IV bezeichnet ---</note>
</place>
Conceptual Reference <fs>
Place Reference <listPlace>
Output with Links:
Plantnames
Sample TEI record (from SQL data):
Taraxacum officinale Weber
dialect-form: göl¡bw
de: gelb
etymological-form: gëlb
dialect-place: Kl.Feistritz
Linking of dialectal data with
conceptual infrastructures
<fs type="concept" xml:id="c1e3186">
<f name="name">
<string xml:lang="de">Gelb</string>
</f>
<f name="entityType">
<string>color</string>
</f>
<f name="conceptURL" xml:id="c1e3186_cURL1"
corresp="http://dbpedia.org/resource/yellow"/>
</fs>
Concept:
de: gelb
(en: yellow)
Dialectal entries: (from TUSTEP)
Feature structure
BabelNet entry
dialect-form: gäb
de: gelb
etymological-form: gëlb
dialect-place: Kufstn. NTir.
related concept
Next Steps:
• Continue refining conversion process
• Integrate concepts <fs> with more output (NLP,
manual)
• Begin converting groups of vocabulary
• Develop phonetic <fs> inventory for all transcription
systems used in DBOE
• TEI Dictionary (???)
• Enhanced markup for
• Etymology; LOD semantics, inflection,…
• Conversion to RDF(?)
• Analyses of linguistic content
• Visualize data from DBOE (TUSTEP) and
(DBO@ema) by geographic, temporal and lexical
features
Contact
Jack.Bowers@oeaw.ac.at
http://acdh.oeaw.ac.at
Danke!

Contenu connexe

Tendances

Jsonsaga
JsonsagaJsonsaga
Jsonsaganohmad
 
A couple of things about PostgreSQL...
A couple of things  about PostgreSQL...A couple of things  about PostgreSQL...
A couple of things about PostgreSQL...Federico Campoli
 
Don't panic! - Postgres introduction
Don't panic! - Postgres introductionDon't panic! - Postgres introduction
Don't panic! - Postgres introductionFederico Campoli
 
Improving the Solr Update Chain
Improving the Solr Update ChainImproving the Solr Update Chain
Improving the Solr Update ChainCominvent AS
 
What is DataCite?
What is DataCite?What is DataCite?
What is DataCite?datacite
 
Fixed Field Values By Horizon Name
Fixed Field Values By Horizon NameFixed Field Values By Horizon Name
Fixed Field Values By Horizon NameLowell Ashley
 
Relational Database Examples
Relational Database ExamplesRelational Database Examples
Relational Database ExamplesJohn Cutajar
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql databasebigdatagurus_meetup
 
Database Architectures and Hypertable
Database Architectures and HypertableDatabase Architectures and Hypertable
Database Architectures and Hypertablehypertable
 
Block replication on HDFS
Block replication on HDFSBlock replication on HDFS
Block replication on HDFSKoos van Strien
 
Hypertable
HypertableHypertable
Hypertablebetaisao
 
NoSQL Couchbase Lite & BigData HPCC Systems
NoSQL Couchbase Lite & BigData HPCC SystemsNoSQL Couchbase Lite & BigData HPCC Systems
NoSQL Couchbase Lite & BigData HPCC SystemsFujio Turner
 
Thinking in documents
Thinking in documentsThinking in documents
Thinking in documentsCésar Rodas
 

Tendances (14)

Jsonsaga
JsonsagaJsonsaga
Jsonsaga
 
A couple of things about PostgreSQL...
A couple of things  about PostgreSQL...A couple of things  about PostgreSQL...
A couple of things about PostgreSQL...
 
Don't panic! - Postgres introduction
Don't panic! - Postgres introductionDon't panic! - Postgres introduction
Don't panic! - Postgres introduction
 
Improving the Solr Update Chain
Improving the Solr Update ChainImproving the Solr Update Chain
Improving the Solr Update Chain
 
What is DataCite?
What is DataCite?What is DataCite?
What is DataCite?
 
Fixed Field Values By Horizon Name
Fixed Field Values By Horizon NameFixed Field Values By Horizon Name
Fixed Field Values By Horizon Name
 
Relational Database Examples
Relational Database ExamplesRelational Database Examples
Relational Database Examples
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
 
Database Architectures and Hypertable
Database Architectures and HypertableDatabase Architectures and Hypertable
Database Architectures and Hypertable
 
Block replication on HDFS
Block replication on HDFSBlock replication on HDFS
Block replication on HDFS
 
cyclades eswc2016
cyclades eswc2016cyclades eswc2016
cyclades eswc2016
 
Hypertable
HypertableHypertable
Hypertable
 
NoSQL Couchbase Lite & BigData HPCC Systems
NoSQL Couchbase Lite & BigData HPCC SystemsNoSQL Couchbase Lite & BigData HPCC Systems
NoSQL Couchbase Lite & BigData HPCC Systems
 
Thinking in documents
Thinking in documentsThinking in documents
Thinking in documents
 

Similaire à Remodelling of a Database of Bavarian Dialects into TEI XML and LOD

Exploring data models for heterogenous dialect data: the case of e​xplore.bre...
Exploring data models for heterogenous dialect data: the case of e​xplore.bre...Exploring data models for heterogenous dialect data: the case of e​xplore.bre...
Exploring data models for heterogenous dialect data: the case of e​xplore.bre...Jack Bowers
 
Etymology Markup in TEI XML
Etymology Markup in TEI XMLEtymology Markup in TEI XML
Etymology Markup in TEI XMLJack Bowers
 
Relations matter: Maintaining and Publishing Links in Library Metadata
Relations matter: Maintaining and Publishing Links in Library Metadata Relations matter: Maintaining and Publishing Links in Library Metadata
Relations matter: Maintaining and Publishing Links in Library Metadata Lars G. Svensson
 
Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)Jorge Baptista
 
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationGetty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationVladimir Alexiev, PhD, PMP
 
Furore devdays 2017- rdf1(solbrig)
Furore devdays 2017- rdf1(solbrig)Furore devdays 2017- rdf1(solbrig)
Furore devdays 2017- rdf1(solbrig)DevDays
 
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationGetty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationVladimir Alexiev, PhD, PMP
 

Similaire à Remodelling of a Database of Bavarian Dialects into TEI XML and LOD (10)

Exploring data models for heterogenous dialect data: the case of e​xplore.bre...
Exploring data models for heterogenous dialect data: the case of e​xplore.bre...Exploring data models for heterogenous dialect data: the case of e​xplore.bre...
Exploring data models for heterogenous dialect data: the case of e​xplore.bre...
 
Etymology Markup in TEI XML
Etymology Markup in TEI XMLEtymology Markup in TEI XML
Etymology Markup in TEI XML
 
Relations matter: Maintaining and Publishing Links in Library Metadata
Relations matter: Maintaining and Publishing Links in Library Metadata Relations matter: Maintaining and Publishing Links in Library Metadata
Relations matter: Maintaining and Publishing Links in Library Metadata
 
Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)Corpus annotation for corpus linguistics (nov2009)
Corpus annotation for corpus linguistics (nov2009)
 
RDA Choices: Choosing between the Different RDA and LC/PCC CORE Descriptive E...
RDA Choices: Choosing between the Different RDA and LC/PCC CORE Descriptive E...RDA Choices: Choosing between the Different RDA and LC/PCC CORE Descriptive E...
RDA Choices: Choosing between the Different RDA and LC/PCC CORE Descriptive E...
 
Descriptive Cataloging of Scores in RDA
Descriptive Cataloging of Scores in RDADescriptive Cataloging of Scores in RDA
Descriptive Cataloging of Scores in RDA
 
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationGetty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
 
Furore devdays 2017- rdf1(solbrig)
Furore devdays 2017- rdf1(solbrig)Furore devdays 2017- rdf1(solbrig)
Furore devdays 2017- rdf1(solbrig)
 
Chapter4
Chapter4Chapter4
Chapter4
 
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic RepresentationGetty Vocabulary Program LOD: Ontologies and Semantic Representation
Getty Vocabulary Program LOD: Ontologies and Semantic Representation
 

Dernier

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

Remodelling of a Database of Bavarian Dialects into TEI XML and LOD

  • 1. dboe@TEI Remodelling a Database of Dialects into a Rich LOD Resource Jack Bowers, Daniel Schopper, Eveline Wandl-Vogt Presented at TEI Members Meeting October 2015 Lyon, France
  • 2. Database of Bavarian dialects of Austria (DBOE) • Project collection of vocabulary from Bavarian dialects of what was then the Hapsburg monarchy/Austro-Hungarian Empire began in 1911… • Estimated 4 million paper slips containing lexical, geographical and other information …. • Collection of dialectal language data from locations around the former Austro-Hungarian Empire 1911- Collection begins 1963 - dictionary (WBÖ) 1998 database (DBÖ) 2010 dbo@ema online 2012 wboed online 2013 wboe@SKOS (OpenUp! NHM)1998 Collection ends
  • 3. Task: • Convert our legacy dialectal, language data and metadata into formats that are: • compatible with one another (SQL and TUSTEP data); • …. • Make LOD compatible; • Bring formatting in line with standards for best practice Data: • TUSTEP Database: 2 299 608 single data entry files; (728 drawers) • MySQL (dbo@ema): +- 200,000 entries (4 drawers) Overview of Scenario:
  • 4. Linguistic and Metadata • Phonetic representation(s) of dialectal forms; • Grammatical info:pos, number, case, etc. • Details of word formation; • Etymological information; • Translation (&/or) definition of meaning; • Related forms in other dialectal varieties; • Location data for each record; • Questionnaires used in elicitation of dialectal data; • Bibliographic references;
  • 5. LOD Enhancement of DBOE Motivations: • LOD offers means to base semantics for a given concept in predefined, inter/multi-lingual knowledge hubs; • Desire to have access to dialectal data according to as many different lexical, semantic, geographic, temporal, features as possible; Sources & Services: • dbpedia; wikidata; geoNames; • NHM (OpenUp!) Data (plantnames) • BableNet/Babelfy; Applicable to: • Semantics/sense info for language/dialectal data; • Geographical data; • Grammatical, lexicographic data (ISOcat) • Fragebogen; (questionnaires)
  • 6. Major Challenges in Converting DBOE to TEI • Heterogeneous formatting in TUSTEP records; (1936), c(1906), (1913-23)… ******************** *A* HK 507, k5070426.foe, unkorr. *HL* (r<ot)kopfecht:2 *NL* : *VL* : *QU* Tiersee Juffinger *QN* {1C.3b03} nUInngeb.:nwNOTir.:öNTir. *^@ FbB.JUFFINGER· (19xx) [SFb./EFb./Mtlg.] *O* Thiers. NTir. === *NR* 53C5: rot in Komp. (rotfuchset, Rotblättlein, feuerrot) *LT1* rEotkopfat [] *BD/LT1* -- *ANMO* von gutaussehenden Menschen === ******************** *A* HK 191, t191^#249.52 = t1910201.pla^#191.660 *HL* tschga:9 *QU* Schüttelkopf, Dt. TierN. in Kä., In: Carinthia (1906) *S* II, 66 *N* 96 *QN* {wb} mkPubl./MdaWs.:TierN *^@ SCHÜTTELKOPF· (1906) S. [S-5637] *O* === *KT1* Tschgå, tschgå *KL* Fg:99 *O* Dtl. *BD/KT1* Lockruf für Kälber Dates can appear in a number of formats, with different levels of specificity: ******************** *A* HK 204, d204^#1779.1 = tut1028.res^#1.398 *HL* Tschufit:1 *QU* STir. PflN und TierN (Sammelbel.) ^@^@^@ === *LT1* Tschuffit *ANM* Duregger *O* Meran *LT2* Tschufitt *O* Kardaun *BD/LT1* Zwergohreule; Ephiattes scops Prag (19xx), (19xx), c(1943-19xx) …many of which are incomplete: e.g. e.g. Some entry records: have empty fields; have same category of data in different places
  • 7. Major Challenges in Converting DBOE to TEI • Invalid (and invisible) non-unicode characters in text out put of both TUSTEP and SQL; Plâsepl: me F: tterpl: me
  • 8. Major Challenges in Converting DBOE to TEI • Labeling @xml:lang for dialectal data; However, xml:lang=“bav-AT-7” Tyrol (too vague) Speech of places in the same region very often are very different, thus we need more specifics: xml:lang=“bav-AT-7-zillertal” (not valid) But… individual towns, place names, or any other geographical identifiers do not have official status in language identification systems xml:lang=“bav-AT-7-x-zillertal” (private) BCP 47 only specifies to the degree of country & administrative region; (language)-(country)-(region); (ISO639)-(ISO3166)-(3166-2) There is only perhaps one case when this is specific enough; xml:lang=“bav-AT-9” Vienna (ok)
  • 9. Major Challenges in Converting DBOE to TEI • Phonetic transcriptions not in reliably consistent notation: • Individual data collectors may or may not have followed instructions to use common phonetic notation (Wienerisch Teuthonista) Thus, the accuracy of any attempt to convert data to IPA cannot be guaranteed! • Different conventions were often used by data scientists entering the handwritten transcriptions into the DB; • Representation of phonetic transcriptions in DB based on the notation of the handwritten transcription and not the phonetic forms themselves!
  • 10. Major Challenges in Converting DBOE to TEI • What is the best type of TEI output for data? Onomasiological or Semisiological Concept -> term Form -> Sense TEI Dictionary? • should we consider other formats? (TBX-TEI, hybrid?) en: yellow de: Gelb bav-AT-9: bav-AT-7-x-zillertal: <entry> <sense><form> … …. ….
  • 11. Potential Output: TEI Dictionary with Etymology <entry xml:id="dottergelb-adj-wien-gersthof"> <form type="lemma"> <pron notation="tustep" xml:lang="ba-AT-9"> <seg xml:id="dtrGlbwnGer01">dotE¡r</seg> <seg xml:id="dtrGlbwnGer02">gö¡lb</seg> </pron> <gramGrp> <pos>adj</pos> </gramGrp> </form> <sense> <usg type="dom" corresp="http://dbpedia.org/resource/Color">Color</usg> <usg type="geo"> <placeName type="district" ref="place:W">Wien-Gersthof</placeName> </usg> <cit type="translation"> <oRef xml:lang="de">dottergelb</oRef> </cit> <etym type="compounding"> <etym type="metonomy"> <cit type=“etymon" corresp="#dotter-Subs-wien"> <pRef corresp="#dtrGlbwnGer01">dotE¡r</pRef> </cit> </etym> <cit type="etymon" corresp="#gelb-adj-wien"> <pRef corresp="#dtrGlbwnGer02">gö¡lb</pRef> </cit> </etym> </sense> </entry> Formatting in according to Bowers & Romary (Upcoming) dialect-form: dotE¡rgö¡lb de: dottergelb etymological-form :(dotter)gëlb:2 dialect-place :Gm. Wien
  • 12. Workflow: TUSTEP > XML > TEI TUSTEP entry BASIC XML *A* HK 288, F288^#115.1 = F2880826.eck^#72.1 *HL* Fuchs:1 *QU* Schlagen b. Gmunden, Loitlesberger *QN* {5.3a06} nöSkgt.:swTraunv.:OÖ *^@ Slg.LOITLESBERGER· (19xx) [Slg."PflN+TierN.Gmunden"+Erg.] *O* Schlagen in Gm. Gmunden OÖ === *NR* 53C6: rot versch. Abstufg. (purpurn, der Purpur) *LT1* Fuks *BD/LT1* rotes oder rötliches Pferd <record n="21"> <A>HK 288, F288^#115.1 = F2880826.eck^#72.1</A> <HL>Fuchs:1</HL> <QU>Schlagen b. Gmunden, Loitlesberger</QU> <QN> <bibl>{5.3a06} nöSkgt.:swTraunv.:OÖ *^@ Slg.LOITLESBERGER· (19xx) [Slg."PflN+TierN.Gmunden"+Erg.]</bibl> <O>Schlagen in Gm. Gmunden OÖ</O> </QN> <NR>53C6: rot versch. Abstufg. (purpurn, der Purpur)</NR> <LT1>Fuks</LT1> <BD_LT1>rotes oder rötliches Pferd</BD_LT1> <orig>….</orig> </record> <item xml:id="d1e437" n="21"> <ref type="archive">HK 288, F288^#115.1 = F2880826.eck^#72.1</ref> <cit type="etymon"> <pRef>Fuchs</pRef> <gramGrp> <pos sameAs="lemma_wortat-tei-fs.xml#Sub">Sub</pos> </gramGrp> </cit> <ref type="source">Schlagen b. Gmunden, Loitlesberger</ref> <bibl> <ref>5.3a06</ref> nöSkgt.:swTraunv.:OÖ Slg.LOITLESBERGER· (<date>19xx</date>) [Slg."PflN+TierN.Gmunden"+Erg.]</bibl> <placeName>Schlagen in Gm. Gmunden OÖ</placeName> <ref type="questionnaire">53C6: rot versch. Abstufg. (purpurn, der Purpur)</ref> <cit type="lautung" xml:id="d1e451" n="1"> <pRef>Fuks</pRef> </cit> <cit type="translation" corresp="#d1e451"> <oRef xml:lang="de">rotes oder rötliches Pferd</oRef> </cit> </item> TEI <fs> (grammatical features) TEI(list) TEI (listPlace)
  • 13. • Inventory of all known places from which we have language data; • Establishing inventory of such references enables the option of tagging with attribute or element values in using data; • Reusable and extensible w/in our project and institution; • Enhance original information from LOD sources like GeoNames.. Key TEI Documents: listPlace <listPlace> <place @xml:id @type @corresp> <relation @xml:id @type @active @passive> <placeName @role @full> (1..n) (1..n) <listPlace> (1..n) <location> (0..n) <geo> <note> (0..n) (0..n)
  • 14. Defining Location: Relational Hierarchies … <place xml:id="Öst" type="region" subtype=“administrative” corresp="http://sws.geonames.org/2782113/about.rdf"> <placeName role="main"> <placeName full="abb" xml:lang="de">Öst.</placeName> <placeName full="yes" xml:lang="de">Österreich</placeName> </placeName> <location> <geo corresp=“#gis_region_id-862"/> </location> </place> … <relation name="parentRegion" active="#Öst" passive="#Bgl #Kä #NÖ #OÖ #Sa #St #Tir #W #OÖst #SÖst #NÖSt #WÖSt”/> <relation name="parentRegion" active="#Tir” passive="#NTir #OTir #TirHocht #sbairTir #TirÜGeb”/> <place xml:id="Zillert" type=“region" corresp="http://sws.geonames.org/2760567/about.rdf"> <placeName role="main"> <placeName full="abb" xml:lang="de">Zillert.</placeName> <placeName full="yes" xml:lang="de">Zillertal</placeName> </placeName> <location> <geo corresp=“#gis_region_id-771"/> </location> <note>häufige Abk. "Zill.": Tal der Ziller samt einmündender Alpentäler im SO vom —: #mNTir.# -- - vom Sammler Michael Juffinger mit der Ziffer IV bezeichnet ---</note> </place> <relation name="parentRegion" active="#mNTir” passive="#Stub #Zillert #Sillgeb #mInntInnsbr"/> <relation name="parentRegion" active="#NTir" passive="#wNTir #öNTir #mNTir #Innt"/> Mittel Nord Tirol Nord Tirol Tirol Österreich Zillertal (i..n) ** *** *
  • 15. de: Zillertal <place xml:id="Zillert" type=“city" corresp="http://sws.geonames.org/2760567/about.rdf"> <placeName role="main"> <placeName full="abb" xml:lang="de">Zillert.</placeName> <placeName full="yes" xml:lang="de">Zillertal</placeName> </placeName> <location> <geo corresp=“#gis_region_id-771"/> </location> <note>häufige Abk. "Zill.": Tal der Ziller samt einmündender Alpentäler im SO vom —: #mNTir.# --- vom Sammler Michael Juffinger mit der Ziffer IV bezeichnet ---</note> </place> Defining Location: LOD Aspect TEI <place> geoNames uri geoNames: linked data, related concepts geoNames: geo co-ordinates, demographics,etc. for <place>
  • 16. Place: en: Tyrol de: Tirol … <place xml:id="Tir" type="region" subtype=“administrative” corresp="http://sws.geonames.org/2764958/about.rdf"> <placeName role="main"> <placeName full="abb" xml:lang="de">Tir.</placeName> <placeName full="yes" xml:lang="de">Tirol</placeName> </placeName> <location> <geo corresp=“#gis_region_id-762"/> </location> <note>nicht eindeutige und daher eher zu vermeidende Bez.: "Alttirol" für --: #OTir.#, #NTir.#, #STir.# und Bld. Tirol mit --: #OTir.#, #NTir.# (EKü.)</note> </place> … TEI <place> geoNames uri geoNames: geo co- ordinates, demographics,etc. for <place>geoNames: linked data, related concepts Defining Location: LOD Aspect
  • 17. • Original (TUSTEP) entries has extensive inventory of grammatical and lexical features; • TEI feature structures compiled manually according to documentation of guidelines; • Basic part of speech tag (subs, verb, adj, adv, interj, conj,…etc) inherited from etymological descendant form (Hauptlemma); • Further detail as to subcategorization, context, inflection form(s), and collocation specified for each dialectal form (Lautung); Key TEI Documents: Feature Structures for Lexical, Grammatical Categories
  • 18. Key TEI Documents: Feature Structures for Lexical, Grammatical Categories <fs @xml:id @type @xmlns:dcr @feats> <f @xml:id @name @xmlns:dcr> <string @xml:lang> (1..n) (0..n) <fs> used for categorical features, and for complex categories (needing to reference combine multiple features); <f> used for most basic categories and for features for which we want to specify value in one or more strings;
  • 19. *A* HK 165, d165^#1993.1 = T1650708.sch^#18 *HL* T<ötling:1 *QU* Schüttelkopf Dt. Tiern in Kä. - In: Carinthia II,96 (1906) *S* 66 *QN* {wb} mkPubl./MdaWs.:TierN *^@ SCHÜTTELKOPF· (1906) S. [S-5637] *O* ob. Mllt. === *LT1* Tötling [m] *BD/LT1* abgespäntes Kalb <f name="Substantiv" n="1" xml:id="Sub" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat="http://www.isocat.org/datcat/DC-3347"> …. </f> … <fs type="Gender" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat=" http://www.isocat.org/datcat/DC-3217"> <f name="MasculineGender" xml:id="Masc" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat="http://www.isocat.org/datcat/DC-3312"> <string>m</string> </f> …… </fs> *LT1* [m] = Masculine e.g. *HL* :1 = Substative Grammatical information in dialectal data
  • 20. <f name="Substantiv" n="1" xml:id="Sub" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat="http://www.isocat.org/datcat/DC-3347"> …. <fs type="diminutive" xml:id="SubD" dcr:datcat="http://www.isocat.org/datcat/DC-4922"> <!-- D = Deminutiv/Diminutive = Verkleinerungsform/Koseform (ohne näh. Unterscheidung);--> <f name="diminutive-lein" xml:id="SubD1"/> <!-- D1 = 1. Deminutiv: Häusl, Dearfö (Dörflein), Stüble (lemmatisiert -lein)--> <f name="diminutive-ellein" xml:id="SubD2"/> <!-- D2 = 2. Deminutiv: Häuserl, Hansei, Hantale, Fußile (lemmatisiert -ellein) --> <f name="diminutive-i" xml:id="SubD3"/> <!-- D3 = Koseform auf -i: Greti, Pferti, Katzi --> </fs> …. *A* HK 295, f295^#2793.1 = f2951202.pir^#108.1 *HL* Procken:1 *QU* Tullnitz Mühlhauser *QN* {8.3c04} Jaispitzt.:nöSMä. *^@ FbB.MÜHLHAUSER· (1922-35) [SFb.1-6,12-108/EFb.4-9/WrFb.1/Mtlg.] *O* Tullnitz SMä. === *NR* 30C23 (F): gr./kl. Brotstück (Reankn, Bröckel,*); Ra.; Komp. *LT1* Brèkl [D1] *LT2* Brèka¡rl [D2] *BD/LT1* kleines und kleinstes Stückchen Brot *LT1* [D1] = Diminutive (dialectal cognate to High German ‘-lein’) *LT2* [D2] = Diminutive (dialectal cognate to High German ‘-ellein’) e.g. *HL* :1 = Substative Grammatical information in dialectal data
  • 21. *A* HK 230, f230^#1382.1 = falh0922.pir^#9.1 *HL* Falb:1 *QU* Linz Wagner *QN* {5.2a41} Linz:nHausrv.:OÖ *^@ FbB.WAGNER· (19xx) [SFb./EFb.] *O* Linz OÖ === *NR* 53E8: der Falb(e)/Falch(e) als TierN etc. *LT1* da fåib [m+A] *LT2* d' fåim [pl+A] *BD/LT1* der Falb (gelblichgraue Pferde) <f name="Substantiv" n="1" xml:id="Sub" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat="http://www.isocat.org/datcat/DC-3347"> …. </f> <fs type="Gender" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat=" http://www.isocat.org/datcat/DC-3217”> <f name="MasculineGender" xml:id="Masc" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat="http://www.isocat.org/datcat/DC-3312"> <string>m</string> </f> </fs> <fs type="Article" xml:id="Article" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat="http://www.isocat.org/datcat/DC-1892"> <f name="definiteArticle" xml:id="A-def" xmlns:dcr="http://www.isocat.org/ns/dcr" dcr:datcat="http://www.isocat.org/datcat/DC-1430"> <string>A</string> </f> </fs> e.g. *HL* :1 = Substative *LT1* [m+A] = Masculine + Definite Article *LT2* [pl+A] = Plural + Definite Article (Note: gender specified in first *LT* applies to all *LT* for given entry) Grammatical information in dialectal data
  • 22. Key TEI Documents: Feature Structures for Concepts • Can contain all and any uri for a given concept in separate <f>’s under the same <fs>; • Reusable and extensible w/in our project and institution; • Usable for both the dialectal data and for certain parts of the original data such as the field questionnaires that were used to elicit lexical data from speakers; • (where possible) concepts will correspond to the sense of a given lexical item; • LOD/RDF compatible;
  • 23. Key TEI Documents: Feature Structures for Concepts <fs @xml:id @type @feats> <f @name=“name”> <string @xml:lang> (1..n) (1..n) <f @name=“entityType> (1..n) <string @xml:lang> <f @name=“conceptURL” @xml:id @corresp> (0..n) (1..n) …. … …. <f @name=“conceptURL” @xml:id @corresp> ….
  • 24. <fs type="concept" xml:id="d1e31"> <f name="name"> <string xml:lang="la">Taraxacum officinale Weber</string> </f> <f name="name"> <string xml:lang="de">Wiesen-Löwenzahn</string> </f> <f name="entityType"> <string xml:lang="en">plant</string> </f> <f name="conceptURL" xml:id="d1e31_cURL1" corresp=“http://dbpedia.org/resource/Taraxacum_officinale"/> <f name="conceptURL" xml:id="d1e31_cURL2" corresp=“http://openup.nhm-wien.ac.at/commonNames/references/ scientificName/3926"/> </fs> …… <list type="wordforms" corresp="concept:d1e31"> ... <item> <cit type="example" xml:id="o151503"> <quote>Hasenohrwaschel</quote> <usg type="geo"> <placeName type="region" ref="place:Zillert">Zillert.</placeName> </usg> </cit> </item> ... ... <place xml:id="Zillert" type=“region” corresp="http://sws.geonames.org/2760567/about.rdf"> <placeName role="main"> <placeName full="abb" xml:lang="de">Zillert.</placeName> <placeName full="yes" xml:lang="de">Zillertal</placeName> </placeName> <location> <geo corresp=“#gis_region_id-771"/> </location> <note>häufige Abk. "Zill.": Tal der Ziller samt einmündender Alpentäler im SO vom —: #mNTir.# --- vom Sammler Michael Juffinger mit der Ziffer IV bezeichnet ---</note> </place> Conceptual Reference <fs> Place Reference <listPlace> Output with Links: Plantnames Sample TEI record (from SQL data): Taraxacum officinale Weber
  • 25. dialect-form: göl¡bw de: gelb etymological-form: gëlb dialect-place: Kl.Feistritz Linking of dialectal data with conceptual infrastructures <fs type="concept" xml:id="c1e3186"> <f name="name"> <string xml:lang="de">Gelb</string> </f> <f name="entityType"> <string>color</string> </f> <f name="conceptURL" xml:id="c1e3186_cURL1" corresp="http://dbpedia.org/resource/yellow"/> </fs> Concept: de: gelb (en: yellow) Dialectal entries: (from TUSTEP) Feature structure BabelNet entry dialect-form: gäb de: gelb etymological-form: gëlb dialect-place: Kufstn. NTir. related concept
  • 26. Next Steps: • Continue refining conversion process • Integrate concepts <fs> with more output (NLP, manual) • Begin converting groups of vocabulary • Develop phonetic <fs> inventory for all transcription systems used in DBOE • TEI Dictionary (???) • Enhanced markup for • Etymology; LOD semantics, inflection,… • Conversion to RDF(?) • Analyses of linguistic content • Visualize data from DBOE (TUSTEP) and (DBO@ema) by geographic, temporal and lexical features

Notes de l'éditeur

  1. Examples of heterogeneous TUSTEP formatting; Incomplete and/or inconsistent records; dates: (yyyy); (yyyy-yyyy); (yyyy-yy); ca(yyyy);..(yyxx), (yyyy-yyxx)… Labeling @xml:lang; The data is labeled by place and no dialectal variety has been assigned (which is good); BUT, this leaves the task of labeling the place and language tag: BCP 47: (can use major language tag (‘de’) + the official ISO (3166) subregion tag; after that there are no more standardized codes that can be used d —interesting note: different data sources, and types, lend themselves differently to different format in TEI (omosiological/semisiological); > TUSTEP file entries
  2. Examples of heterogeneous TUSTEP formatting; Incomplete and/or inconsistent records; dates: (yyyy); (yyyy-yyyy); (yyyy-yy); ca(yyyy);..(yyxx), (yyyy-yyxx)… Labeling @xml:lang; The data is labeled by place and no dialectal variety has been assigned (which is good); BUT, this leaves the task of labeling the place and language tag: BCP 47: (can use major language tag (‘de’) + the official ISO (3166) subregion tag; after that there are no more standardized codes that can be used d —interesting note: different data sources, and types, lend themselves differently to different format in TEI (omosiological/semisiological); > TUSTEP file entries
  3. Examples of heterogeneous TUSTEP formatting; Incomplete and/or inconsistent records; dates: (yyyy); (yyyy-yyyy); (yyyy-yy); ca(yyyy);..(yyxx), (yyyy-yyxx)… Labeling @xml:lang; The data is labeled by place and no dialectal variety has been assigned (which is good); BUT, this leaves the task of labeling the place and language tag: BCP 47: (can use major language tag (‘de’) + the official ISO (3166) subregion tag; after that there are no more standardized codes that can be used d —interesting note: different data sources, and types, lend themselves differently to different format in TEI (omosiological/semisiological); > TUSTEP file entries
  4. Examples of heterogeneous TUSTEP formatting; Incomplete and/or inconsistent records; dates: (yyyy); (yyyy-yyyy); (yyyy-yy); ca(yyyy);..(yyxx), (yyyy-yyxx)… Labeling @xml:lang; The data is labeled by place and no dialectal variety has been assigned (which is good); BUT, this leaves the task of labeling the place and language tag: BCP 47: (can use major language tag (‘de’) + the official ISO (3166) subregion tag; after that there are no more standardized codes that can be used d —interesting note: different data sources, and types, lend themselves differently to different format in TEI (omosiological/semisiological); > TUSTEP file entries
  5. Examples of heterogeneous TUSTEP formatting; Incomplete and/or inconsistent records; dates: (yyyy); (yyyy-yyyy); (yyyy-yy); ca(yyyy);..(yyxx), (yyyy-yyxx)… Labeling @xml:lang; The data is labeled by place and no dialectal variety has been assigned (which is good); BUT, this leaves the task of labeling the place and language tag: BCP 47: (can use major language tag (‘de’) + the official ISO (3166) subregion tag; after that there are no more standardized codes that can be used d —interesting note: different data sources, and types, lend themselves differently to different format in TEI (omosiological/semisiological); > TUSTEP file entries
  6. mention problems with typology
  7. *Note: while there are only 9 official administrative federal states (Bundesländer) in Austria, the relations element defining the subordinately related places/regions to the parent region Austria has 3 additional places defined at that level (#SÖst #NÖSt #WÖSt); they instead inherit one of 3 categories from the SQL database which are not always defined in line with administrative definitions. **Note: Middle North Tyrol (mNTir) is not an official administrative or cultural region thus has no externally defined boundaries, uri, etc… ***Note: as is the case in the first note, the @type of each place are not actually defined according to formal/official categorization, they instead inherit one of 3 categories from the SQL database which are not always defined in line with administrative definitions. The instance of Middle North Tyrol, and many other similar regions in DB is defined as a geographic reference point, possibly, but not necessarily due to some geological (landscape), cultural or other features unique to the given zone.
  8. mention problems with typology
  9. mention problems with typology
  10. Entries have different parts of this info labeled in two (primary) places; in ‘Hauptlemma’ (number 1-9); in the ‘Lautung’ w/in brackets “[ ]” (more….)
  11. Entries have different parts of this info labeled in two (primary) places; in ‘Hauptlemma’ (number 1-9); in the ‘Lautung’ w/in brackets “[ ]” (more….)
  12. Entries have different parts of this info labeled in two (primary) places; in ‘Hauptlemma’ (number 1-9); in the ‘Lautung’ w/in brackets “[ ]” (more….)
  13. Entries have different parts of this info labeled in two (primary) places; in ‘Hauptlemma’ (number 1-9); in the ‘Lautung’ w/in brackets “[ ]” (more….)
  14. Entries have different parts of this info labeled in two (primary) places; in ‘Hauptlemma’ (number 1-9); in the ‘Lautung’ w/in brackets “[ ]” (more….)
  15. (obviously is easier with concrete concepts)
  16. Notes: Uri’s are collected manually and automatically <f @name=“entityType”>; used for specifying conceptual domain of entity (where possible) <f @name=“conceptURL”>; can occur for as many different uri/l’s as are collected in Babelfy retrieval process
  17. the use of the prefix: “concept:” points to the feature structure inventory containing the concepts corresponding to the lexical data in the project directory (using Oxygen); in the @corresp is allowed by the declaration of <prefixDef> in the header of the document in which it is used; (see: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-prefixDef.html); in the current case it is declared as follows: <prefixDef ident="ref" matchPattern="([a-z]+)" replacementPattern="file:…/ACDH/data/dboe-concept-features-fs-lod-tei.xml#$1"> <p> In the context of this project, private URIs with the prefix "ref" point to <gi>div</gi> elements in the project's global references.xml file. </p> </prefixDef> This example is from the SQL data; the <list><item> is the current formatting of that output, which like the TUSTEP data, is not finalized and is intended as a placeholder format; each concept gets its own list and the sense is defined in @corresp which points to the concept’s feature structure entry; this list structure is unique to where data has multiple dialectal tokens for a given concept/sense The progress and level of detail in this example contains links to the conceptual inventory feature structures, because of the differences in the source data structures, automating this process (through XQuery script + Babelfy) is not yet feasable for the data from TUSTEP;
  18. (obviously is easier with concrete concepts)
  19. (obviously is easier with concrete concepts)