SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
Data description & conversion
Application and examples
Lemon-aid: using lemon to aid quantitative
historical linguistic analysis
Steven Moran & Martin Br¨ummer
University of Zurich & University of Leipzig
1 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Goals
Convert dictionary and wordlist data into the Lexicon Model
for Ontologies, aka lemon
Leverage Linked Data (LD) to combine disparate lexical
resources (50+) from the QuantHistLing research unit
Resulting LD resources provide researchers with:
more linguistic data in the Linguistic Linked Open Data Cloud
(LLOD)
a translation graph to query across the underlying lexicons and
dictionaries to extract semantically-aligned wordlists
2 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Talk map
Data description & conversion
Ontological model
Application and examples
3 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
QuantHistLing project
Quantitative Historical Linguistics (QuantHistLing) research
unit aims to uncover and clarify phylogenetic relationships
between native South American languages using quantitative
methods
http://quanthistling.info/
There are two main objectives of the project:
Digitalization of lexical resources on South American
languages and
The development of computer-assisted methods and
algorithms to quantitatively analyze the digitized data
4 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
QuantHistLing data
QuantHistLing aims to digitize around 500 works, most of
which are currently only available in print and many of which
are the only resources available for the languages that they
describe
http://quanthistling.info/index.php?id=resources
5 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
QuantHistLing data
Simple data output format that contains metadata (prefixed
with “@”) and tab-delimited lexical output
6 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
QuantHistLing data format
The first row following the metadata contains the data header
with the fields: QLCID, HEAD, HEAD DOCULECT,
TRANSLATION, TRANSLATION DOCULECT
They correspond respectively to the internal QLC unique
identifier, the headword in the dictionary, the doculect of the
headword (or in other words the language which this
particular document describes), the translation for the given
headword, and the doculect that the translation is given in
For each resource a data dump with the same format is
provided by the project
7 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
Conversion from QHL-to-LD
We convert the QLC data into Linked Data that conforms to
the Lemon model with a simple Python script
Lemon is an ontological model for modeling lexicons and
machine-readable dictionaries for linking to the Semantic Web
and the Linked Data cloud
http://lemon-model.net/
Lemon developers also active in the W3C Ontology-Lexica
Community Group
Goal is to “develop models for the representation of lexica (and
machine readable dictionaries) relative to ontologies”
http://www.w3.org/community/ontolex/
8 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
Implementation of QHL data in Lemon
9 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
Implementation of QHL data in Lemon
10 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
Implementation of QHL data in Lemon
11 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
Implementation of QHL data in Lemon
12 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
Implementation of QHL data in Lemon
13 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Overview
Data
Conversion
Implementation of QHL data in Lemon
According to the Lemon model, senses should link to an
ontological entity like a DBpedia resource
However, we only have the strings representing the word forms
which is not enough data to reliably link to other knowledge
bases
we argue that it is linguistically correct to link the senses of
the entries (instead of their word forms) to their respective
translations
so the ‘sense’ resources serve a purpose, even if they don’t
link the meaning of the entry
14 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Application
A major goal in historical-comparative linguistics is the
identification of cognates, i.e. sets of words in genealogically
related languages that have been derived from a common
word or root (e.g. English ‘is’, German ‘ist’, Latin ‘est’, from
Indo-European ‘esti’).
Modeling dictionaries and lexicons in a pivot ontology using
overlaps in translations is one way to merge several resources
into one RDF graph for querying and extracting
semantically-aligned wordlists, which can then be used as
input into computational historical linguistics tools such as
LingPy (List and Moran, 2013).
15 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Application
As a first step, we have converted the QHL data into RDF
and it is available online through a SPARQL endpoint.
http://linked-data.org/sparql/ (preliminary)
http://quanthistlist.info/lod/ (coming soon)
a dump is available at http://linked-data.org/datasets/
Querying the combined dictionaries and lexicons is
straightforward
16 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Return all triples
returns over 3.8 million triples
17 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Pairs of languages in the translation graph that contain
written forms for the lexical sense “casa”
18 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Implementation of QHL data in Lemon
19 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Languages in the translation graph that contain written
forms for the lexical sense “casa”
66 language pairs
marked entry shows word forms are not normalized and
contain data that can be analyzed further
20 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Queries
Can be easily extended to incorporate entire wordlists, such as
the Swadesh list (Swadesh, 1952) or Leipzig-Jakarta list
(Tadmor et al., 2010)
The combination of disparate data from many dictionaries and
lexicons is a first step in a computational historical linguistics
pipeline
Results are given in the source documents’ orthographic
representations
They must be normalized into an interlingual pivot, such as
the International Phonetic Alphabet, if phonetic or phonemic
analysis is to be applied to the data
Next step before producing phonetic alignments and cognate
judgements based on metrics and algorithms for calculating
lexical similarity
21 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Conclusion
From data being digitized and extracted from print resources,
we are creating machine-readable lexicons that are both
interoperable with each other (we link semantic senses using
the Lemon ontology model) and with other linguistics sources
We also interlink the resulting dictionary resources with other
language resources in the LLOD via ISO639-3 codes
22 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Future work - simplify!
Build algorithms that identify semantically similar
translation-pairs from terse translations
Identify that doculect translations like
“coarsely grind”
“grind up, crush well”
“grind lightly (chili pepper, millet for a quick snack)”
“grind lightly (groundnuts) with stones”
for different languages can be mapped to a simpler form such
as “to crush/grind” for initial comparative analysis
23 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Future work - annotate!
Use NLP Interchange Format (Hellmann et al., 2012) to keep
track of where information in the dictionaries comes from - or
in other words, use NIF combined with Lemon to annotate the
QHL data sources for provenance
24 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Future work - link!
Link to further resources that contain linguistic and
non-linguistic information
Typological data and geographic variables that may provide
useful information for determining the genealogical and
geographical relatedness of languages
25 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
Data description & conversion
Application and examples
Application
Examples
Conclusion
Many thanks!
Organiziers and participants of LDL-2013
QuantHistLing Research Unit - Univeristy of Marburg
(Michael Cysouw, PI)
University of Zurich, University of Marburg and University of
Leipzig
26 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing

Contenu connexe

Tendances

Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
mghgk
 
Semantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataSemantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic Data
Matthew Rowe
 
Short Report Bridges performance gap between Relational and RDF
Short Report Bridges performance gap between Relational and RDFShort Report Bridges performance gap between Relational and RDF
Short Report Bridges performance gap between Relational and RDF
Akram Abbasi
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
feiwin
 
Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics
Jie Bao
 
Deploying PHP applications using Virtuoso as Application Server
Deploying PHP applications using Virtuoso as Application ServerDeploying PHP applications using Virtuoso as Application Server
Deploying PHP applications using Virtuoso as Application Server
webhostingguy
 
Address standardization with latent semantic association
Address standardization with latent semantic associationAddress standardization with latent semantic association
Address standardization with latent semantic association
jyhuangtc
 
A wiki for_business_rules_in_open_vocabulary_executable_english
A wiki for_business_rules_in_open_vocabulary_executable_englishA wiki for_business_rules_in_open_vocabulary_executable_english
A wiki for_business_rules_in_open_vocabulary_executable_english
Adrian Walker
 

Tendances (20)

Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
 
RDF and Java
RDF and JavaRDF and Java
RDF and Java
 
Range Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map ReduceRange Query on Big Data Based on Map Reduce
Range Query on Big Data Based on Map Reduce
 
Semantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataSemantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic Data
 
Coling2014:Single Document Keyphrase Extraction Using Label Information
Coling2014:Single Document Keyphrase Extraction Using Label InformationColing2014:Single Document Keyphrase Extraction Using Label Information
Coling2014:Single Document Keyphrase Extraction Using Label Information
 
Unit i data structure FYCS MUMBAI UNIVERSITY SEM II
Unit i  data structure FYCS MUMBAI UNIVERSITY SEM II Unit i  data structure FYCS MUMBAI UNIVERSITY SEM II
Unit i data structure FYCS MUMBAI UNIVERSITY SEM II
 
Short Report Bridges performance gap between Relational and RDF
Short Report Bridges performance gap between Relational and RDFShort Report Bridges performance gap between Relational and RDF
Short Report Bridges performance gap between Relational and RDF
 
Linked sensor data
Linked sensor dataLinked sensor data
Linked sensor data
 
5010
50105010
5010
 
Jarrar: RDF Stores -Challenges and Solutions
Jarrar: RDF Stores -Challenges and SolutionsJarrar: RDF Stores -Challenges and Solutions
Jarrar: RDF Stores -Challenges and Solutions
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
 
Using Page Size for Controlling Duplicate Query Results in Semantic Web
Using Page Size for Controlling Duplicate Query Results in Semantic WebUsing Page Size for Controlling Duplicate Query Results in Semantic Web
Using Page Size for Controlling Duplicate Query Results in Semantic Web
 
Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics
 
Deploying PHP applications using Virtuoso as Application Server
Deploying PHP applications using Virtuoso as Application ServerDeploying PHP applications using Virtuoso as Application Server
Deploying PHP applications using Virtuoso as Application Server
 
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into Texts
 
RDF for PubMedCentral
RDF for PubMedCentral RDF for PubMedCentral
RDF for PubMedCentral
 
Semantic Web Nature
Semantic Web NatureSemantic Web Nature
Semantic Web Nature
 
Linked Data Fragments
Linked Data FragmentsLinked Data Fragments
Linked Data Fragments
 
Address standardization with latent semantic association
Address standardization with latent semantic associationAddress standardization with latent semantic association
Address standardization with latent semantic association
 
A wiki for_business_rules_in_open_vocabulary_executable_english
A wiki for_business_rules_in_open_vocabulary_executable_englishA wiki for_business_rules_in_open_vocabulary_executable_english
A wiki for_business_rules_in_open_vocabulary_executable_english
 

En vedette

Perry bradley and tasmin
Perry   bradley and tasminPerry   bradley and tasmin
Perry bradley and tasmin
SusannaBadger
 
Language analysis essay writing
Language analysis essay writingLanguage analysis essay writing
Language analysis essay writing
Ty171
 

En vedette (11)

Perry bradley and tasmin
Perry   bradley and tasminPerry   bradley and tasmin
Perry bradley and tasmin
 
AS AO1 English Language
AS AO1 English LanguageAS AO1 English Language
AS AO1 English Language
 
Verbal and non- verbal linguistic devices in Pinter’s ‘ The Birthday Party’.
Verbal and non- verbal linguistic devices in Pinter’s ‘ The Birthday Party’.Verbal and non- verbal linguistic devices in Pinter’s ‘ The Birthday Party’.
Verbal and non- verbal linguistic devices in Pinter’s ‘ The Birthday Party’.
 
Gruffalo Linguistic Analysis
Gruffalo Linguistic AnalysisGruffalo Linguistic Analysis
Gruffalo Linguistic Analysis
 
Jamaican creole
Jamaican creoleJamaican creole
Jamaican creole
 
A Level English Language Exam Prep from AQA 2011
A Level English Language Exam Prep from AQA 2011A Level English Language Exam Prep from AQA 2011
A Level English Language Exam Prep from AQA 2011
 
Language skills for higher level responses WJEC
Language skills for higher level responses WJEC Language skills for higher level responses WJEC
Language skills for higher level responses WJEC
 
Jamaican Creole
Jamaican CreoleJamaican Creole
Jamaican Creole
 
Language analysis essay writing
Language analysis essay writingLanguage analysis essay writing
Language analysis essay writing
 
The gruffalo
The gruffaloThe gruffalo
The gruffalo
 
L2 propaganda and linguistic devices
L2 propaganda and linguistic devicesL2 propaganda and linguistic devices
L2 propaganda and linguistic devices
 

Similaire à Lemon-aid: using Lemon to aid quantitative historical linguistic analysis

Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
Dustin Smith
 

Similaire à Lemon-aid: using Lemon to aid quantitative historical linguistic analysis (20)

Bourne RDAP11 Data Publication Repositories
Bourne RDAP11 Data Publication RepositoriesBourne RDAP11 Data Publication Repositories
Bourne RDAP11 Data Publication Repositories
 
Linked Data and Services
Linked Data and ServicesLinked Data and Services
Linked Data and Services
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of Information
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecycle
 
Omitola birmingham cityuniv
Omitola birmingham cityunivOmitola birmingham cityuniv
Omitola birmingham cityuniv
 
#opentourism - Linked Open Data Publishing and Discovery Workshop
#opentourism - Linked Open Data Publishing and Discovery Workshop#opentourism - Linked Open Data Publishing and Discovery Workshop
#opentourism - Linked Open Data Publishing and Discovery Workshop
 
RDAP 033111
RDAP 033111RDAP 033111
RDAP 033111
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Semantic Web Adoption
Semantic Web AdoptionSemantic Web Adoption
Semantic Web Adoption
 
Lider Reference Model ld4lt session March, 3rd, 2015
Lider Reference Model ld4lt session  March, 3rd, 2015Lider Reference Model ld4lt session  March, 3rd, 2015
Lider Reference Model ld4lt session March, 3rd, 2015
 
Semantic Search Component
Semantic Search ComponentSemantic Search Component
Semantic Search Component
 
Linking Big Data to Rich Process Descriptions
Linking Big Data to Rich Process DescriptionsLinking Big Data to Rich Process Descriptions
Linking Big Data to Rich Process Descriptions
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTE...
WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTE...WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTE...
WEB SEARCH ENGINE BASED SEMANTIC SIMILARITY MEASURE BETWEEN WORDS USING PATTE...
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Online Presentation
Online PresentationOnline Presentation
Online Presentation
 
Enabling Language Resources to Expose Translations as Linked Data on the Web
Enabling Language Resources to Expose Translations as Linked Data on the WebEnabling Language Resources to Expose Translations as Linked Data on the Web
Enabling Language Resources to Expose Translations as Linked Data on the Web
 
Establishing the Connection: Creating a Linked Data Version of the BNB
Establishing the Connection: Creating a Linked Data Version of the BNBEstablishing the Connection: Creating a Linked Data Version of the BNB
Establishing the Connection: Creating a Linked Data Version of the BNB
 
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...
iLastic: Linked Data Generation Workflow and User Interface for iMinds Schola...
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
 

Plus de mbruemmer

Oscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media AgenciesOscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media Agencies
mbruemmer
 
Alessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELIAlessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELI
mbruemmer
 
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked DataMark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
mbruemmer
 
Ilan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic ResourcesIlan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic Resources
mbruemmer
 

Plus de mbruemmer (7)

Dirk Goldhahn: Introduction to the German Wortschatz Project
Dirk Goldhahn: Introduction to the German Wortschatz ProjectDirk Goldhahn: Introduction to the German Wortschatz Project
Dirk Goldhahn: Introduction to the German Wortschatz Project
 
Oscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media AgenciesOscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media Agencies
 
Alessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELIAlessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELI
 
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
 
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked DataMark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
 
Ilan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic ResourcesIlan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic Resources
 
Tatiana Gornostay: Language Meets Knowledge in Digital Content Management
Tatiana Gornostay: Language Meets Knowledge in Digital Content ManagementTatiana Gornostay: Language Meets Knowledge in Digital Content Management
Tatiana Gornostay: Language Meets Knowledge in Digital Content Management
 

Dernier

Dernier (20)

Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 

Lemon-aid: using Lemon to aid quantitative historical linguistic analysis

  • 1. Data description & conversion Application and examples Lemon-aid: using lemon to aid quantitative historical linguistic analysis Steven Moran & Martin Br¨ummer University of Zurich & University of Leipzig 1 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 2. Data description & conversion Application and examples Goals Convert dictionary and wordlist data into the Lexicon Model for Ontologies, aka lemon Leverage Linked Data (LD) to combine disparate lexical resources (50+) from the QuantHistLing research unit Resulting LD resources provide researchers with: more linguistic data in the Linguistic Linked Open Data Cloud (LLOD) a translation graph to query across the underlying lexicons and dictionaries to extract semantically-aligned wordlists 2 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 3. Data description & conversion Application and examples Talk map Data description & conversion Ontological model Application and examples 3 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 4. Data description & conversion Application and examples Overview Data Conversion QuantHistLing project Quantitative Historical Linguistics (QuantHistLing) research unit aims to uncover and clarify phylogenetic relationships between native South American languages using quantitative methods http://quanthistling.info/ There are two main objectives of the project: Digitalization of lexical resources on South American languages and The development of computer-assisted methods and algorithms to quantitatively analyze the digitized data 4 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 5. Data description & conversion Application and examples Overview Data Conversion QuantHistLing data QuantHistLing aims to digitize around 500 works, most of which are currently only available in print and many of which are the only resources available for the languages that they describe http://quanthistling.info/index.php?id=resources 5 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 6. Data description & conversion Application and examples Overview Data Conversion QuantHistLing data Simple data output format that contains metadata (prefixed with “@”) and tab-delimited lexical output 6 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 7. Data description & conversion Application and examples Overview Data Conversion QuantHistLing data format The first row following the metadata contains the data header with the fields: QLCID, HEAD, HEAD DOCULECT, TRANSLATION, TRANSLATION DOCULECT They correspond respectively to the internal QLC unique identifier, the headword in the dictionary, the doculect of the headword (or in other words the language which this particular document describes), the translation for the given headword, and the doculect that the translation is given in For each resource a data dump with the same format is provided by the project 7 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 8. Data description & conversion Application and examples Overview Data Conversion Conversion from QHL-to-LD We convert the QLC data into Linked Data that conforms to the Lemon model with a simple Python script Lemon is an ontological model for modeling lexicons and machine-readable dictionaries for linking to the Semantic Web and the Linked Data cloud http://lemon-model.net/ Lemon developers also active in the W3C Ontology-Lexica Community Group Goal is to “develop models for the representation of lexica (and machine readable dictionaries) relative to ontologies” http://www.w3.org/community/ontolex/ 8 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 9. Data description & conversion Application and examples Overview Data Conversion Implementation of QHL data in Lemon 9 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 10. Data description & conversion Application and examples Overview Data Conversion Implementation of QHL data in Lemon 10 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 11. Data description & conversion Application and examples Overview Data Conversion Implementation of QHL data in Lemon 11 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 12. Data description & conversion Application and examples Overview Data Conversion Implementation of QHL data in Lemon 12 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 13. Data description & conversion Application and examples Overview Data Conversion Implementation of QHL data in Lemon 13 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 14. Data description & conversion Application and examples Overview Data Conversion Implementation of QHL data in Lemon According to the Lemon model, senses should link to an ontological entity like a DBpedia resource However, we only have the strings representing the word forms which is not enough data to reliably link to other knowledge bases we argue that it is linguistically correct to link the senses of the entries (instead of their word forms) to their respective translations so the ‘sense’ resources serve a purpose, even if they don’t link the meaning of the entry 14 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 15. Data description & conversion Application and examples Application Examples Conclusion Application A major goal in historical-comparative linguistics is the identification of cognates, i.e. sets of words in genealogically related languages that have been derived from a common word or root (e.g. English ‘is’, German ‘ist’, Latin ‘est’, from Indo-European ‘esti’). Modeling dictionaries and lexicons in a pivot ontology using overlaps in translations is one way to merge several resources into one RDF graph for querying and extracting semantically-aligned wordlists, which can then be used as input into computational historical linguistics tools such as LingPy (List and Moran, 2013). 15 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 16. Data description & conversion Application and examples Application Examples Conclusion Application As a first step, we have converted the QHL data into RDF and it is available online through a SPARQL endpoint. http://linked-data.org/sparql/ (preliminary) http://quanthistlist.info/lod/ (coming soon) a dump is available at http://linked-data.org/datasets/ Querying the combined dictionaries and lexicons is straightforward 16 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 17. Data description & conversion Application and examples Application Examples Conclusion Return all triples returns over 3.8 million triples 17 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 18. Data description & conversion Application and examples Application Examples Conclusion Pairs of languages in the translation graph that contain written forms for the lexical sense “casa” 18 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 19. Data description & conversion Application and examples Application Examples Conclusion Implementation of QHL data in Lemon 19 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 20. Data description & conversion Application and examples Application Examples Conclusion Languages in the translation graph that contain written forms for the lexical sense “casa” 66 language pairs marked entry shows word forms are not normalized and contain data that can be analyzed further 20 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 21. Data description & conversion Application and examples Application Examples Conclusion Queries Can be easily extended to incorporate entire wordlists, such as the Swadesh list (Swadesh, 1952) or Leipzig-Jakarta list (Tadmor et al., 2010) The combination of disparate data from many dictionaries and lexicons is a first step in a computational historical linguistics pipeline Results are given in the source documents’ orthographic representations They must be normalized into an interlingual pivot, such as the International Phonetic Alphabet, if phonetic or phonemic analysis is to be applied to the data Next step before producing phonetic alignments and cognate judgements based on metrics and algorithms for calculating lexical similarity 21 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 22. Data description & conversion Application and examples Application Examples Conclusion Conclusion From data being digitized and extracted from print resources, we are creating machine-readable lexicons that are both interoperable with each other (we link semantic senses using the Lemon ontology model) and with other linguistics sources We also interlink the resulting dictionary resources with other language resources in the LLOD via ISO639-3 codes 22 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 23. Data description & conversion Application and examples Application Examples Conclusion Future work - simplify! Build algorithms that identify semantically similar translation-pairs from terse translations Identify that doculect translations like “coarsely grind” “grind up, crush well” “grind lightly (chili pepper, millet for a quick snack)” “grind lightly (groundnuts) with stones” for different languages can be mapped to a simpler form such as “to crush/grind” for initial comparative analysis 23 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 24. Data description & conversion Application and examples Application Examples Conclusion Future work - annotate! Use NLP Interchange Format (Hellmann et al., 2012) to keep track of where information in the dictionaries comes from - or in other words, use NIF combined with Lemon to annotate the QHL data sources for provenance 24 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 25. Data description & conversion Application and examples Application Examples Conclusion Future work - link! Link to further resources that contain linguistic and non-linguistic information Typological data and geographic variables that may provide useful information for determining the genealogical and geographical relatedness of languages 25 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing
  • 26. Data description & conversion Application and examples Application Examples Conclusion Many thanks! Organiziers and participants of LDL-2013 QuantHistLing Research Unit - Univeristy of Marburg (Michael Cysouw, PI) University of Zurich, University of Marburg and University of Leipzig 26 Steven Moran & Martin Br¨ummer Lemon-aid: using lemon for QuantHistLing