Data Analysis for Ancient Corpora
Cody Kingham and Dirk Roorda
FAMES, Cambridge, 2019-01-31
[Figure: bar chart "Parts of Speech after Atnach in ETCBC Phrase"; categories: conj, nmpr, subs, adjv, prep, art; count axis from 0 to 250]
background
description
mini-study
new horizons
• Put researchers in control of their data.
• Empower researchers to fully harness the data available to them.
• Encourage a new paradigm in the humanities.
🤔
data
💰
what’s important
limits
researchers
they decide
Text-Fabric and Hebrew Data
• Free, accessible corpus annotation and analysis tool.
• Published the Amsterdam Hebrew data on GitHub under a free,
open-source license.
• Encouraged researchers to step out of their technological
comfort zones.
A Different Vision
• Researchers are in charge of their data and set the agenda for
its use.
• Researchers are empowered with the tools needed for
powerful data analysis.
• Data is made open-source and freely available.
Text-Fabric
• Graph model: words, phrases, etc. are “nodes”; relationships
between them are “edges.”
• We can model complex data structures better than other
methods (e.g. XML).
• All stored in easy-to-understand, plain-text files. No messy
XML, SQL, etc.
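As a rough illustration of the graph model in the bullets above (not Text-Fabric's own implementation), the sketch below keeps nodes as plain integers, stores node features in dictionaries, and stores one edge relation as a dictionary of sets. The names `otype`, `letters`, `contains`, and `words_of`, and the sample values, are chosen for illustration only.

```python
from collections import defaultdict

# Nodes are just integers; a feature maps a node to a value.
otype = {1: "word", 2: "word", 3: "word", 4: "phrase"}
letters = {1: "Everything", 2: "about", 3: "us"}

# One edge relation: "phrase node contains word node".
contains = defaultdict(set)
for word in (1, 2, 3):
    contains[4].add(word)


def words_of(node):
    """Return the text of the words linked to `node`, in node order."""
    return [letters[w] for w in sorted(contains[node])]


print(words_of(4))
```

Because everything is a dictionary keyed by node number, adding a new annotation layer amounts to adding one more dictionary.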
&P005381 = MSVO 3, 70
#atf: lang qpc
@tablet
@obverse
@column 1
1.a. 2(N14) , SZE~a SAL TUR3~a NUN~a
1.b. 3(N19) , |GISZ.TE|
2. 1(N14) , NAR NUN~a SIG7
3. 2(N04)# , PIRIG~b1 SIG7 URI3~a NUN~a
@column 2
1. 3(N04) , |GISZ.TE| GAR |SZU2.((HI+1(N57))+(HI+1(N57)))| GI4~a
2. , GU7 AZ SI4~f
@reverse
@column 1
1. 3(N14) , SZE~a
2. 3(N19) 5(N04) ,
3. , GU7
@column 2
1. , AZ SI4~f
CTBA|CTBA#CTBA#CTB###0#0#0#3#1#0#2#0#0#2#0#0#2#0#0#0#0#0 D;L;DOTH|;L;DOT#;L;DOTA#;LD#D#H#0#0#0#3#1#0#3#0#0#2#0#0#2#1#1#3#0#0
D;WOE|;WOE#;WOE#;WOE#D##0#0#0#0#0#0#0#0#0#1#0#0#2#0#0#0#0#0 MW;KA|MW;KA#MW;KA#MWK###0#1#0#3#1#0#2#0#0#0#0#2#0#0#0#0#0#0 BRH|
BR#BRA#BR##H#0#0#0#3#1#0#2#0#0#2#0#0#2#1#1#3#0#0 DDO;D|DO;D#DO;D#DO;D#D##0#0#0#0#0#0#0#0#0#1#0#0#2#0#0#0#0#0 BRH|
BR#BRA#BR##H#0#0#0#3#1#0#2#0#0#2#0#0#2#1#1#3#0#0 DABRHM|ABRHM#ABRHM#ABRHM#D##0#0#0#0#0#0#0#0#0#1#0#0#2#0#0#0#0#0
ABRHM|ABRHM#ABRHM#ABRHM###0#0#0#0#0#0#0#0#0#1#0#0#2#0#0#0#0#0 AOLD|AOLD#;LD#;LD###0#5#1#0#1#3#2#0#0#0#0#0#0#0#0#0#0#0 LA;SKX|
A;SKX#A;SKX#A;SKX#L##0#0#0#0#0#0#0#0#0#1#0#0#2#0#0#0#0#0 A;SKX|A;SKX#A;SKX#A;SKX###0#0#0#0#0#0#0#0#0#1#0#0#2#0#0#0#0#0 AOLD|
Syriac NT (Sedra database)
DEUT33,02 >C- >;71C 1.000 >;71C- >C-
DEUT33,02 DT D.@73T 1.000 D.@73T DT
DEUT33,09 BNW B.@N@73JW 1.000 B.@N@73W BNW
EST 01,16 MWMKN M:MW.K@81N 1.000 M:WM.K@81N MWMKN
EST 03,04 B- K.:- 1.000 B.:- B-
EST 03,04 >MRM >@M:R@70M 1.000 >@M:R@70M >MRM
Hebrew Ketiv-Qere (ETCBC)
Cuneiform Uruk (CDLI)
(1:1:1:1) bi P PREFIX|bi+
(1:1:1:2) somi N STEM|POS:N|LEM:{som|ROOT:smw|M|GEN
(1:1:2:1) {ll~ahi PN STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
(1:1:3:1) {l DET PREFIX|Al+
(1:1:3:2) r~aHoma`ni ADJ STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN
(1:1:4:1) {l DET PREFIX|Al+
(1:1:4:2) r~aHiymi ADJ STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN
(1:2:1:1) {lo DET PREFIX|Al+
(1:2:1:2) Hamodu N STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM
Arabic Quran (Tanzil)
Source data of a corpus
TEI, Markdown, ASCII, Database
Data structure of TF - the IKEA spirit
node
order! order!
stacks of components
uniquely identified
words
phrases
chapters
verses
Conversion to TF
TF does more than half of the work
# Consider Phlebas
$ author=Iain M. Banks
## 1
Everything about us,
everything around us,
everything we know [and can know of] is composed ultimately of patterns of nothing;
that’s the bottom line, the final truth.
So where we find we have any control over those patterns,
why not make the most elegant ones, the most enjoyable and good ones,
in our own terms?
## 2
Besides,
it left the humans in the Culture free to take care of the things that really mattered in life,
such as [sports, games, romance,] studying dead languages,
barbarian societies and impossible problems,
and climbing high mountains without the aid of a safety harness.
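One step of such a conversion is splitting each source line into words, each with its letters and the punctuation that follows it. The sketch below is one possible way to do that with a regular expression; it is not the converter used for the slides. Dropping the square brackets before tokenizing is an assumption inferred from the feature files shown on the following slides, and `TOKEN` and `split_line` are hypothetical names.

```python
import re

# A word is a run of word characters (apostrophes allowed inside),
# optionally followed by punctuation characters.
TOKEN = re.compile(r"(\w[\w’']*)([^\w\s]*)")


def split_line(line):
    """Split one source line into (letters, punc) pairs."""
    # Assumption: brackets only mark a gap and are not punctuation.
    line = line.replace("[", "").replace("]", "")
    return [(m.group(1), m.group(2)) for m in TOKEN.finditer(line)]


for pair in split_line("that’s the bottom line, the final truth."):
    print(pair)
```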
@node
@compiler=Dirk Roorda
@description=the letters of a word
@name=Culture quotes from Iain Banks
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T22:20:19Z
Everything
about
us
everything
around
us
everything
we
know
and
can
know
of
is
composed
ultimately
of
patterns
of
nothing
that’s
the
bottom
line
the
final
truth
So
letters
@node
@compiler=Dirk Roorda
@description=the punctuation after a word
@name=Culture quotes from Iain Banks
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T22:20:19Z
3 ,
6 ,
20 ;
24 ,
27 .
38 ,
45 ,
51 ,
55 ?
,
75 ,
78 ,
,
,
83 ,
88 ,
99 .
punc
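The data lines above mix explicit node numbers (`3 ,`) with bare values (`,`). A minimal reader for such lines might look like the sketch below. It assumes that node and value are separated by a tab in the actual file and that a line without a node number belongs to the previous node plus one. Both points reflect my reading of the format rather than anything stated on the slide, and ranges or escaped characters are not handled. `parse_node_feature` is a hypothetical helper, not a Text-Fabric function.

```python
def parse_node_feature(lines):
    """Parse data lines of a node feature into {node: value}."""
    values = {}
    node = 0
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 2:
            # Explicit node number followed by the value.
            node = int(fields[0])
            value = fields[1]
        else:
            # Bare value: assumed to belong to the next node.
            node += 1
            value = fields[0]
        values[node] = value
    return values


punc_lines = ["51\t,", "55\t?", ",", "75\t,", "78\t,", ",", ",", "83\t,"]
punc = parse_node_feature(punc_lines)
print(sorted(punc))
```

Under these assumptions the bare comma after `55 ?` lands on word 56 ("Besides,"), and the two bare commas after `78 ,` land on words 79 and 80.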
banks/tf/
author.tf
gap.tf
letters.tf
number.tf
oslots.tf
otext.tf
otype.tf
punc.tf
terminator.tf
title.tf
TF dataset
otype
@node
@compiler=Dirk Roorda
@name=Culture quotes from Iain Banks
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T22:20:19Z
1-99 word
100 book
101-102 chapter
103-114 line
115-117 sentence
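The `otype` data assigns a type to whole ranges of nodes. The sketch below expands those ranges into one entry per node, assuming tab-separated lines; `parse_otype` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def parse_otype(lines):
    """Expand lines such as '101-102<TAB>chapter' into {node: type}."""
    node_type = {}
    for line in lines:
        spec, kind = line.split("\t")
        first, _, last = spec.partition("-")
        last = last or first  # a single node such as '100'
        for node in range(int(first), int(last) + 1):
            node_type[node] = kind
    return node_type


otype = parse_otype([
    "1-99\tword",
    "100\tbook",
    "101-102\tchapter",
    "103-114\tline",
    "115-117\tsentence",
])
print(Counter(otype.values()))
```

The counts match the source text: 99 words, 1 book, 2 chapters, 12 lines, and 3 sentences.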
oslots
@edge
@compiler=Dirk Roorda
@name=Culture quotes from Iain Banks
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T22:20:19Z
100 1-99
1-55
56-99
1-3
4-6
7-9,14-20
21-27
28-38
39-51
52-55
56
57-75
76-77,81-83
84-88
89-99
1-27
28-55
56-99
1-99 word
100 book
101-102 chapter
103-114 line
115-117 sentence
## 1
Everything about us,
everything around us,
everything we know [and can know of] is composed ultimately of patterns of nothing;
that’s the bottom line, the final truth.
So where we find we have any control over those patterns,
why not make the most elegant ones, the most enjoyable and good ones,
in our own terms?
## 2
Besides,
it left the humans in the Culture free to take care of the things that really mattered in life,
such as [sports, games, romance,] studying dead languages,
barbarian societies and impossible problems,
and climbing high mountains without the aid of a safety harness.
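Each `oslots` line lists the word slots that a non-word node occupies, so a spec such as `7-9,14-20` describes a line with a gap in it. The sketch below reads the first few lines under the same assumptions as before (tab-separated fields, a missing node number meaning "previous node plus one"); `parse_slots` and `parse_oslots` are hypothetical names.

```python
def parse_slots(spec):
    """Turn '7-9,14-20' into [7, 8, 9, 14, ..., 20]."""
    slots = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        last = last or first
        slots.extend(range(int(first), int(last) + 1))
    return slots


def parse_oslots(lines):
    """Map each non-slot node to the list of slots it contains."""
    oslots = {}
    node = 0
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 2:
            node = int(fields[0])
        else:
            node += 1  # assumed implicit numbering
        oslots[node] = parse_slots(fields[-1])
    return oslots


oslots = parse_oslots(["100\t1-99", "1-55", "56-99", "1-3", "4-6", "7-9,14-20"])
print(oslots[105])
```

Node 105 is the third line of chapter 1; words 10 to 13 ("and can know of") fall in the bracketed gap and are therefore absent from its slots.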
otext
@config
@compiler=Dirk Roorda
@fmt:text-orig-full={letters}{punc}
@name=Culture quotes from Iain Banks
@sectionFeatures=title,number
@sectionTypes=book,chapter
@source=Good Reads
@url=https://www.goodreads.com/work/quotes/14366-consider-phlebas
@writtenBy=Text-Fabric
@dateWritten=2019-01-30T22:20:19Z
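The `@fmt:text-orig-full` line is a template that says how to build the text of a word from its features. A minimal sketch of applying such a template follows. Joining words with a single space is an assumption made here for readability, and `render` is a hypothetical helper rather than part of Text-Fabric.

```python
letters = {1: "Everything", 2: "about", 3: "us"}
punc = {3: ","}


def render(slots, template="{letters}{punc}", sep=" "):
    """Fill the template for each slot and join the results."""
    return sep.join(
        template.format(letters=letters.get(s, ""), punc=punc.get(s, ""))
        for s in slots
    )


print(render([1, 2, 3]))  # Everything about us,
```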
Computing - Python - Jupyter notebooks
https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/bhsa/start.ipynb
BHSA
Quran
https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/quran/start.ipynb
https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/syrnt/start.ipynb
Syriac NT
Old Babylonian
https://shebanq.ancient-data.org/hebrew/query?version=4b&id=1050 SHEBANQ
Computing - more power!
https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/bhsa/searchFromMQL.ipynb
BHSA
Quran
https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/quran/search.ipynb
Syriac NT
https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/syrnt/search.ipynb
Old Babylonian
Uruk
https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/uruk/search.ipynb
Uruk
Power to you! (without the programming)
Uruk
Uruk
Mini-Study:
Atnachs and Phrase Divisions
• How often do atnach accents disagree with the ETCBC phrase
divisions?
• Why?
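One way to operationalize the first question is sketched below. It treats an atnach as disagreeing with the phrase division when the word after it still belongs to the same phrase, and it tallies the part of speech of that following word, as in the chart on the title slide. This definition is an assumption, not necessarily the one used in the study, and the toy data is invented for illustration; it is not drawn from the ETCBC database.

```python
from collections import Counter

# Hypothetical toy data: word slot -> phrase id, and word slot -> part of speech.
phrase_of = {1: "p1", 2: "p1", 3: "p2", 4: "p2", 5: "p2", 6: "p3"}
pos = {1: "prep", 2: "subs", 3: "art", 4: "subs", 5: "conj", 6: "nmpr"}
atnach_slots = [2, 4]  # words that carry an atnach accent


def pos_after_atnach_in_phrase(atnach_slots, phrase_of, pos):
    """Count the part of speech of words that follow an atnach
    without a phrase boundary in between."""
    counts = Counter()
    for slot in atnach_slots:
        nxt = slot + 1
        if nxt in phrase_of and phrase_of[nxt] == phrase_of[slot]:
            counts[pos[nxt]] += 1
    return counts


print(pos_after_atnach_in_phrase(atnach_slots, phrase_of, pos))
```

In the toy data the atnach on word 2 coincides with a phrase boundary, while the one on word 4 is followed by a conjunction inside the same phrase.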
Sharing and re-using data
Text-Fabric has been developed by a DANS employee.
As a consequence:
Data export is built in ✅
Provenance tracking is built in ✅
Redistribution of newly created data is built in ✅
sharing #1: GitHub & NBviewer
Work done in a Jupyter notebook inside a GitHub repository
is easy to share.
https://github.com/Nino-cunei/primers/blob/master/oldbabylonian/OB-primer1.ipynb
sharing #2: Export from TF-browser
sharing #3: Zenodo
sharing #4: Create new features
https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/bhsa/share.ipynb
• etcbc/valence/tf : the results of the verbal valence work of Janet Dyk in the
SYNVAR project;

• etcbc/lingo/heads/tf : head words for phrases, work done by Cody Kingham;

• ch-jensen/Semantic-mapping-of-participants/actor/tf : participant analysis in
progress by Christian Høygaard-Jensen;

• cmerwich/bh-reference-system/tf: participant analysis in progress by
Christiaan Erwich;

• or whatever you have in the making!

• HINT: semantic/fuzzy/plurality for collective nouns (Chip Hardy?)
https://github.com/ETCBC/lingo/tree/master/easter/tf/c
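A new feature is, in the end, one more plain-text file with a metadata header and data lines, like the files shown earlier. The sketch below builds the text of such a file by hand to make the layout concrete. It does not use Text-Fabric's own export functions; the feature values and the description are hypothetical, and the tab separator and the blank line after the header are assumptions about the file layout.

```python
def feature_text(values, metadata):
    """Build the text of a node feature file from {node: value}."""
    header = ["@node"] + [f"@{key}={val}" for key, val in sorted(metadata.items())]
    data = [f"{node}\t{value}" for node, value in sorted(values.items())]
    # Metadata block, blank line, then one data line per node.
    return "\n".join(header + [""] + data) + "\n"


# Hypothetical feature: marks two word nodes as collective or not.
collective = {40: "no", 12: "yes"}
text = feature_text(
    collective,
    {"valueType": "str", "description": "hypothetical flag for collective nouns"},
)
print(text)
```

Keeping the metadata in the file itself is what lets provenance travel with the data when the feature is shared.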
Open Science Rocks
thank you
Cody Kingham codykingham@icloud.com
Dirk Roorda dirk.roorda@dans.knaw.nl
