Structured and Unstructured:
                 Extracting Information From Classics
                            Scholarly Texts

                                              Matteo Romanello1
                                     1 Centre for Computing in the Humanities
                                                 King’s College London


                                 Graduate Colloquium - DHSI 2010
                               University of Victoria BC - 8th June 2010



Romanello                                                                         CCH
Extracting Information From Scholarly Texts
The Project at a glance



               Project started in October 2009;
               Disciplines: Digital Humanities, Classics, Computer
               Science;
               co-supervised by:
                       Willard McCarty (KCL, Department of Digital Humanities)
                       Jonathan Ginzburg (KCL, Department of Computer
                       Science)
               project supported by an AHRC (Arts and Humanities
               Research Council) award



Goal

       Devising an automatic system to improve semantic
       information retrieval over a discipline-specific corpus of
       unstructured texts
               focus on secondary sources (e.g. journal papers) as
               opposed to primary sources (i.e. ancient texts)
               automatic -> scalable to large amounts of data
               information retrieval -> the task of retrieving information
               unstructured texts -> raw texts (e.g. .txt files) as opposed
               to structured, XML-encoded texts

       Example
       “Hom. Il. XII 1”: sequence of 14 characters meaning “first line
       of the twelfth book of Homer’s Iliad”
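
       The example above can be sketched in code. This is a minimal
       illustration, not the project's actual method: the lookup tables
       AUTHORS and WORKS and the function names are hypothetical, and the
       parser only handles the simple "Author Work Book Line" shape.

```python
# Hypothetical lookup tables; a real system would draw these from a knowledge base.
AUTHORS = {"Hom.": "Homer"}
WORKS = {"Il.": "Iliad"}

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XII' to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtract when a smaller value precedes a larger one (e.g. IV, IX).
        if i + 1 < len(numeral) and value < ROMAN[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def parse_reference(ref: str) -> dict:
    """Split an 'Author Work Book Line' string into labelled fields."""
    author, work, book, line = ref.split()
    return {
        "author": AUTHORS[author],
        "work": WORKS[work],
        "book": roman_to_int(book),
        "line": int(line),
    }


print(len("Hom. Il. XII 1"))              # 14
print(parse_reference("Hom. Il. XII 1"))
```

       The string itself carries no structure; the meaning only appears
       once each token is mapped to an author, a work, a book and a line.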
Semantic Information Retrieval




                                 Semantic vs. string-matching-based IR
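
       The contrast can be illustrated with a toy sketch. Everything here
       is an assumption made for illustration: the identifier format
       "homer.iliad:12.1", the table SURFACE_TO_ID, and the three sample
       documents are invented and do not come from the project.

```python
# Hypothetical mapping from surface forms to one shared identifier.
SURFACE_TO_ID = {
    "Hom. Il. XII 1": "homer.iliad:12.1",
    "Iliad 12.1": "homer.iliad:12.1",
    "Il. 12, 1": "homer.iliad:12.1",
}

# Invented sample documents.
DOCUMENTS = {
    "doc_a": "As noted at Hom. Il. XII 1, the battle continues.",
    "doc_b": "Compare Iliad 12.1 for the same scene.",
    "doc_c": "This paper discusses Hesiod only.",
}


def string_search(query, docs):
    """Return documents containing the query as a literal substring."""
    return sorted(name for name, text in docs.items() if query in text)


def semantic_search(query, docs):
    """Return documents containing any surface form of the same passage."""
    target = SURFACE_TO_ID[query]
    forms = [form for form, ident in SURFACE_TO_ID.items() if ident == target]
    return sorted(
        name for name, text in docs.items() if any(form in text for form in forms)
    )


print(string_search("Hom. Il. XII 1", DOCUMENTS))    # ['doc_a']
print(semantic_search("Hom. Il. XII 1", DOCUMENTS))  # ['doc_a', 'doc_b']
```

       String matching finds only the literal form; the semantic lookup
       also finds the document that cites the same line differently.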
Named Entities as Entry Point to Information




       Entities to be extracted:
            1   Place Names (ancient and modern);
            2   Relevant Person Names (mythological names, ancient authors,
                modern scholars)
            3   References to primary and secondary sources (canonical
                texts and modern publications about them)
Work Phases




Corpus building




       Getting materials
       Crawling online archives

       Extracting the text from collected documents
               Tools for text extraction from PDF -> open issues with
               Ancient Greek encoding
               re-OCR documents, even the born-digital ones
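
       The slide does not say which encoding issues were met. One kind of
       problem that can affect extracted Ancient Greek is that the same
       accented letter may be stored precomposed or as a base letter plus
       combining marks. The sketch below, using only the standard
       unicodedata module, shows how normalization makes the two forms
       comparable; greek_ratio is a hypothetical helper for spotting
       text where the Greek was lost.

```python
import unicodedata


def greek_ratio(text: str) -> float:
    """Share of alphabetic characters whose Unicode name mentions GREEK."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = [ch for ch in letters if "GREEK" in unicodedata.name(ch, "")]
    return len(greek) / len(letters)


# The same word, once with a precomposed first letter (U+1F04) and once
# as alpha followed by two combining marks (U+0313, U+0301).
precomposed = "\u1f04\u03bd\u03b4\u03c1\u03b1"
decomposed = "\u03b1\u0313\u0301\u03bd\u03b4\u03c1\u03b1"

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(greek_ratio(precomposed))                                 # 1.0
```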




Corpus Building II


       Corpora
               open access, multilingual
               Princeton/Stanford Working Papers in Classics (PSWPC)
               Lexis online
               470 articles in 2 corpora

       OCR
               FineReader
               OCRopus (layout analysis)
               text extracted from PDFs (tools like pdftotext etc.)
               alignment of multiple OCR outputs
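
       A minimal sketch of aligning two OCR outputs, assuming token-level
       alignment with Python's standard difflib is an acceptable stand-in
       (the slides do not name the alignment technique used). The two
       sample strings are invented; the function name is hypothetical.

```python
from difflib import SequenceMatcher


def disagreements(tokens_a, tokens_b):
    """Return the spans where two token sequences differ."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, tokens_a[i1:i2], tokens_b[j1:j2]))
    return diffs


# Invented outputs of two OCR engines for the same line of text.
ocr_a = "Hom. Il. XII 1 is cited here".split()
ocr_b = "Hom. Il. XlI 1 is cited here".split()

print(disagreements(ocr_a, ocr_b))  # [('replace', ['XII'], ['XlI'])]
```

       Spans where the engines disagree are the candidates for manual
       review or for a vote once a third output is available.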

Building the Knowledge Base (KB)

       Goal: integrate different data sources into a single KB
       Why?
               Information about the same entities spread over several
               data sources
               Data sources might use different output formats (raw text,
               DBs, HTML, XML etc.)
               partial overlaps but no interoperability

       How?
               Use of high-level ontologies to map records related to the
               same entity
               Result: KB containing semantic data
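
       The integration step might look roughly like this. It is a sketch
       under strong assumptions: the source names, field names and
       FIELD_MAP are hypothetical, a plain dictionary stands in for an
       ontology, and records are assumed to arrive already tagged with
       the identifier of the entity they describe.

```python
# Hypothetical mapping from each source's fields to shared property names.
FIELD_MAP = {
    "source_a": {"name": "label", "abbr": "abbreviation"},
    "source_b": {"title": "label", "lang": "language"},
}


def to_shared(source, record):
    """Rename a record's fields to the shared vocabulary, dropping the rest."""
    mapping = FIELD_MAP[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def merge(records):
    """Fold (entity_id, source, record) triples into one entry per entity."""
    kb = {}
    for entity_id, source, record in records:
        kb.setdefault(entity_id, {}).update(to_shared(source, record))
    return kb


# Invented records describing the same work in two different schemas.
records = [
    ("iliad", "source_a", {"name": "Iliad", "abbr": "Il."}),
    ("iliad", "source_b", {"title": "Iliad", "lang": "grc"}),
]

print(merge(records))
```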

Corpus Processing



       Tasks
            1   sentence identification
            2   entity extraction (named entity recognition +
                disambiguation)
                       KB used to build up an entity context
            3   canonical reference extraction
                       KB provides training data
            4   modern bibliographic reference extraction
                       KB provides lists of journals, place names, and authors to
                       improve the tool's performance
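
       Task 1 is harder than it looks in this corpus, because citation
       abbreviations end in a full stop. A minimal sketch, assuming a
       hand-written abbreviation list (ABBREVIATIONS is hypothetical and
       could in principle be supplied by the KB):

```python
# Hypothetical abbreviation list; tokens in it never end a sentence here.
ABBREVIATIONS = {"Hom.", "Il.", "Aesch.", "Sept.", "fr.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that lacks final punctuation.
        sentences.append(" ".join(current))
    return sentences


text = "See Hom. Il. XII 1 for the passage. The scene opens in the camp."
print(split_sentences(text))
```

       A splitter that breaks on every full stop would cut this text
       after "Hom." and "Il."; the sketch still fails when an
       abbreviation genuinely ends a sentence.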



Canonical References




Canonical References Extraction

            1   citations used specifically for primary sources (i.e. works of
                ancient authors)
            2   essential entry point to information: they refer to the research
                object, i.e. ancient texts
            3   logical instead of physical citation scheme (e.g., chapter/paragraph
                vs. page)
            4   variation -> time, style, language (regexp insufficient!)

       Example
       Hom. Il. XII 1
       Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
       Hes. fr. 321 M.-W.
       Callimaco, ’ep.’ 28 Pf., 5-6
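
       The point about regular expressions can be checked against the
       examples above. The pattern below is an illustrative assumption,
       written to fit the first example only; the curly quotes of the
       slide are replaced by straight ones and the second example is cut
       at the semicolon.

```python
import re

# Fits "Abbrev. Abbrev. ROMAN digits" and nothing else.
PATTERN = re.compile(r"^[A-Z][a-z]+\. [A-Z][a-z]+\. [IVXLC]+ \d+$")

EXAMPLES = [
    "Hom. Il. XII 1",
    "Aesch. 'Sept.' 565-67, 628-30",
    "Hes. fr. 321 M.-W.",
    "Callimaco, 'ep.' 28 Pf., 5-6",
]


def matches(ref: str) -> bool:
    """True if the whole reference fits the single hand-written pattern."""
    return PATTERN.match(ref) is not None


for ref in EXAMPLES:
    print(f"{ref!r}: {matches(ref)}")
```

       Only the first reference matches. Quoted titles, line ranges,
       fragment numbers, edition sigla and an unabbreviated Italian
       author name each need a further rule, which is the sense in which
       a single regexp is insufficient.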

So What?




       New Possible Research Questions:
               How has the citing of primary sources in Classics changed?
               What are the characteristics of citation and co-citation
               networks?
               Are the traditional IR tools in Classics actually exhaustive?
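
       Once references have been extracted, a co-citation count is simple
       to compute. The sketch below uses invented toy data (three
       imaginary articles) and a hypothetical function name; two works
       are co-cited whenever the same article cites both.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations_by_article):
    """Count, for each pair of works, the articles that cite both."""
    counts = Counter()
    for cited in citations_by_article.values():
        # Sort and deduplicate so each pair has one canonical ordering.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Invented toy data: article -> works it cites.
articles = {
    "article_1": ["Hom. Il.", "Hes. fr.", "Aesch. Sept."],
    "article_2": ["Hom. Il.", "Aesch. Sept."],
    "article_3": ["Hom. Il.", "Hes. fr."],
}

for pair, n in sorted(cocitation_counts(articles).items()):
    print(pair, n)
```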




Why a Digital Humanities project?



               Better understanding of
                       the discipline's specificities
                       users’ needs
               Writing code to develop a project means
                       formalizing the way a given result is obtained
                       creating a repeatable and thus refutable process
                       introducing reasoning based on the analysis of
                       quantitative data into Classics
               Being able to
                       apply the products of DH research to traditional scholarship




Thanks for your attention!
       matteo.romanello@kcl.ac.uk
       http://kcl.academia.edu/MatteoRomanello




