SlideShare une entreprise Scribd logo
1  sur  1
Télécharger pour lire hors ligne
EXTRACTING INFORMATION FROM CLASSICS SCHOLARLY TEXTS
                                                                                                                                                                                           Matteo Romanello, matteo.romanello@kcl.ac.uk

                                                                                                                         THE PROJECT AT A GLANCE                                                                                           EXTRACTING CANONICAL
                                                                                                                                                                                    A KNOWLEDGE-BASED APPROACH
                                                                                                                         ●   PhD in Digital Humanities (DH)                                                                                REFERENCES
                                                                                                                                                                                    Idea:
                                                                                                                         ●   Project started in October 2009                                                                               Classicists usually refer to primary sources by
                                                                                                                         ●   Project co-supervised by:
                                                                                                                                                                                    ●exploiting accurate information contained in          means of abbreviated references, called canonical
                                                                                                                                                                                    structured data sources;                               references.
                                                                                                                               ●   Willard McCarty (CCH KCL)
                                                                                                                               ●   Jonathan Ginzburg (DCS KCL)
                                                                                                                                                                                    ● using information in the KB to train machine-        Explicit linking of primary and secondary sources
                                                                                                                                                                                    learning based components;                             contained in the Digital Library implies being able to
                                                                                                                         ●Supported by the Arts and Humanities Research
                                                                                                                         Council (AHRC)                                             Obstacles:                                             automatically interpret and extract such references,
                                                                                                                                                                                                                                           which is still an open issue.
                                                                                                                                                                                    ●different (heterogeneous) data formats, XML
                                                                                                                                                                                    dialects, DB schemas etc.;                             CRefEx is a tool I'm developing for the automatic
                                                                                                                                                                                                                                           extraction of canonical references.
                                                                                                                         INFORMATION RETRIEVAL IN CLASSICS                              lack of semantic interoperability.
The digital image of the Venetus A (Marc. Graec. Z.454) used for the graphics has been produced by the CHS (Harvard U)




                                                                                                                                                                                    ●



                                                                                                                         ● APh: state of the art in IR (Information Retrieval) in   Possible solution: using high level ontologies (e.g.   EXTRACTING BIBLIOGRAPHIC
                                                                                                                         the Classics                                               CIDOC CRM) to make data interoperable                  REFERENCES
                                                                                                                         ● Model: centralized, analytic (selective), manual         Once integrated into a Knowledge Base that             The task of extracting modern bibliographic for the
                                                                                                                         labour, highly (absolutely?) accurate, subscription-       information can be used to train machine-learning      purpose of Automatic Citation Indexing is well
                                                                                                                         based access                                               based components.                                      established in disciplines such as Computer
                                                                                                                         ● Future(s): How can we complement it? With which                                                                 Science.
                                                                                                                         tools? What can be improved in the tool used by the                   WHICH CORPUS TO BE MINED?                   However, some assumptions behind a tool such as
                                                                                                                         majority of Classicists to retrieve bibliography?                                                                 Parscit (used to build CiteseerX) are not always
                                                                                                                                                                                    3 scenarios:                                           applicable to Classics papers.
                                                                                                                                                                                    1) LEXIS: electronic version of a Classics journal.    This is an example of how the complexity of
                                                                                                                                                                                    Online papers are PDF produced scanning the            Humanities materials can lead to an improvement of
                                                                                                                         ...OTHER POSSIBLE MODEL(S)                                 actual printed copies (open access,no OCR, low         tools and algorithms drawn from other fields.
                                                                                                                         ●   Open source and open access                            quality images);
                                                                                                                             Decentralized: harvesting rather than centralising
                                                                                                                                                                                                                                           EMERGING RESEARCH QUESTIONS
                                                                                                                         ●
                                                                                                                                                                                    2)PSWPC (Princeton Stanford working papers in
                                                                                                                         ●Automatic: how to reduce the labour of manual             Classics): open archive; text stripped from PDFs       ●
                                                                                                                                                                                                                                             How did the practice of citing ancient texts change
                                                                                                                         annotation?                                                often is scarcely accurate (e.g. Greek sequences)
                                                                                                                                                                                                                                           after that tools such as the TLG were introduced?
                                                                                                                         ●   Errorful/noisy input and acceptable error rate         3) JSTOR's Data for Research API: no controls          ●   What the citation networks in Classics look like?
                                                                                                                         ●   Measurable completeness/exhaustivity                   over text processing; access to data for ~180k
                                                                                                                         ●   How the two approaches compare to each other?          Classics papers (to word frequency, extracted          ●What emerges from the comparison with citation
                                                                                                                                                                                    references, bigrams/trigrams, key terms).              networks in other disciplines?
                                                                                                                                                                                                                                           ●What are the access point to information
                                                                                                                                                                                                                                           meaningful for Classicists?


                                                                                                                                                                    Centre of Computing in the Humanities (CCH), King's College London

Contenu connexe

Plus de Matteo Romanello

Greedy Enough for the Grid?
Greedy Enough for the Grid?Greedy Enough for the Grid?
Greedy Enough for the Grid?Matteo Romanello
 
Structured and Unstructured:Extracting Information From Classics Scholarly Texts
Structured and Unstructured:Extracting Information From Classics Scholarly TextsStructured and Unstructured:Extracting Information From Classics Scholarly Texts
Structured and Unstructured:Extracting Information From Classics Scholarly TextsMatteo Romanello
 
Stuctured Vs Unstructured: Extracting Information from Classics Scholarly Texts
Stuctured Vs Unstructured: Extracting Information from Classics Scholarly TextsStuctured Vs Unstructured: Extracting Information from Classics Scholarly Texts
Stuctured Vs Unstructured: Extracting Information from Classics Scholarly TextsMatteo Romanello
 
[poster] Extracting Information From Classics Scholarly Texts
[poster] Extracting Information From Classics Scholarly Texts[poster] Extracting Information From Classics Scholarly Texts
[poster] Extracting Information From Classics Scholarly TextsMatteo Romanello
 
DIGITAL HUMANITIES E FILOLOGIA Un'introduzione
DIGITAL HUMANITIES   E FILOLOGIA   Un'introduzioneDIGITAL HUMANITIES   E FILOLOGIA   Un'introduzione
DIGITAL HUMANITIES E FILOLOGIA Un'introduzioneMatteo Romanello
 
Rethinking Critical Editions of Fragments by Ontologies
Rethinking Critical Editions of Fragments by OntologiesRethinking Critical Editions of Fragments by Ontologies
Rethinking Critical Editions of Fragments by OntologiesMatteo Romanello
 
Presentatio @ ELPUB 2008, Toronto
Presentatio @ ELPUB 2008, TorontoPresentatio @ ELPUB 2008, Toronto
Presentatio @ ELPUB 2008, TorontoMatteo Romanello
 
Linking Primary and Secondary by Microformats
Linking Primary and Secondary by MicroformatsLinking Primary and Secondary by Microformats
Linking Primary and Secondary by MicroformatsMatteo Romanello
 
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...Matteo Romanello
 

Plus de Matteo Romanello (11)

Greedy Enough for the Grid?
Greedy Enough for the Grid?Greedy Enough for the Grid?
Greedy Enough for the Grid?
 
Structured and Unstructured:Extracting Information From Classics Scholarly Texts
Structured and Unstructured:Extracting Information From Classics Scholarly TextsStructured and Unstructured:Extracting Information From Classics Scholarly Texts
Structured and Unstructured:Extracting Information From Classics Scholarly Texts
 
Romanello tokyo
Romanello tokyoRomanello tokyo
Romanello tokyo
 
Stuctured Vs Unstructured: Extracting Information from Classics Scholarly Texts
Stuctured Vs Unstructured: Extracting Information from Classics Scholarly TextsStuctured Vs Unstructured: Extracting Information from Classics Scholarly Texts
Stuctured Vs Unstructured: Extracting Information from Classics Scholarly Texts
 
[poster] Extracting Information From Classics Scholarly Texts
[poster] Extracting Information From Classics Scholarly Texts[poster] Extracting Information From Classics Scholarly Texts
[poster] Extracting Information From Classics Scholarly Texts
 
DIGITAL HUMANITIES E FILOLOGIA Un'introduzione
DIGITAL HUMANITIES   E FILOLOGIA   Un'introduzioneDIGITAL HUMANITIES   E FILOLOGIA   Un'introduzione
DIGITAL HUMANITIES E FILOLOGIA Un'introduzione
 
Ht159 Poster
Ht159 PosterHt159 Poster
Ht159 Poster
 
Rethinking Critical Editions of Fragments by Ontologies
Rethinking Critical Editions of Fragments by OntologiesRethinking Critical Editions of Fragments by Ontologies
Rethinking Critical Editions of Fragments by Ontologies
 
Presentatio @ ELPUB 2008, Toronto
Presentatio @ ELPUB 2008, TorontoPresentatio @ ELPUB 2008, Toronto
Presentatio @ ELPUB 2008, Toronto
 
Linking Primary and Secondary by Microformats
Linking Primary and Secondary by MicroformatsLinking Primary and Secondary by Microformats
Linking Primary and Secondary by Microformats
 
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
 

[poster] Structured and Unstructured: Extracting Information from Classics Scholarly Texts

  • 1. EXTRACTING INFORMATION FROM CLASSICS SCHOLARLY TEXTS Matteo Romanello, matteo.romanello@kcl.ac.uk THE PROJECT AT A GLANCE EXTRACTING CANONICAL A KNOWLEDGE-BASED APPROACH ● PhD in Digital Humanities (DH) REFERENCES Idea: ● Project started in October 2009 Classicists usually refer to primary sources by ● Project co-supervised by: ●exploiting accurate information contained in means of abbreviated references, called canonical structured data sources; references. ● Willard McCarty (CCH KCL) ● Jonathan Ginzburg (DCS KCL) ● using information in the KB to train machine- Explicit linking of primary and secondary sources learning based components; contained in the Digital Library implies being able to ●Supported by the Arts and Humanities Research Council (AHRC) Obstacles: automatically interpret and extract such references, which is still an open issue. ●different (heterogeneous) data formats, XML dialects, DB schemas etc.; CRefEx is a tool I'm developing for the automatic extraction of canonical references. INFORMATION RETRIEVAL IN CLASSICS lack of semantic interoperability. The digital image of the Venetus A (Marc. Graec. Z.454) used for the graphics has been produced by the CHS (Harvard U) ● ● APh: state of the art in IR (Information Retrieval) in Possible solution: using high level ontologies (e.g. EXTRACTING BIBLIOGRAPHIC the Classics CIDOC CRM) to make data interoperable REFERENCES ● Model: centralized, analytic (selective), manual Once integrated into a Knowledge Base that The task of extracting modern bibliographic for the labour, highly (absolutely?) accurate, subscription- information can be used to train machine-learning purpose of Automatic Citation Indexing is well based access based components. established in disciplines such as Computer ● Future(s): How can we complement it? With which Science. tools? What can be improved in the tool used by the WHICH CORPUS TO BE MINED? However, some assumptions behind a tool such as majority of Classicists to retrieve bibliography? Parscit (used to build CiteseerX) are not always 3 scenarios: applicable to Classics papers. 1) LEXIS: electronic version of a Classics journal. This is an example of how the complexity of Online papers are PDF produced scanning the Humanities materials can lead to an improvement of ...OTHER POSSIBLE MODEL(S) actual printed copies (open access,no OCR, low tools and algorithms drawn from other fields. ● Open source and open access quality images); Decentralized: harvesting rather than centralising EMERGING RESEARCH QUESTIONS ● 2)PSWPC (Princeton Stanford working papers in ●Automatic: how to reduce the labour of manual Classics): open archive; text stripped from PDFs ● How did the practice of citing ancient texts change annotation? often is scarcely accurate (e.g. Greek sequences) after that tools such as the TLG were introduced? ● Errorful/noisy input and acceptable error rate 3) JSTOR's Data for Research API: no controls ● What the citation networks in Classics look like? ● Measurable completeness/exhaustivity over text processing; access to data for ~180k ● How the two approaches compare to each other? Classics papers (to word frequency, extracted ●What emerges from the comparison with citation references, bigrams/trigrams, key terms). networks in other disciplines? ●What are the access point to information meaningful for Classicists? Centre of Computing in the Humanities (CCH), King's College London