Institute for Language, Cognition and Computation

The Edinburgh Geoparser and Chalice

Claire Grover
Kate Byrne, Richard Tobin, Jo Walsh

www.inf.ed.ac.uk

Overview of the Edinburgh Geoparser
                                  	

•  System to automatically recognise place names in text and
   disambiguate them with respect to a gazetteer (e.g. ambiguous names
   such as Athens or Springfield).
•  Patchy development over the past few years, funded by a variety of
   projects and applied to a range of data sets:
   –  GeoCrossWalk
   –  BOPCRIS
   –  GeoDigRef (Histpop, BOPCRIS, BL)
   –  Embedding GeoCrossWalk (Stormont Papers)
   –  SYNC3 (online news)
   –  Chalice (EPNS)
   –  Unlock
•  Main concern has been to keep it generally usable while applying it to
   specific data sets.

Overview of the Edinburgh Geoparser

[Pipeline diagram]
•  Geotagging: .txt / .html / .xml → format conversion → tokenisation →
   POS tagging → lemmatisation → named entity recognition → .geotagged.xml
•  Georesolution: .geotagged.xml → gazetteer lookup → resolution → .gaz.xml
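
The two stages in the diagram can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the in-memory gazetteer, the made-up "score" values, the naive capitalised-word tagger and the function names are hypothetical and are not taken from the Geoparser itself, which uses a full linguistic pipeline and external gazetteers.

```python
import re

# Hypothetical toy gazetteer: place name -> candidate entries.
# Coordinates are approximate; the "score" values are made up for illustration.
TOY_GAZETTEER = {
    "Athens": [
        {"label": "Athens, Greece", "lat": 37.98, "lon": 23.73, "score": 0.9},
        {"label": "Athens, Georgia, USA", "lat": 33.96, "lon": -83.38, "score": 0.4},
    ],
    "Springfield": [
        {"label": "Springfield, Illinois, USA", "lat": 39.80, "lon": -89.64, "score": 0.6},
        {"label": "Springfield, Massachusetts, USA", "lat": 42.10, "lon": -72.59, "score": 0.5},
    ],
}


def geotag(text):
    """Stage 1 (geotagging): return capitalised tokens found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t[0].isupper() and t in TOY_GAZETTEER]


def georesolve(place_names):
    """Stage 2 (georesolution): pick the highest-scoring candidate per name."""
    resolved = {}
    for name in place_names:
        candidates = TOY_GAZETTEER[name]
        resolved[name] = max(candidates, key=lambda c: c["score"])
    return resolved


if __name__ == "__main__":
    names = geotag("The delegation flew from Athens to Springfield.")
    for name, entry in georesolve(names).items():
        print(name, "->", entry["label"])
```

Keeping the stages as separate functions mirrors the diagram, where the geotagged output is an intermediate file that the resolution stage consumes.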

                      Evaluation (2009)

SpatialML (gold geotagging)        GeoNames       Unlock
No. of place names                 3628           3628
No. for which gaz entries found    3538           3049
Correct within 5km                 2946           2143
As % of total                      81.2%          59.0%

SpatialML (end-to-end)             GeoNames
No. of place names                 3628
No. for which gaz entries found    2923
Correct within 5km                 2504
As % of total                      69.0%
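
The "correct within 5km" measure can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how distance was computed, so the haversine formula, the mean Earth radius and the function names are all choices made for this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(pairs, threshold_km=5.0):
    """Fraction of (predicted, gold) pairs within the threshold.

    A predicted value of None (no gazetteer entry found) counts as wrong,
    so the result is a share of all place names, as in "As % of total".
    """
    correct = sum(
        1
        for pred, gold in pairs
        if pred is not None and haversine_km(*pred, *gold) <= threshold_km
    )
    return correct / len(pairs)


if __name__ == "__main__":
    pairs = [
        ((55.95, -3.19), (55.95, -3.19)),  # exact match
        (None, (0.0, 0.0)),                # no gazetteer entry found
        ((0.0, 0.0), (1.0, 0.0)),          # roughly 111 km apart
    ]
    print(f"{100 * accuracy_within(pairs):.1f}%")
```

Dividing by the total number of place names, rather than by the number for which gazetteer entries were found, matches the table's "As % of total" row.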

           Current Development Issues
                                    	

•  Open source release
•  Increased configurability
    –  Input formats: plain text, HTML, simple XML, ...
    –  User’s own text analysis: paragraphs, sentences, word tokens,
       place name mark-up
    –  Output formats: map visualisation, text mark-up, …
    –  User input: constrain by area, bounding box, …
•  Choice of gazetteer: GeoNames, Unlock, geonames-local, Pleiades+,
   Chalice historical gazetteer, ...
•  Performance monitoring/evaluation against test sets
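
One way the bounding-box constraint listed above might work is to discard gazetteer candidates that fall outside a user-supplied box. The sketch below is hypothetical: the `BoundingBox` type, the candidate tuples and the rough coordinates are assumptions for illustration, not the Geoparser's actual interface.

```python
from typing import List, NamedTuple, Tuple


class BoundingBox(NamedTuple):
    """A simple lat/lon box; does not handle boxes crossing the antimeridian."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lat: float, lon: float) -> bool:
        return self.south <= lat <= self.north and self.west <= lon <= self.east


Candidate = Tuple[str, float, float]  # (label, lat, lon)


def constrain(candidates: List[Candidate], box: BoundingBox) -> List[Candidate]:
    """Keep only the candidates whose coordinates lie inside the box."""
    return [c for c in candidates if box.contains(c[1], c[2])]


if __name__ == "__main__":
    # Rough box around Great Britain and Ireland (illustrative values).
    box = BoundingBox(west=-11.0, south=49.5, east=2.0, north=61.0)
    candidates = [
        ("Perth, Scotland", 56.40, -3.43),
        ("Perth, Australia", -31.95, 115.86),
    ]
    print(constrain(candidates, box))
```

A hard filter like this is the simplest option; a softer variant could instead down-weight candidates outside the box rather than removing them.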

                  GAP project: Pleiades+	

•  Based on the Pleiades set of ancient place names, but extended in two ways:
   –  By matching Pleiades place names against GeoNames place names in the
      same location and adding the GeoNames alternative names to the Pleiades+
      list. For example, this adds three alternative names for the single
      Pleiades entry for Autricum (Chartrez, Chartres, Shartr), because
      “Autricum” is present in both Pleiades and GeoNames, with the same
      approximate location.
   –  At run time, by looking up place names found in the text against
      GeoNames (as well as against Pleiades+) and then using the alternative
      names from GeoNames to match against the Pleiades+ list. For example,
      Pleiades has no entry for “Egypt”. We look up the name in GeoNames and
      use its alternative names (which include Aegyptus) to match back against
      Pleiades (which does include Aegyptus). (We don't want to simply take
      places directly from GeoNames because, when we tried it, we were
      swamped with irrelevant modern places having names corresponding to
      ancient toponyms.)
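
The two extension steps can be sketched with toy data. All identifiers, record layouts and function names below are hypothetical. In particular, the `same_location` flag stands in for the location comparison, which is out of scope here, and marking the Egypt record as not matched offline is an assumption made only so that the run-time path can be shown.

```python
# Hypothetical toy data: Pleiades name -> Pleiades entry identifier.
PLEIADES = {
    "Autricum": "pleiades-entry-autricum",
    "Aegyptus": "pleiades-entry-aegyptus",
}

# GeoNames-like records: a primary name, alternative names, and a flag saying
# whether the record was judged co-located with a same-named Pleiades entry.
GEONAMES = [
    {"name": "Chartres", "alt": {"Autricum", "Chartrez", "Shartr"}, "same_location": True},
    {"name": "Egypt", "alt": {"Aegyptus"}, "same_location": False},  # assumed, for illustration
]


def build_pleiades_plus(pleiades, geonames):
    """Offline step: copy GeoNames names onto co-located, same-named Pleiades entries."""
    plus = dict(pleiades)
    for rec in geonames:
        names = {rec["name"]} | rec["alt"]
        shared = names.intersection(pleiades)
        if rec["same_location"] and shared:
            entry = pleiades[sorted(shared)[0]]
            for n in names:
                plus.setdefault(n, entry)
    return plus


def lookup(name, plus, geonames):
    """Run-time step: try Pleiades+ directly, then via GeoNames alternative names."""
    if name in plus:
        return plus[name]
    for rec in geonames:
        names = {rec["name"]} | rec["alt"]
        if name in names:
            for alt in sorted(names):
                if alt in plus:
                    return plus[alt]
    return None  # no Pleiades entry reachable: do not fall back to GeoNames itself


if __name__ == "__main__":
    plus = build_pleiades_plus(PLEIADES, GEONAMES)
    print(lookup("Chartres", plus, GEONAMES))  # found via the offline extension
    print(lookup("Egypt", plus, GEONAMES))     # found via run-time alternative names
```

Returning `None` when no Pleiades entry is reachable reflects the point made above: GeoNames is used only as a bridge to ancient places, never as a source of results in its own right.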

                                Chalice
                                      	

•  Connecting Historical Authorities with Linked Data, Contexts, and Entities.
•  Funded under the JISC jiscEXPO programme on exposing digital content
   for education and research.
•  The project is exploring the viability of creating a historical gazetteer from
   digitized volumes from the English Place-Name Society (EPNS).
•  Partners:
    –  CDDA, Queen’s University, Belfast
    –  School of Informatics, Edinburgh
    –  EDINA, Edinburgh
    –  CeRch, King’s College London
•  The Informatics role is to adapt our existing text mining/geoparsing
   technology to convert the textual documents that are output from OCR into
   structured data.

                         Chalice data
                                    	

•  Cheshire
   –  Cheshire Part I. EPNS Volume 44, 1970
   –  Cheshire Part II. EPNS Volume 45, 1970
   –  Cheshire Part III. EPNS Volume 46, 1971
   –  Cheshire Part IV. EPNS Volume 47, 1972
   –  Cheshire Part V (1:i). EPNS Volume 48, 1981
   –  Cheshire Part V (1:ii). EPNS Volume 54, 1981
•  Small samples from:
   –  Berkshire, Buckinghamshire (Vol. 2), Cambridgeshire (Vol. 19),
      Derbyshire (Vols 27-29), Hertfordshire (Vol. 15)
•  Shropshire: Pimhill Hundred (born digital)

                                 EPNS	

•  Parishes are usually organised in terms of the hundreds in which they belong.
•  Towns and villages are usually referred to as townships and are organised in
   terms of the parish in which they belong.
•  Township descriptions often contain relatively unstructured information about
   smaller associated places such as buildings, bridges, lanes, woods and
   farms.
•  Township descriptions also frequently contain separately marked sections of
   information about field names and street names.
•  River and major road names are described separately from the inhabited
   place descriptions.
•  Place names are the primary object of interest, and descriptions of them
   contain information about alternative names and spellings that have been
   attested in historical sources and about the etymology of names or name parts.
•  In Chalice we focus on capturing parishes, townships, sub-townships and
   attestations. We don’t deal with hundreds, field names, street names, rivers,
   roads etc.
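
The hierarchy that Chalice captures (parish, township, sub-township, each with attestations) could be represented along the following lines. This is a hypothetical data model for illustration only; the class and field names are assumptions and do not describe the project's actual output format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Attestation:
    """A historical form of a name, with optional date and source."""
    form: str
    date: Optional[str] = None
    source: Optional[str] = None


@dataclass
class Place:
    """A parish, township or sub-township, with nested child places."""
    name: str
    kind: str  # "parish", "township" or "sub-township"
    attestations: List[Attestation] = field(default_factory=list)
    children: List["Place"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Place"]]:
        """Yield (depth, place) pairs for this place and all its descendants."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


if __name__ == "__main__":
    # Names taken from the example entry in the slides; attestations left empty.
    neston = Place("Neston", "parish", children=[Place("Willaston", "township")])
    for depth, place in neston.walk():
        print("  " * depth + f"{place.kind}: {place.name}")
```

Hundreds are deliberately absent from the model, in line with the scope stated in the last bullet.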

The start of the entry for the township of Willaston in the parish of Neston in
Wirral Hundred.

                                    Issues
                                         	

•  OCR quality needs to be high: not just recognising characters correctly but
   getting font and layout information right. Failure to recognise bold and small
   caps fonts or the difference between a line break and a paragraph break can
   lead to major errors in the recognition process.
•  EPNS volumes vary in their use of layout and font to indicate structure: e.g.
   Cheshire parishes are signaled by centering combined with numbering with
   roman numerals, while Hertfordshire ones are unnumbered but centered and in
   bold font. In some volumes potentially useful information is contained in
   footnotes.
•  Different volumes reflect different decisions about where place name information
   should be put. In most cases the information about the parish name occurs next
   to the town in the parish that has the same name. In the Shropshire text some
   place name information occurs in an earlier volume and is not subsequently
   repeated, e.g. the description of the parish of Baschurch, containing a township
   of the same name, has no attestation or etymological information provided
   because the name was discussed in Part 1.

Thank you!
