Wrinkles in the data

David King

A commentary on some of the issues that have emerged when working with digitised legacy biodiversity literature.

Wrinkles in the data
Dr David King
BL Labs Text Mining Workshop
Thursday 27th November 2014

Standard OCR issues
Attelabus
Attelans
Attelabüs
Attelabw
Attelabtis

… with a twist
æquinoctialis
aequinoctialis
cequiQioctialis
asquinoctialis
sequinoctialis
cequinoctialis
wquinoctialis

… somewhat better
Otiorhynchinæ → Otiorhynchinœ
Alatæ → Alatse
Curculionidæ → Curculionidœ
Attelabinæ → Attelabinae
Pterocolinæ → Pterocolinœ
Allocoryninæ → Allocoryninee
Apioninæ → Apioninœ
Thecesterninæ → Thecesterninae
Otiorhynchinæ → Otiorhynchinre
Attelabinæ → Attelabinae
Pterocolinæ → Pterocolinse
Allocoryninæ → Allocoryninse
Thecesterninæ → Thecesterninse
Apioninæ → Apioninae
Otiorhynchinæ → Otiorhynchinae
Otiorhynchinæ → Otiorhynchinae
Curculionidæ → Curculionidœ
Curculioninæ → Curculioninae
Calandrinæ → Calandrinse
Brenthidæ → Brenthidae
Scolytidæ → Scolytidae
Anthribidæ → Anthribidae
Hispidæ → Hispida?
Cassididæ → Cassididae
24 æ rendered as:
11 ae
5 œ
5 se
1 ee
1 re
1 a?

… depending where you are
Sciurus
In body text: 70 occurrences, 68 correct
As an isolated word: 7 occurrences, 1 correct

… with a twist
Terms                                      Levenshtein (whole word: beginning, middle, end segments)
denticulate → denticulata, denticulatæ     1: 0,0,1
denticulate → bidenticulate                2: 4,0,0
denticulate → reticulate                   2: 3,2,0
denticulate → geniculate                   2: 2,2,0

… and of course
Tachgs → Tachys
Pioa → Pica
Homa → Homo
Pieris → Pieris

Thank you!
And any questions…
Dr David King
Research Fellow,
Computing and Communications Department,
Faculty of Mathematics, Computing and Technology,
The Open University, MK7 6AA, UK.
mailto: david.king@open.ac.uk
webpage: http://www9.open.ac.uk/mct/people/david.king
tel: +44 (0)1908 652695

1. Wrinkles in the data
2. So far, so conventional
3. Standard OCR issues
4. … with a twist
5. … somewhat better
6. … depending where you are
7. … with a twist
8. … and of course
9. Thank you! And any questions…

Editor's notes

A brief run-through of some of the issues that have arisen in my work on legacy biodiversity literature over the last five years. The literature matters to today’s researchers because knowing where wild relatives of modern crops might live can inform new field work that seeks to enhance the genetic composition of those crops. Similarly, the old literature can provide insights into the habitats and climates of a hundred years ago and more, giving modern climate researchers an extended baseline of data against which to track changes.
Here’s a screenshot of Manchester’s brat, which I used to mark up some biodiversity literature at a detailed level. Why reinvent the wheel when there are good tools already? One issue not so obvious from here is that names change. Hence we have to normalise, just as you do with place names that change through time. An added wrinkle of historic biodiversity data is that developing tools to identify taxonomic names in modern, born-digital literature is one thing, helped by the presence of internationally recognised naming conventions; the really old literature pre-dates the conventions, and the old scientists could be really creative in their taxonomic names! Note that in line 00, Dendrececa is an OCR error for Dendreœca. See slide 4 for more about this wrinkle.
We have the conventional issues of identifying names in text, complicated by the absence of a gazetteer to act as a master reference against which to verify our findings. We’re into fuzzy matching and other techniques to address these issues. We at the OU have produced curation tools to help resolve candidate names, but our ideal would be to automate as much as possible, simply because of the volume of text we want to work with. Two post-docs at Plazi marked up 2,500 pages of Hymenoptera-related literature over a year using their in-house tool. That’s not tractable when dealing with potentially millions of pages.
The twist is that in these examples we will always have an error. This example is a document from the Smithsonian, Washington, which outsourced its digitisation to the Internet Archive, which in turn OCRed the texts using a standard North American English dictionary… and character set, which lacks the æ ligature. Hence we know that every occurrence of the æ ligature in the text will be rendered incorrectly. The same issue causes a real headache when mining biological texts, through the mis-recognition of ♂ and ♀.
And even if the document is OCRed using a more appropriate character set, as in this example from the Natural History Museum, London, you have no guarantee that the results are much better! This example is taken from an introductory page in a book about beetles. It lists the families and sub-families to be discussed in the book. Nowadays, the taxonomic rank of family is identified in the name by the ending -idae, and sub-family by -inae. This old text, however, uses the ligature æ for ae. Of the 24 names on this page, not one æ ligature is reproduced as a ligature: 11 come out as the usable "ae", and the other 13 as œ, se, ee, re or a?. For completeness, there was one valid use of œ in the source page, and it was rendered correctly by the OCR.
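One way to act on the tally from slide 5 is to fold the observed renderings of a terminal æ back to "ae" when they follow a family (-id) or sub-family (-in) stem. The following is a minimal sketch only: the function name, the regular expression and the restriction to these two ranks are assumptions made for illustration, not the author's tooling, and the variant list comes from a single page, so it would both miss other renderings and risk false positives on ordinary words.

```python
import re
from typing import Optional

# Renderings of a terminal "æ" seen on slide 5, plus the ligature itself.
# Assumption: these substitutions are only trusted at the end of a rank name.
AE_VARIANTS = ("æ", "ae", "œ", "se", "ee", "re", "a?")

# Capitalised stem ending in "id" (family) or "in" (sub-family), then one variant.
RANK_ENDING = re.compile(
    r"^(?P<stem>[A-Z][a-z]+(?:id|in))(?:"
    + "|".join(re.escape(v) for v in AE_VARIANTS)
    + r")$"
)


def normalise_rank_name(token: str) -> Optional[str]:
    """Return the token with a normalised -idae/-inae ending, or None if it does not fit."""
    match = RANK_ENDING.match(token)
    if match is None:
        return None
    return match.group("stem") + "ae"


if __name__ == "__main__":
    for ocr in ["Otiorhynchinœ", "Allocoryninee", "Hispida?", "Alatse"]:
        print(ocr, "->", normalise_rank_name(ocr))
```

Alatse is deliberately left alone here because its stem does not end in -id or -in; a broader rule would be needed for names outside the two ranks.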
Investigating OCR accuracy further, I examined a text about Central American squirrels of the genus Sciurus. The word was not in the OCR engine’s dictionary, yet the accuracy rate was very high when the word was located in body text (68 of 70 occurrences correct), because there were enough accurately identified words around it for the engine to identify individual characters confidently. Isolated words, usually set in a different font from the body text, destroyed the accuracy (1 of 7 correct). The headline from this work seems to be that while humans can read tables of contents and page headers to navigate accurately around a document, computers really need to process the whole text to build up a meaningful index for navigating the content.
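Expressed as rates, the counts on slide 6 make the contrast plain. The snippet below is just the arithmetic on those four numbers; the function name is illustrative.

```python
def accuracy(correct: int, total: int) -> float:
    """Percentage of occurrences that the OCR rendered correctly."""
    return 100.0 * correct / total


body_text = accuracy(68, 70)   # "Sciurus" within running body text
isolated = accuracy(1, 7)      # "Sciurus" as an isolated word (headers and the like)

print(f"body text: {body_text:.1f}%")  # body text: 97.1%
print(f"isolated:  {isolated:.1f}%")   # isolated:  14.3%
```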
However, we cannot simply group words and use the groups to determine a text’s content. The examples all start from ‘denticulate’, which has 11 characters. The first example shows simple OCR errors: the terminal ‘e’ is mis-recognised as an ‘a’ or an ‘æ’, producing an edit distance of one. The second example is a correct OCR of a similar but distinct word, with an edit distance of two. The next two examples show words that are not connected; it is happenstance that their common ending suggests they could be clustered on the basis of their distance score of two. Given these issues in grouping, which in this case arise from the peculiarities of taxonomic names, some time ago I began to explore applying Levenshtein distance to parts of words rather than to the whole. This is one of many ideas I would love to revisit, despite the pretty crazy results you can see here when breaking the words into five-character segments: beginning, middle and end. This issue relates back to the earlier one of taxonomic name similarity. See slide 5 with its list of family and sub-family names. Sometimes the same root is used for both taxonomic levels, Curculionidæ and Curculioninæ for example, yet they are distinct terms.
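The whole-word distances on slide 7 can be checked with a textbook Levenshtein implementation, sketched below. The segmented variant is an assumption: the notes do not say how the five-character beginning, middle and end segments were cut, so `segments` uses one plausible rule (first five, centred five, last five characters) and should not be expected to reproduce every per-segment figure on the slide.

```python
from typing import Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def segments(word: str, size: int = 5) -> Tuple[str, str, str]:
    """Assumed segmentation: first, centred and last `size` characters."""
    mid = (len(word) - size) // 2
    return word[:size], word[mid:mid + size], word[-size:]


def segment_distances(a: str, b: str, size: int = 5) -> Tuple[int, ...]:
    """Edit distance between corresponding segments of two words."""
    return tuple(levenshtein(x, y) for x, y in zip(segments(a, size), segments(b, size)))


if __name__ == "__main__":
    base = "denticulate"
    for other in ["denticulata", "denticulatæ", "bidenticulate", "reticulate", "geniculate"]:
        print(other, levenshtein(base, other), segment_distances(base, other))
```

The whole-word scores alone cannot separate the OCR variant from the unrelated words, which is the clustering problem the slide illustrates.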
And to conclude, a few more wrinkles to think about:
1. Tachgs: the OCR is correct. It has accurately rendered a typo in the original.
2. Pioa: there is no genus Pioa, so by reference to, say, the Catalogue of Life we know it is wrong.
3. Homa: here man has become insect. Both Homo and its mis-OCRed variant Homa are valid genera, so it is more difficult to pick up the mistake.
4. Pieris: which one is the butterfly and which the heather? And should I mention that other favourite overloaded name, Bacillus: a genus of stick insect, or are we talking about bacteria?
Context can assist, whether through collocation or other methods. There are many examples I could cite, such as the cichlids of Lake George: putting Lake George into a gazetteer, I got ten townships in the US and Canada all called Lake George, but I did not get the lake in Uganda and its associated metadata. Let’s talk semantic web, shall we? There are so many wonderfully challenging edge cases to explore! Therefore, I suggest that the wrinkles mean we have to look beyond named entity recognition techniques and draw on:
concept extraction: I have done work in this area;
sentiment analysis: I have not done anything here, but I can see how to apply the techniques;
and possibly other areas too, but one step at a time!
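The difference between the Pioa and Homa cases can be sketched with a lookup against a list of valid genera plus a fuzzy suggestion. Everything here is illustrative: `VALID_GENERA` is a toy stand-in for a real checklist (it is not the Catalogue of Life or its API), and `check_genus` and the 0.7 cutoff are assumptions. The fuzzy step uses `difflib.get_close_matches` from the Python standard library.

```python
from difflib import get_close_matches
from typing import List, Tuple

# Hypothetical mini-checklist; a real one would hold very many names.
VALID_GENERA = {"Tachys", "Pica", "Homo", "Homa", "Pieris", "Bacillus"}


def check_genus(name: str, valid=VALID_GENERA, cutoff: float = 0.7) -> Tuple[bool, List[str]]:
    """Return (is_valid, suggestions). Suggestions are only offered for unknown names."""
    if name in valid:
        return True, []
    return False, get_close_matches(name, sorted(valid), n=3, cutoff=cutoff)


if __name__ == "__main__":
    for candidate in ["Tachgs", "Pioa", "Homa", "Pieris"]:
        print(candidate, check_genus(candidate))
```

Tachgs and Pioa are flagged and matched to a near neighbour, but Homa passes as valid and Pieris says nothing about butterfly versus plant, so a lookup of this kind cannot replace the contextual evidence discussed above.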
