8. … and of course
Tachgs Tachys
Pioa Pica
Homa Homo
Pieris Pieris
9. Thank you!
And any questions…
Dr David King
Research Fellow,
Computing and Communications Department,
Faculty of Mathematics, Computing and Technology,
The Open University, MK7 6AA, UK.
mailto: david.king@open.ac.uk
webpage: http://www9.open.ac.uk/mct/people/david.king
tel: +44 (0)1908 652695
Editor's notes
A brief run-through of some of the issues that have arisen in my work on legacy biodiversity literature over the last five years.
The literature is important to today’s researchers because knowing where wild relatives of modern crops might live can inform new field work seeking to enhance the genetic composition of those crops. Similarly, the old literature can provide insights into the habitats and climates of a hundred years ago and more, giving modern climate researchers an extended baseline of data over which to track changes.
Here’s a screenshot of Manchester’s brat, which I used to mark up some biodiversity literature at a detailed level.
Why re-invent the wheel when there are good tools already?
One issue not so obvious from here is that names change.
Hence, we have to normalise, just as you do with place names that change through time, for example.
However, historic biodiversity data adds a wrinkle. Developing tools to identify taxonomic names in modern, born-digital literature is one thing, helped by the presence of internationally recognised naming conventions. The really old literature pre-dates those conventions, and the old scientists could be really creative in their taxonomic names!
Note in line 00, the Dendrececa is an OCR error for Dendreœca. See slide 4 for more about this wrinkle.
We have the conventional issues of identifying names in text, complicated by the absence of a gazetteer to act as a master reference to verify our findings.
We’re into fuzzy matching and other techniques to address these issues.
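As a minimal sketch of what fuzzy matching can look like, the Python below uses the standard library's `difflib.get_close_matches` to propose corrections for a suspect token. The reference list `KNOWN_GENERA`, the helper `candidate_names` and the 0.7 cutoff are assumptions made for illustration, not part of our tools; in practice the difficulty is exactly that no complete master list exists.

```python
from difflib import get_close_matches

# Hypothetical reference list for illustration only; a real one would have
# to be drawn from a taxonomic catalogue, and would still be incomplete.
KNOWN_GENERA = ["Tachys", "Pica", "Homo", "Pieris", "Sciurus"]


def candidate_names(token, known=KNOWN_GENERA, cutoff=0.7):
    """Return up to three known names that resemble `token`, best match first.

    `cutoff` is the minimum similarity ratio (0 to 1) a name must reach;
    0.7 is an arbitrary choice for this sketch.
    """
    return get_close_matches(token, known, n=3, cutoff=cutoff)


print(candidate_names("Tachgs"))  # ['Tachys']
print(candidate_names("Pioa"))    # ['Pica']
```

A ranked list of candidates like this still needs a curator, or some further evidence, to choose between them.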
We at the OU have produced curation tools to help resolve candidate names, but our ideal would be to automate as much as possible simply because of the volume of text we want to work with.
Two post-docs at Plazi marked up 2,500 pages of Hymenoptera-related literature over a year using their in-house tool. That’s not tractable when dealing with potentially millions of pages.
The twist is that in these examples we will always have an error.
This is an example of a document from the Smithsonian, Washington, which outsourced its digitisation to the Internet Archive. The Internet Archive OCRed the texts using a standard North American English dictionary… and character set, which means it lacks the æ ligature.
Hence, we know that every occurrence of the æ ligature in the text will be rendered incorrectly.
The same issue causes a real headache when mining biological texts with the mis-recognition of ♂ and ♀.
And even if the document is OCRed using a more appropriate character set, as in this example from the Natural History Museum, London, you have no guarantee that the results are much better!
This example is taken from an introductory page in a book about beetles. It lists the families and sub-families to be discussed in the book. Nowadays, the taxonomic rank family is identified in the name by ending with -idae and sub-family with -inae. This old text, however, uses the ligature æ for ae. Of the 24 names on this page, all 24 æ ligatures are rendered incorrectly by the OCR.
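To make the suffix convention concrete, here is a small Python sketch that expands the ligatures and then reads the rank from the ending of the name. The function names and the `LIGATURES` table are my own for illustration, and the rule only covers the -idae and -inae endings mentioned above; it assumes the ligature has survived OCR, which, as this page shows, is often not the case.

```python
# Ligatures found in older taxonomic texts, mapped to their modern spellings.
# Capital forms are expanded to title case, an assumption of this sketch.
LIGATURES = {"æ": "ae", "Æ": "Ae", "œ": "oe", "Œ": "Oe"}


def expand_ligatures(name):
    """Replace each ligature in `name` with its two-letter equivalent."""
    for ligature, plain in LIGATURES.items():
        name = name.replace(ligature, plain)
    return name


def rank_from_suffix(name):
    """Guess the rank from the ending: -idae family, -inae sub-family.

    Returns None when neither ending is present.
    """
    plain = expand_ligatures(name).lower()
    if plain.endswith("idae"):
        return "family"
    if plain.endswith("inae"):
        return "sub-family"
    return None


print(expand_ligatures("Curculionidæ"))  # Curculionidae
print(rank_from_suffix("Curculionidæ"))  # family
print(rank_from_suffix("Curculioninæ"))  # sub-family
```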
For completeness, there was one valid use of œ in the source page, and it was rendered correctly by the OCR.
Investigating OCR accuracy further, I examined a text about Central American squirrels of the genus Sciurus. The word Sciurus was not in the OCR engine’s dictionary, yet the accuracy rate was very high when the word was located in body text, because there were enough accurately identified words around it for the engine to identify individual characters confidently. Isolated words, usually set in a different font to the body text, however, destroyed the accuracy.
The headline from this work seems to be that while humans can read tables of contents and page headers to accurately navigate around a document, computers really need to process the whole text to build up a meaningful index to navigate the content.
However, we cannot simply group words and use them to determine a text’s content.
Examples, from ‘denticulate’, which has 11 characters:
The first example is of simple OCR errors: the terminal ‘e’ is mis-recognised as an ‘a’ or ‘æ’, producing an edit distance score of one.
The second example is of correct OCR of a similar but distinct word, with an edit distance score of two.
The next two examples show words that are not connected. It is happenstance that their common ending suggests they could be clustered based on their distance score of two.
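For anyone unfamiliar with the edit distance scores quoted here, this is a minimal Python sketch of Levenshtein distance, in which each insertion, deletion or substitution of a character costs one. It is the textbook dynamic-programming version, not necessarily the implementation behind the figures on the slide, and the variant ‘denticulata’ is simply the ‘e’ to ‘a’ error described above.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("denticulate", "denticulata"))  # 1
print(levenshtein("denticulate", "denticulatæ"))  # 1
```

The score says nothing about why two words are close, which is why an OCR error and a genuinely different word can look identical to a clustering step.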
Given these issues in grouping, in this case arising from the peculiarities of taxonomic names, some time ago I began to explore applying Levenshtein distance to parts of words rather than to the whole. This is one of many ideas I would love to revisit, despite the pretty crazy results you can see here when breaking the words into five-character segments: beginning, middle and end.
This issue relates back to the earlier issue of taxonomic name similarity. See slide 5 with its list of family and sub-family names. Sometimes the same root is used for both taxonomic levels, Curculionidæ and Curculioninæ for example, yet they are distinct terms.
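The segment idea can be sketched roughly as follows in Python. How the middle segment is chosen is my assumption here (five characters centred on the word), as are the function names; the original experiments may have split the words differently.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions cost one each."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def segments(word, size=5):
    """Return the beginning, middle and end segments of `word`.

    The middle segment is centred on the word (an assumption of this
    sketch); segments overlap when the word is shorter than 3 * size.
    """
    start = max(0, (len(word) - size) // 2)
    return word[:size], word[start:start + size], word[-size:]


def segment_distances(a, b, size=5):
    """Levenshtein distance for each of the three segment pairs."""
    return tuple(levenshtein(x, y)
                 for x, y in zip(segments(a, size), segments(b, size)))


print(segments("Curculionidæ"))                           # ('Curcu', 'culio', 'onidæ')
print(segment_distances("Curculionidæ", "Curculioninæ"))  # (0, 0, 1)
```

For this pair the whole-word distance of one is located entirely in the end segment, which is where the rank is encoded, so a difference there deserves more weight than the same score at the start of the word.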
And to conclude with a few more wrinkles to think about:
1 – the OCR is correct. It has accurately rendered a typo in the original.
2 – there is no genus Pioa, so by reference to, say, the Catalogue of Life we know it is wrong.
3 – but here man has become insect: both Homo and its mis-OCRed variant Homa are valid genera, so it is more difficult to pick up the mistake.
4 – but which one is the butterfly and which the heather? And should I mention that other favourite overloaded name, Bacillus: a genus of stick insect, or are we talking about bacteria?
Context can assist, whether through collocation or other methods. There are many examples I could cite, such as the cichlids of Lake George: putting Lake George into a gazetteer, I got ten townships in the US and Canada all called Lake George, but I did not get the lake in Uganda and its associated metadata. Let’s talk semantic web, shall we?
There are so many wonderfully challenging edge cases to explore!
Therefore, I suggest that the wrinkles mean we have to look beyond named entity recognition techniques and draw on:
concept extraction – I have done work in this area
sentiment analysis – not done anything here, but I can see how to apply the techniques
and possibly other areas too – but one step at a time!