Wrinkles in the data
Dr David King
BL Labs Text Mining Workshop
Thursday 27th November 2014
So far, so conventional
Standard OCR issues
Attelabus
Attelans Attelabüs
Attelabw Attelabtis
… with a twist
æquinoctialis
aequinoctialis
cequiQioctialis
asquinoctialis
sequinoctialis
cequinoctialis
wquinoctialis
… somewhat better
Otiorhynchinæ → Otiorhynchinœ
Alatæ → Alatse
Curculionidæ → Curculionidœ
Attelabinæ → Attelabinae
Pterocolinæ → Pterocolinœ
Allocoryninæ → Allocoryninee
Apioninæ → Apioninœ
Thecesterninæ → Thecesterninae
Otiorhynchinæ → Otiorhynchinre
Attelabinæ → Attelabinae
Pterocolinæ → Pterocolinse
Allocoryninæ → Allocoryninse
Thecesterninæ → Thecesterninse
Apioninae → Apioninae
Otiorhynchinæ → Otiorhynchinae
Otiorhynchinæ → Otiorhynchinae
Curculionidæ → Curculionidœ
Curculioninæ → Curculioninae
Calandrinæ → Calandrinse
Brenthidæ → Brenthidae
Scolytidæ → Scolytidae
Anthribidæ → Anthribidae
Hispidæ → Hispida?
Cassididæ → Cassididae
24 æ rendered as:
11 ae
5 œ
5 se
1 ee
1 re
1 a?
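
The tally above can be reproduced from the pairs listed on this slide. The following is a minimal sketch, assuming the pairs are transcribed by hand as (printed, OCR) tuples. The names `rendering` and `normalise` are hypothetical helpers, not tools from the talk, and the suffix repair is only a heuristic for family (-idae) and sub-family (-inae) endings.

```python
import re
from collections import Counter

# (printed name, OCR output) pairs transcribed from the slide above.
PAIRS = [
    ("Otiorhynchinæ", "Otiorhynchinœ"), ("Alatæ", "Alatse"),
    ("Curculionidæ", "Curculionidœ"), ("Attelabinæ", "Attelabinae"),
    ("Pterocolinæ", "Pterocolinœ"), ("Allocoryninæ", "Allocoryninee"),
    ("Apioninæ", "Apioninœ"), ("Thecesterninæ", "Thecesterninae"),
    ("Otiorhynchinæ", "Otiorhynchinre"), ("Attelabinæ", "Attelabinae"),
    ("Pterocolinæ", "Pterocolinse"), ("Allocoryninæ", "Allocoryninse"),
    ("Thecesterninæ", "Thecesterninse"), ("Apioninae", "Apioninae"),
    ("Otiorhynchinæ", "Otiorhynchinae"), ("Otiorhynchinæ", "Otiorhynchinae"),
    ("Curculionidæ", "Curculionidœ"), ("Curculioninæ", "Curculioninae"),
    ("Calandrinæ", "Calandrinse"), ("Brenthidæ", "Brenthidae"),
    ("Scolytidæ", "Scolytidae"), ("Anthribidæ", "Anthribidae"),
    ("Hispidæ", "Hispida?"), ("Cassididæ", "Cassididae"),
]


def rendering(printed: str, ocr: str) -> str:
    """Return whatever the OCR produced in place of the final 'æ' (or 'ae')."""
    stem = printed[:-1] if printed.endswith("æ") else printed[:-2]
    # Assumption: the OCR read the stem correctly, so the remainder is the rendering.
    return ocr[len(stem):]


tally = Counter(rendering(printed, ocr) for printed, ocr in PAIRS)

# Heuristic repair limited to -idae / -inae endings. It would also rewrite any
# ordinary word that happens to end the same way, so it needs a name filter first.
SUFFIX = re.compile(r"(i[dn])(?:æ|œ|se|ee|re|a\?)$")


def normalise(name: str) -> str:
    """Map the mis-rendered ligature endings seen on the slide back to 'ae'."""
    return SUFFIX.sub(r"\1ae", name)


print(tally.most_common())
print(normalise("Otiorhynchinre"))
```

Note that `normalise` leaves "Alatse" untouched, because that name does not carry a family or sub-family ending; a broader rule would catch it at the cost of more false positives.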
… depending where you are
Sciurus
70 times in body text: 68 correct
7 times as isolated word: 1 correct
… with a twist
Terms Levenshtein
denticulate → denticulata, denticulatæ 1: 0,0,1
denticulate → bidenticulate 2: 4,0,0
denticulate → reticulate 2: 3,2,0
denticulate → geniculate 2: 2,2,0
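
The whole-word scores in the table above (the figure before each colon) are ordinary Levenshtein distances. As a sketch, the standard dynamic-programming form below reproduces them. The per-segment figures after the colon depend on a segmentation that the slides do not specify, so they are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


for other in ("denticulata", "denticulatæ", "bidenticulate", "reticulate", "geniculate"):
    print(other, levenshtein("denticulate", other))
```

The unrelated words "reticulate" and "geniculate" score the same as the genuinely related "bidenticulate", which is the clustering problem the slide illustrates.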
… and of course
Tachgs → Tachys
Pioa → Pica
Homa → Homo
Pieris → Pieris
Thank you!
And any questions…
Dr David King
Research Fellow,
Computing and Communications Department,
Faculty of Mathematics, Computing and Technology,
The Open University, MK7 6AA, UK.
mailto: david.king@open.ac.uk
webpage: http://www9.open.ac.uk/mct/people/david.king
tel: +44 (0)1908 652695


Editor's notes

  1. This is a brief run-through of some of the issues that have arisen in my work on legacy biodiversity literature over the last five years. The literature is important to today’s researchers because knowing where wild relatives of modern crops might live can inform new field work seeking to enhance the genetic composition of those crops. Similarly, the old literature can provide insights into the habitats and climates of a hundred years ago and more, giving modern climate researchers an extended baseline of data over which to track changes.
  2. Here’s a screenshot of Manchester’s brat, which I used to mark up some biodiversity literature at a detailed level. Why reinvent the wheel when there are good tools already? One issue not so obvious from here is that names change. Hence, we have to normalise, just as you do with place names that change through time. An added wrinkle of historic biodiversity data is that developing tools to identify taxonomic names in modern, born-digital literature is one thing, helped by the presence of internationally recognised naming conventions; the really old literature pre-dates the conventions, and the old scientists could be really creative in their taxonomic names! Note that in line 00, Dendrececa is an OCR error for Dendreœca. See slide 4 for more about this wrinkle.
  3. We have the conventional issues of identifying names in text, complicated by the absence of a gazetteer to act as a master reference against which to verify our findings. We’re into fuzzy matching and other techniques to address these issues. We at the OU have produced curation tools to help resolve candidate names, but our ideal would be to automate as much as possible, simply because of the volume of text we want to work with. Two post-docs at Plazi marked up 2,500 pages of Hymenoptera-related literature over a year using their in-house tool. That rate is not tractable when dealing with potentially millions of pages.
  4. The twist is that in these examples we will always have an error. This example is a document from the Smithsonian, Washington, which outsourced its digitisation to the Internet Archive, which in turn OCRed the texts using a standard North American English dictionary and character set. That character set lacks the æ ligature, so we know that every occurrence of æ in the text will be rendered incorrectly. The same issue causes a real headache when mining biological texts, through the mis-recognition of ♂ and ♀.
  5. And even if the document is OCRed using a more appropriate character set, as in this example from the Natural History Museum, London, you have no guarantee that the results are much better! This example is taken from an introductory page in a book about beetles. It lists the families and sub-families to be discussed in the book. Nowadays, the taxonomic rank of family is identified in the name by the ending -idae, and sub-family by -inae. This old text, however, uses the ligature æ for ae. Of the 24 names on this page, all 24 æ ligatures are rendered incorrectly. For completeness, there was one valid use of œ in the source page, and it was rendered correctly by the OCR.
  6. Investigating OCR accuracy further, I examined a text about Central American squirrels of the genus Sciurus. The word was not in the OCR engine’s dictionary, yet the accuracy rate was very high when the word appeared in body text, because there were enough accurately identified words around it for the engine to identify individual characters confidently. Isolated words, usually set in a different font from the body text, destroyed the accuracy. The headline from this work seems to be that while humans can read tables of contents and page headers to navigate accurately around a document, computers really need to process the whole text to build up a meaningful index to the content.
  7. However, we cannot simply group words and use them to determine a text’s content. Take these examples based on ‘denticulate’, which has 11 characters. The first example is of simple OCR errors: the terminal ‘e’ is mis-recognised as an ‘a’ or ‘æ’, producing an edit distance score of one. The second example is correct OCR of a similar but distinct word, with an edit distance score of two. The next two examples show words that are not connected. It is happenstance that their common ending suggests they could be clustered on the basis of their distance score of two. Given these issues in grouping, in this case arising from the peculiarities of taxonomic names, some time ago I began to explore applying Levenshtein distance to parts of words rather than the whole. This is one of many ideas I would love to revisit, despite the pretty crazy results you can see here when breaking the words into five-character segments: beginning, middle and end. This issue relates back to the earlier issue of taxonomic name similarity. See slide 5 with its list of family and sub-family names. Sometimes the same root is used for both taxonomic levels, for example Curculionidæ and Curculioninæ, yet they are distinct terms.
  8. And to conclude, a few more wrinkles to think about: (1) the OCR is correct, and has accurately rendered a typo in the original; (2) there is no genus Pioa, so by reference to, say, the Catalogue of Life we know it is wrong; (3) but here man has become insect: both Homo and its mis-OCRed variant Homa are valid genera, so it is more difficult to pick up the mistake; (4) but which one is the butterfly and which the heather? And should I mention that other favourite overloaded name, Bacillus: a genus of stick insect, or are we talking about bacteria? Context can assist, whether through collocation or other methods. There are many examples I could cite, such as the cichlids of Lake George: putting Lake George into a gazetteer, I got ten townships in the US and Canada all called Lake George, but I did not get the lake in Uganda and its associated metadata. Let’s talk semantic web, shall we? There are so many wonderfully challenging edge cases to explore! Therefore, I suggest that the wrinkles mean we have to look beyond named entity recognition techniques and draw on concept extraction (I have done work in this area), sentiment analysis (I have not done anything here, but I can see how to apply the techniques) and possibly other areas too, but one step at a time!
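
The reference-list check described in notes 3 and 8 can be sketched as follows. This is an illustration only: `KNOWN_GENERA` is a tiny hypothetical stand-in for a real reference such as the Catalogue of Life, `check_genus` is an assumed helper name, and the fuzzy matching uses Python's standard `difflib` rather than any tool from the talk.

```python
from difflib import get_close_matches

# Hypothetical stand-in for a real reference list of valid genus names.
KNOWN_GENERA = ["Bacillus", "Homa", "Homo", "Pica", "Pieris", "Tachys"]


def check_genus(candidate: str, cutoff: float = 0.7):
    """Return ('known', []) or ('unknown', close matches from the reference list)."""
    if candidate in KNOWN_GENERA:
        return "known", []
    # Fuzzy lookup: names whose similarity ratio to the candidate is >= cutoff.
    return "unknown", get_close_matches(candidate, KNOWN_GENERA, n=3, cutoff=cutoff)


for name in ("Tachgs", "Pioa", "Homa", "Pieris"):
    print(name, check_genus(name))
```

"Pioa" and "Tachgs" are flagged and a correction is suggested, but "Homa" passes as a known genus. A lookup alone cannot catch that mistake, which is why context is needed.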