Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing
1. Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing. Chris Freeland, Technical Director, Biodiversity Heritage Library. BioSystematics Berlin 2011, 22 Feb 2011. http://biodiversitylibrary.org/page/33061402
6. OCR is a *BIG* challenge. All book / literature digitization projects are affected, not just BHL. Especially problematic in BHL: more than 50 languages represented; dates of publication from the 1400s to the 2000s; irregular typeface / typesetting; multiple languages on one page; botanical descriptions in Latin.
7. Abbildungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoiam k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.
8. [Example of raw OCR output from a scanned page; the recognized text is illegible character noise.]
9. 2007 Name Finding Study: 35.16% (>35%) OCR error rate for names only. Of the 3,003 names, 1,056 were incorrectly transcribed by OCR. Top OCR errors. Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
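The error rate quoted on this slide follows directly from the two counts it gives. A minimal check, using only the figures stated above (the variable names are illustrative):

```python
# Counts quoted on the slide (Wei et al., 2008).
total_names = 3003      # names examined in the study
misread_names = 1056    # names incorrectly transcribed by OCR

# Fraction of names that OCR got wrong.
error_rate = misread_names / total_names
print(f"OCR error rate for names: {error_rate:.2%}")
```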
10. Manual techniques for text correction: WikiSource; Trove (National Library of Australia).
12. Goal: semi-automated text correction. OCR + Machine Learning + Users. Let machines do raw processing. Develop algorithms for natural language processing & machine learning. Build a community of (human) users to help; reCAPTCHA as an example. Why not just use reCAPTCHA? Google bought it. *More work needed here*
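One way to picture the split between machine processing and human review is a triage step: the machine accepts exact matches, auto-corrects near-certain ones, and routes ambiguous strings to users. The sketch below is only an illustration of that idea, not BHL's method. It uses fuzzy string matching from Python's standard `difflib` in place of a trained model, and the reference list `KNOWN_NAMES`, the function `triage`, and both cutoffs are hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical reference list; a real system would use a large name index.
KNOWN_NAMES = ["Salmo trutta", "Cyprinus carpio", "Barbus barbus"]

def triage(ocr_name, known=KNOWN_NAMES, auto_cutoff=0.9, review_cutoff=0.6):
    """Return (status, suggestion) for one OCR'd name string.

    Both cutoffs are assumed values chosen for illustration only.
    """
    if ocr_name in known:
        return ("ok", ocr_name)
    # Best fuzzy match at or above the review cutoff, if any.
    match = get_close_matches(ocr_name, known, n=1, cutoff=review_cutoff)
    if not match:
        return ("unmatched", None)
    score = SequenceMatcher(None, ocr_name, match[0]).ratio()
    # High similarity: correct automatically. Otherwise ask a human.
    status = "auto" if score >= auto_cutoff else "review"
    return (status, match[0])

for raw in ["Salmo trutta", "Cyprinus carpi0", "Sahno trutta", "xqzv"]:
    print(raw, "->", triage(raw))
```

Only the strings tagged `review` would need to reach the user community, which is the point of combining machine and human effort.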
14. Name finding in action with uBio's TaxonFinder… Image from scanner; converted to text via OCR; name finding via TaxonFinder (TaxonFinder API response); extract names; submit to NameBank.
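The flow on this slide (OCR text in, candidate names out, then a lookup against a name index) can be sketched in a few lines. This is a deliberately naive stand-in and not TaxonFinder's algorithm or API: it uses a regular expression for "capitalized word plus lowercase word", and a small hypothetical set `name_index` plays the role of NameBank.

```python
import re

# Naive pattern (an assumption for illustration): a capitalized genus
# followed by a lowercase species epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

def find_candidate_names(ocr_text):
    """Extract strings that merely look like Latin binomials."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]

def resolve(candidates, name_index):
    """Keep only candidates present in the (hypothetical) name index."""
    return [name for name in candidates if name in name_index]

# Text as it might come out of the OCR step.
page_text = "The carp Cyprinus carpio occurs with Salmo trutta."
name_index = {"Cyprinus carpio", "Salmo trutta"}  # stand-in for NameBank

candidates = find_candidate_names(page_text)
print(candidates)                       # includes the false positive "The carp"
print(resolve(candidates, name_index))  # the index lookup filters it out
```

The pattern alone over-matches ordinary capitalized phrases, which is why the extraction step is followed by a lookup against a curated index.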
28. CiteBank: http://citebank.org. New search index to BHL content. Platform for journals / publishers / societies in need of tools to store & share their digitized content. Access to “crowdsourced” articles from BHL scans.
34. Crowdsourcing Statistics & Analysis. Analysis: http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html. At that time, more than 80% of the PDFs created had metadata attached by users. More than 50% contributed accurate article-level information. New analysis over more data this summer / fall. Now have more than 58,000 PDFs to analyze.
35. Open Data = More Use. Scholars: Rod Page (iPhylo, BioGUID, BioStor); Ryan Schenk. Other apps: EarthCape; ZipcodeZoo.
36. Conclusion. BHL is a massive dataset useful for multidisciplinary research: systematics, natural language processing, humanities. BHL is open: free to use at http://biodiversitylibrary.org; open access data for scholarly use & reuse. BHL has APIs and data exports to enable reuse. BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others).
37. Questions? Chris Freeland, Technical Director, Biodiversity Heritage Library; Director, Center for Biodiversity Informatics, Missouri Botanical Garden, 4344 Shaw Blvd., St. Louis, MO 63110 USA. Email: chris.freeland@mobot.org. Twitter: @chrisfreeland. Blog / info: chrisfreeland.com