Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing Chris Freeland Technical Director, Biodiversity Heritage Library BioSystematics Berlin 2011 22 Feb 2011 http://biodiversitylibrary.org/page/33061402
Digitization http://biodiversitylibrary.org/page/6165462
Workflow Conservation Digitization Selection Preparation Post Production (Re)publication
Scanning Derivatives Files are stored & sync’d across BHL clusters Master Derivatives XML JP2 PDF JPG TXT DjVu Storage PDF OCR JP2 XML
Optical Character Recognition (OCR) http://biodiversitylibrary.org/page/2836705
OCR is a *BIG* challenge All book / literature digitization projects affected, not just BHL Especially problematic in BHL More than 50 languages represented in BHL Dates of publication from 1400’s to 2000’s Irregular typeface / typesetting Multiple languages on one page Botanical descriptions in Latin
Abbildungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoiam k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.
*E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cixbIa� S &3rn~ 41X a�mcv(f b1air�'o�et ertoiensr�; �', :�hlrfc�cwa ff�4am.diug bist a 6aiw~s ff oJrJtwtnof bL4ecImt& blfaframembt wag `wr 4 cnwiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tifvrmrWaff C * t6rmnli an `tn�ciblatGteaMw ?ffoaifrn w4wmeu nu weibe , wpiteI voE5teiri ct cobergtUcr cit cm` 91 cLibiar J ' >bSciatl�Oiff ;Bruetwacfttcnqmcx b1a bl: bt5c lttmtt bb9 lkrw.llr#eitincnxoa ff cu :rtrtuft *et� B Rn "�trv W1Rt' ?Cm cblaswaIwutrOber�citi 1V Ces ' wt gbtiemwwajfutpctt, afferain 9 c: b�titbfof�rferanmrs bra wlg auig4;f aer�m *mc vrtblatcabtfmwfruan'deg~mrtblasIaumbwWt� run fncmai b14ianf tJobrrfan ebrut4net vnberBrwtOberawawi*m.crriiibtafwfmuwwc on$ 'it ttuwttkc 5,10 $ m~Cfcatrc* cxu W�e�&mcyfbq4 Mabttmmwrc a iiubcJcnncI.end.*, blat s. au:�rprd3 rw4ftf wm c ii,+ ttCCtnwa frr9fr orfabfcfbtenbcoptitibt -r9 ceDattDcn i34M snSemi
2007 Name Finding Study 35.16% >35% OCR error rate for names only Of the 3,003 names, 1,056 were incorrectly transcribed by OCR. Top OCR errors Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
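The 35.16% figure follows directly from the two counts on the slide. A quick check, with illustrative variable names that do not come from the study:

```python
# Counts as reported on the slide (Wei et al. 2008).
total_names = 3003   # names examined
ocr_errors = 1056    # names incorrectly transcribed by OCR

# Share of names that OCR got wrong, as a percentage.
error_rate = ocr_errors / total_names * 100
print(f"OCR error rate for names: {error_rate:.2f}%")
```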
WikiSource Trove - National Library of Australia Manual techniques for text correction
WikiSource Example http://biostor.org/wiki/Page:Spixiana1999zool.djvu/293
Goal: Semi-automated text correction OCR + Machine Learning + Users Let machines do raw processing Develop algorithms for natural language processing & machine learning Build a community of (human) users to help reCAPTCHA as an example Why not just use reCAPTCHA? Google bought it *More work needed here*
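One way to picture the "machines first, humans second" split is a fuzzy lookup against a list of known names, with anything unmatched handed to a person. This is only a sketch under stated assumptions: the lexicon, the function name and the 0.8 cutoff are hypothetical, and simple string similarity stands in for the machine learning the slide calls for. It is not BHL's method.

```python
from difflib import get_close_matches

# Hypothetical lexicon of trusted names; a real one would be far larger.
LEXICON = ["Quercus alba", "Pinus strobus", "Salmo trutta"]

def suggest_correction(token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough.

    A None result means the token should be queued for human review.
    """
    matches = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# An OCR slip ("l" read as "1") is close enough to correct automatically.
print(suggest_correction("Quercus a1ba"))
# A token with no near neighbour is left for a human.
print(suggest_correction("xyzzy"))
```

The cutoff controls the trade-off: raise it and more tokens go to users, lower it and the machine makes more (possibly wrong) corrections on its own.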
Scientific names mapping  http://biodiversitylibrary.org/page/27782237
TaxonFinder API response Name finding via TaxonFinder Extract names Submit to NameBank Image from Scanner Converted to text via OCR Name Finding in action with uBio’s TaxonFinder…
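The extract-then-verify flow can be sketched as a toy: pull candidate binomials out of OCR text with a pattern, then keep only those a name authority recognizes. The slides do not describe how TaxonFinder works internally, so the regular expression, the function name and the in-memory set standing in for NameBank are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a name authority such as NameBank.
KNOWN_NAMES = {"Salmo trutta", "Quercus alba"}

# Candidate binomial: capitalized genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

def find_names(ocr_text, known=KNOWN_NAMES):
    """Return candidate names from the text that the authority recognizes."""
    candidates = [" ".join(pair) for pair in BINOMIAL.findall(ocr_text)]
    return [name for name in candidates if name in known]

page_text = "The brown trout Salmo trutta was found near Quercus alba stands."
print(find_names(page_text))
```

The pattern alone over-matches ("The brown" is a candidate too), which is why the verification step matters. It also shows why OCR errors are costly: a misread name never reaches the lookup in a recognizable form.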
Crowdsourcing http://biodiversitylibrary.org/page/20965795
CiteBank:   http://citebank.org New search index to BHL content Platform for journals/publishers/societies in need of tools to store & share their digitized content Access to “crowdsourced” articles from BHL scans
Crowdsourcing Statistics & Analysis Analysis http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html At that time, more than 80% of the PDFs created had metadata attached by users More than 50% contributed accurate article-level information New analysis over more data this summer / fall Now have more than 58,000 PDFs to analyze
Open Data = More Use Scholars Rod Page iPhylo BioGUID BioStor Ryan Schenk Other Apps EarthCape ZipcodeZoo
Conclusion BHL is a massive dataset useful for multidisciplinary research Systematics Natural Language Processing Humanities BHL is open Free to use at http://biodiversitylibrary.org Open access data for scholarly use & reuse BHL has APIs and data exports to enable reuse BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others)
Questions? Chris Freeland Technical Director, Biodiversity Heritage Library Director, Center for Biodiversity Informatics, Missouri Botanical Garden Missouri Botanical Garden 4344 Shaw Blvd. St. Louis, MO 63110 USA Email: chris.freeland@mobot.org Twitter: @chrisfreeland Blog / info: chrisfreeland.com BioSystematics Berlin 2011 22 Feb 2011

Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing

  • 1. Digitization and enhancement of biodiversity literature through OCR, scientific names mapping andcrowdsourcing Chris Freeland Technical Director, Biodiversity Heritage Library BioSystematics Berlin 2011 22 Feb 2011 http://biodiversitylibrary.org/page/33061402
  • 3. Workflow Conservation Digitization Selection Preparation Post Production (Re)publication
  • 4. Scanning Derivatives Files are stored & sync’d across BHL clusters Master Derivatives XML JP2 PDF JPG TXT DJVu Storage PDF OCR JP2 XML
  • 5. Optical Character Recognition (OCR) http://biodiversitylibrary.org/page/2836705
  • 6. OCR is a *BIG* challenge All book / literature digitization projects affected, not just BHL Especially problematic in BHL More than 50 languages represented in BHL Dates of publication from 1400’s to 2000’s Irregular typeface / typesetting Multiple languages on one page Botanical descriptions in Latin
  • 7. Abbildungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoiam k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.
  • 8. *E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cixbIa� S &3rn~ 41X a�mcv(f b1air�'o�et ertoiensr�; �', :�hlrfc�cwa ff�4am.diug bist a 6aiw~s ff oJrJtwtnof bL4ecImt& blfaframembt wag `wr 4 cnwiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tifvrmrWaff C * t6rmnli an `tn�ciblatGteaMw ?ffoaifrn w4wmeu nu weibe , wpiteI voE5teiri ct cobergtUcr cit cm` 91 cLibiar J ' >bSciatl�Oiff ;Bruetwacfttcnqmcx b1a bl: bt5c lttmtt bb9 lkrw.llr#eitincnxoa ff cu :rtrtuft *et� B Rn "�trv W1Rt' ?Cm cblaswaIwutrOber�citi 1V Ces ' wt gbtiemwwajfutpctt, afferain 9 c: b�titbfof�rferanmrs bra wlg auig4;f aer�m *mc vrtblatcabtfmwfruan'deg~mrtblasIaumbwWt� run fncmai b14ianf tJobrrfan ebrut4net vnberBrwtOberawawi*m.crriiibtafwfmuwwc on$ 'it ttuwttkc 5,10 $ m~Cfcatrc* cxu W�e�&mcyfbq4 Mabttmmwrc a iiubcJcnncI.end.*, blat s. au:�rprd3 rw4ftf wm c ii,+ ttCCtnwa frr9fr orfabfcfbtenbcoptitibt -r9 ceDattDcn i34M snSemi
  • 9. 2007 Name Finding Study: 35.16% (>35%) OCR error rate for names only. Of the 3,003 names, 1,056 were incorrectly transcribed by OCR. Top OCR errors. Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008. http://www.tdwg.org/proceedings/article/view/380
  • 10. Manual techniques for text correction: WikiSource; Trove - National Library of Australia.
  • 12. Goal: Semi-automated text correction. OCR + Machine Learning + Users. Let machines do raw processing. Develop algorithms for natural language processing & machine learning. Build a community of (human) users to help. reCAPTCHA as an example. Why not just use reCAPTCHA? Google bought it. *More work needed here*
  • 13. Scientific names mapping http://biodiversitylibrary.org/page/27782237
  • 14. TaxonFinder API response. Name finding via TaxonFinder. Extract names. Submit to NameBank. Image from Scanner. Converted to text via OCR. Name Finding in action with uBio’s TaxonFinder…
  • 28. CiteBank: http://citebank.org. New search index to BHL content. Platform for journals/publishers/societies in need of tools to store & share their digitized content. Access to “crowdsourced” articles from BHL scans.
  • 34. Crowdsourcing Statistics & Analysis. Analysis: http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html At that time, more than 80% of the PDFs created had metadata attached by users. More than 50% contributed accurate article-level information. New analysis over more data this summer / fall. Now have more than 58,000 PDFs to analyze.
  • 35. Open Data = More Use. Scholars: Rod Page (iPhylo, BioGUID, BioStor), Ryan Schenk. Other Apps: EarthCape, ZipcodeZoo.
  • 36. Conclusion. BHL is a massive dataset useful for multidisciplinary research: Systematics, Natural Language Processing, Humanities. BHL is open: free to use at http://biodiversitylibrary.org, with open access data for scholarly use & reuse. BHL has APIs and data exports to enable reuse. BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others).
  • 37. Questions? Chris Freeland, Technical Director, Biodiversity Heritage Library; Director, Center for Biodiversity Informatics, Missouri Botanical Garden. Missouri Botanical Garden, 4344 Shaw Blvd., St. Louis, MO 63110 USA. Email: chris.freeland@mobot.org Twitter: @chrisfreeland Blog / info: chrisfreeland.com
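
The error rate quoted on slide 9 and the "let machines do raw processing" idea on slide 12 can be illustrated with a small sketch. The counts come from slide 9. Everything else is an assumption made for illustration: the reference list `KNOWN_NAMES`, the helper `suggest_name`, and the similarity cutoff are hypothetical, and the fuzzy matching uses Python's standard `difflib`. This is not how TaxonFinder or NameBank work internally.

```python
import difflib

# Counts reported on slide 9 (Wei et al. 2008).
total_names = 3003
bad_names = 1056
error_rate = bad_names / total_names
print(f"OCR error rate for names: {error_rate:.2%}")

# Hypothetical reference list; a real system would query a name service.
KNOWN_NAMES = ["Quercus alba", "Salmo trutta", "Homo sapiens"]


def suggest_name(ocr_string, known=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name to an OCR string, or None.

    Candidates scoring below `cutoff` (an assumed threshold) are rejected,
    so the string would be left for a human reviewer.
    """
    matches = difflib.get_close_matches(ocr_string, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# An OCR output with "l" misread as "1" still maps to a known name.
print(suggest_name("Quercus a1ba"))
# A string with no close candidate is left unresolved.
print(suggest_name("zzzz"))
```

Strings that come back unresolved are the ones a community of human correctors would review, which matches the "OCR + Machine Learning + Users" split proposed on slide 12.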