SlideShare une entreprise Scribd logo
1  sur  11
BHL: Big Data, Big Challenges

               Chris Freeland               @chrisfreeland


        Founding Technical Director, BHL

   Sr. Director, University Academic Computing,
               Washington University
>100,000 books, > 39 million pages, > 70TB data
BHL Architecture: Window Seat Ed.
Access
                                                     Data
                      APIs          UI
                                                    Exports


Logic
          Geocoding




                                                                          Firewall
                                                       Data Transform
           Name                   BHL DB                  Utilities
          Finding


Storage

                             Images (JP2)              Internet Archive
                             PDF
                             Coordinate-based OCR
                             XML metadata
BHL Content: Structured Data
• Metadata from Library catalogues at
  title/volume level
 <titleInfo>
 <title>CaroliLinnaei ... Species plantarum :exhibentesplantas rite cognitas, ad genera
 relatas, cum differentiisspecificis, nominibustrivialibus, synonymisselectis, locisnatalibus,
 secundumsystemasexualedigestas...</title>
 </titleInfo>
 <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo>
 <name type="personal">
 <namePart>Linné, Carl von,</namePart>
 <namePart type="date">1707-1778</namePart>
 </name>
 <typeOfResource>text</typeOfResource>
 <genre authority="marcgt">book</genre>
 <originInfo>
 <place><placeTerm type="text">Holmiae :</placeTerm></place>
 <publisher>ImpensisLaurentiiSalvii,</publisher><dateIssued>1753.</dateIssued>
BHL Content: Unstructured Data
More than 39 million pages!



….of uncorrected OCR   
Abbild ungen und Beschreibungen
                   der

               Fische Syriens,
                    nebst
einer neuen Classification und Characteristik
           sämmtlicher Gattungen
                     der
                       i
             JOH. JAKOB HECKEL,
Inipectoi am k. k. Hof-Natur.-iUenkabinete in
    Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.




               STUTTGART.
  E. Schweizerbart' sehe Verlagshandlung,
                   1843.
*E.xvi�c�piteI von c. cXx.WptdvonfnrWmn
bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X
a�m cv(f b1air�'o�et ert oiensr �; �',
:�hlrfc�c wa ff�4am.diug bist a
6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem
b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck
wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra
tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM
w ?ffoaifrn w4wmeu nu weib e , wpiteI
voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J '
>bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl:
bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r
trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas
waIwutr Ober �ci ti 1V Ces ' wt
gbtiemwwajfu tpctt, afferain 9 c: b�titbfof
�r f eran m rs bra wlg auig4;f aer�m *mc vrt
blatcabtfm wfru an'deg~m rt blas Iaum
bwWt� run f ncmai b14ianf tJobrrfan
ebrut4net vnber Brwt Ober awawi*m.crriii
btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C
fca trc* cx u W�e�&mcyfbq4 Mabtt mmw
rc a iiu bc Jcn ncI.end.*, blat s. a u:�rprd3
rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt
enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
How to connect/consume
http://biodivlib.wikispaces.com/Developer+Tools+and+API


                                        Data
        APIs              UI
                                       Exports




                       BHL DB




                              Internet Archive
       Images (JP2)
       PDF
       Coordinate-based OCR
       XML metadata
BHL Data Challenge: Name Finding
            pre-2007
BHL Data Challenge: Name Finding
• TaxonFinder algorithm in production since
  2008
  – More than 100 million candidate name strings
  – More than 1.5 million unique, verified names
  – Available through UI, APIs, Data Exports & Internet
    Archive
• New collaboration with Global Names
  – Improved algorithm, better precision & recall
  – More data!
New Data Challenges
        http://biodivlib.wikispaces.com/BHL+and+Gaming
                                             ^Challenges framed as games

•   Correcting OCR
•   Rekeying Tables of Contents
•   Researching candidate Scientific Names
•   Image identification & extraction
    – http://biodivlib.wikispaces.com/Art+of+Life
    – Currently funded by NEH

Contenu connexe

Similaire à BHL: Big Data, Big Challenges

BHL Tech Overview for BHL-Europe
BHL Tech Overview for BHL-EuropeBHL Tech Overview for BHL-Europe
BHL Tech Overview for BHL-EuropeChris Freeland
 
BHL Technologies: Review for BHL-Australia
BHL Technologies: Review for BHL-AustraliaBHL Technologies: Review for BHL-Australia
BHL Technologies: Review for BHL-AustraliaChris Freeland
 
BHL Technical Director's Report, Mar. 2014
BHL Technical Director's Report, Mar. 2014BHL Technical Director's Report, Mar. 2014
BHL Technical Director's Report, Mar. 2014William Ulate
 
BHL @ #TDWG09 - with discussion
BHL @ #TDWG09 - with discussionBHL @ #TDWG09 - with discussion
BHL @ #TDWG09 - with discussionChris Freeland
 
BHL / EOL technology sit down
BHL / EOL technology sit downBHL / EOL technology sit down
BHL / EOL technology sit downChris Freeland
 
Global Library of Life: The Biodiversity Heritage Library
Global Library of Life: The Biodiversity Heritage LibraryGlobal Library of Life: The Biodiversity Heritage Library
Global Library of Life: The Biodiversity Heritage LibraryMartin Kalfatovic
 
Biodiversity Heritiage Library: progress and process
Biodiversity Heritiage Library: progress and processBiodiversity Heritiage Library: progress and process
Biodiversity Heritiage Library: progress and processPhil Cryer
 
Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Duncan Hull
 
Digital Libraries for Science: Botanicus and the Biodiversity Heritage Library
Digital Libraries for Science: Botanicus and the Biodiversity Heritage LibraryDigital Libraries for Science: Botanicus and the Biodiversity Heritage Library
Digital Libraries for Science: Botanicus and the Biodiversity Heritage LibraryChris Freeland
 
BHL Developments - Prague
BHL Developments - PragueBHL Developments - Prague
BHL Developments - PragueChris Freeland
 
An evaluation of taxonomic name finding & next steps in Biodiversity Heritage...
An evaluation of taxonomic name finding & next steps in Biodiversity Heritage...An evaluation of taxonomic name finding & next steps in Biodiversity Heritage...
An evaluation of taxonomic name finding & next steps in Biodiversity Heritage...Chris Freeland
 
Plays Well with Others, or What I’ve learned as a data provider in an intero...
Plays Well with Others, or What I’ve learned as a data provider in an intero...Plays Well with Others, or What I’ve learned as a data provider in an intero...
Plays Well with Others, or What I’ve learned as a data provider in an intero...Chris Freeland
 
Cross-Community User Requirements and the Biodiversity Heritage Library
Cross-Community User Requirements and the Biodiversity Heritage LibraryCross-Community User Requirements and the Biodiversity Heritage Library
Cross-Community User Requirements and the Biodiversity Heritage LibraryChris Freeland
 
BHL Markup Efforts and Plans
BHL Markup Efforts and PlansBHL Markup Efforts and Plans
BHL Markup Efforts and PlansWilliam Ulate
 
Furore devdays 2017- rdf1(solbrig)
Furore devdays 2017- rdf1(solbrig)Furore devdays 2017- rdf1(solbrig)
Furore devdays 2017- rdf1(solbrig)DevDays
 
BHL-Europe_MINERVA_20111116_hrainer
BHL-Europe_MINERVA_20111116_hrainerBHL-Europe_MINERVA_20111116_hrainer
BHL-Europe_MINERVA_20111116_hrainerHeimo Rainer
 
Lifting the Lid on Linked Data
Lifting the Lid on Linked DataLifting the Lid on Linked Data
Lifting the Lid on Linked DataJane Stevenson
 

Similaire à BHL: Big Data, Big Challenges (20)

BHL Tech Overview for BHL-Europe
BHL Tech Overview for BHL-EuropeBHL Tech Overview for BHL-Europe
BHL Tech Overview for BHL-Europe
 
BHL Technologies: Review for BHL-Australia
BHL Technologies: Review for BHL-AustraliaBHL Technologies: Review for BHL-Australia
BHL Technologies: Review for BHL-Australia
 
BHL Tech Report
BHL Tech ReportBHL Tech Report
BHL Tech Report
 
BHL @ #TDWG09
BHL @ #TDWG09BHL @ #TDWG09
BHL @ #TDWG09
 
BHL Technical Director's Report, Mar. 2014
BHL Technical Director's Report, Mar. 2014BHL Technical Director's Report, Mar. 2014
BHL Technical Director's Report, Mar. 2014
 
BHL @ #TDWG09 - with discussion
BHL @ #TDWG09 - with discussionBHL @ #TDWG09 - with discussion
BHL @ #TDWG09 - with discussion
 
BHL / EOL technology sit down
BHL / EOL technology sit downBHL / EOL technology sit down
BHL / EOL technology sit down
 
Global Library of Life: The Biodiversity Heritage Library
Global Library of Life: The Biodiversity Heritage LibraryGlobal Library of Life: The Biodiversity Heritage Library
Global Library of Life: The Biodiversity Heritage Library
 
Biodiversity Heritiage Library: progress and process
Biodiversity Heritiage Library: progress and processBiodiversity Heritiage Library: progress and process
Biodiversity Heritiage Library: progress and process
 
Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...
 
Digital Libraries for Science: Botanicus and the Biodiversity Heritage Library
Digital Libraries for Science: Botanicus and the Biodiversity Heritage LibraryDigital Libraries for Science: Botanicus and the Biodiversity Heritage Library
Digital Libraries for Science: Botanicus and the Biodiversity Heritage Library
 
BHL Developments - Prague
BHL Developments - PragueBHL Developments - Prague
BHL Developments - Prague
 
An evaluation of taxonomic name finding & next steps in Biodiversity Heritage...
An evaluation of taxonomic name finding & next steps in Biodiversity Heritage...An evaluation of taxonomic name finding & next steps in Biodiversity Heritage...
An evaluation of taxonomic name finding & next steps in Biodiversity Heritage...
 
Plays Well with Others, or What I’ve learned as a data provider in an intero...
Plays Well with Others, or What I’ve learned as a data provider in an intero...Plays Well with Others, or What I’ve learned as a data provider in an intero...
Plays Well with Others, or What I’ve learned as a data provider in an intero...
 
Cross-Community User Requirements and the Biodiversity Heritage Library
Cross-Community User Requirements and the Biodiversity Heritage LibraryCross-Community User Requirements and the Biodiversity Heritage Library
Cross-Community User Requirements and the Biodiversity Heritage Library
 
BHL Markup Efforts and Plans
BHL Markup Efforts and PlansBHL Markup Efforts and Plans
BHL Markup Efforts and Plans
 
Furore devdays 2017- rdf1(solbrig)
Furore devdays 2017- rdf1(solbrig)Furore devdays 2017- rdf1(solbrig)
Furore devdays 2017- rdf1(solbrig)
 
FAIRy Stories
FAIRy StoriesFAIRy Stories
FAIRy Stories
 
BHL-Europe_MINERVA_20111116_hrainer
BHL-Europe_MINERVA_20111116_hrainerBHL-Europe_MINERVA_20111116_hrainer
BHL-Europe_MINERVA_20111116_hrainer
 
Lifting the Lid on Linked Data
Lifting the Lid on Linked DataLifting the Lid on Linked Data
Lifting the Lid on Linked Data
 

Plus de Chris Freeland

From Eames & Young to Pruitt-Igoe
From Eames & Young to Pruitt-IgoeFrom Eames & Young to Pruitt-Igoe
From Eames & Young to Pruitt-IgoeChris Freeland
 
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...Chris Freeland
 
Building the Missouri Hub for DPLA
Building the Missouri Hub for DPLABuilding the Missouri Hub for DPLA
Building the Missouri Hub for DPLAChris Freeland
 
Organizing a DPLA Service Hub in Missouri
Organizing a DPLA Service Hub in MissouriOrganizing a DPLA Service Hub in Missouri
Organizing a DPLA Service Hub in MissouriChris Freeland
 
Built Works Registry: Geocoding Biodiversity Heritage Library
Built Works Registry: Geocoding Biodiversity Heritage LibraryBuilt Works Registry: Geocoding Biodiversity Heritage Library
Built Works Registry: Geocoding Biodiversity Heritage LibraryChris Freeland
 
A Digitization Primer for Botanical and Horticultural Librarians
A Digitization Primer for Botanical and Horticultural LibrariansA Digitization Primer for Botanical and Horticultural Librarians
A Digitization Primer for Botanical and Horticultural LibrariansChris Freeland
 
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Chris Freeland
 
MBG Rare Book Digitization Project (2003)
MBG Rare Book Digitization Project (2003)MBG Rare Book Digitization Project (2003)
MBG Rare Book Digitization Project (2003)Chris Freeland
 
BHL: Your 24hr Library
BHL: Your 24hr LibraryBHL: Your 24hr Library
BHL: Your 24hr LibraryChris Freeland
 
Seeding links from Wikipedia to BHL (2008 - 2012)
Seeding links from Wikipedia to BHL (2008 - 2012)Seeding links from Wikipedia to BHL (2008 - 2012)
Seeding links from Wikipedia to BHL (2008 - 2012)Chris Freeland
 
BHL: Assigning DOIs & Other Identifiers to Legacy Literature
BHL: Assigning DOIs & Other Identifiers to Legacy LiteratureBHL: Assigning DOIs & Other Identifiers to Legacy Literature
BHL: Assigning DOIs & Other Identifiers to Legacy LiteratureChris Freeland
 
Life & Literature Future Framing for BHL
Life & Literature Future Framing for BHLLife & Literature Future Framing for BHL
Life & Literature Future Framing for BHLChris Freeland
 
Approaches to preserving digitized taxonomic data
Approaches to preserving digitized taxonomic dataApproaches to preserving digitized taxonomic data
Approaches to preserving digitized taxonomic dataChris Freeland
 
Scribbles & Scraps: Darwin’s Library & Annotated Literature
Scribbles & Scraps: Darwin’s Library & Annotated LiteratureScribbles & Scraps: Darwin’s Library & Annotated Literature
Scribbles & Scraps: Darwin’s Library & Annotated LiteratureChris Freeland
 
Plant Name Services Using Tropicos
Plant Name Services Using TropicosPlant Name Services Using Tropicos
Plant Name Services Using TropicosChris Freeland
 
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...Chris Freeland
 
Global Strategy for Plant Conservation Target 1
Global Strategy for Plant Conservation Target 1Global Strategy for Plant Conservation Target 1
Global Strategy for Plant Conservation Target 1Chris Freeland
 
DPLA Technologies: Foundations for Growth & Sustainability
DPLA Technologies: Foundations for Growth & Sustainability DPLA Technologies: Foundations for Growth & Sustainability
DPLA Technologies: Foundations for Growth & Sustainability Chris Freeland
 
Digitized Public Domain Literature
Digitized Public Domain LiteratureDigitized Public Domain Literature
Digitized Public Domain LiteratureChris Freeland
 

Plus de Chris Freeland (20)

From Eames & Young to Pruitt-Igoe
From Eames & Young to Pruitt-IgoeFrom Eames & Young to Pruitt-Igoe
From Eames & Young to Pruitt-Igoe
 
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
Documenting the Now: Supporting Scholarly Use & Preservation of Social Media ...
 
Building the Missouri Hub for DPLA
Building the Missouri Hub for DPLABuilding the Missouri Hub for DPLA
Building the Missouri Hub for DPLA
 
Organizing a DPLA Service Hub in Missouri
Organizing a DPLA Service Hub in MissouriOrganizing a DPLA Service Hub in Missouri
Organizing a DPLA Service Hub in Missouri
 
Built Works Registry: Geocoding Biodiversity Heritage Library
Built Works Registry: Geocoding Biodiversity Heritage LibraryBuilt Works Registry: Geocoding Biodiversity Heritage Library
Built Works Registry: Geocoding Biodiversity Heritage Library
 
A Digitization Primer for Botanical and Horticultural Librarians
A Digitization Primer for Botanical and Horticultural LibrariansA Digitization Primer for Botanical and Horticultural Librarians
A Digitization Primer for Botanical and Horticultural Librarians
 
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
 
MBG Rare Book Digitization Project (2003)
MBG Rare Book Digitization Project (2003)MBG Rare Book Digitization Project (2003)
MBG Rare Book Digitization Project (2003)
 
BHL: Your 24hr Library
BHL: Your 24hr LibraryBHL: Your 24hr Library
BHL: Your 24hr Library
 
Seeding links from Wikipedia to BHL (2008 - 2012)
Seeding links from Wikipedia to BHL (2008 - 2012)Seeding links from Wikipedia to BHL (2008 - 2012)
Seeding links from Wikipedia to BHL (2008 - 2012)
 
BHL: Assigning DOIs & Other Identifiers to Legacy Literature
BHL: Assigning DOIs & Other Identifiers to Legacy LiteratureBHL: Assigning DOIs & Other Identifiers to Legacy Literature
BHL: Assigning DOIs & Other Identifiers to Legacy Literature
 
Global BHL Activities
Global BHL ActivitiesGlobal BHL Activities
Global BHL Activities
 
Life & Literature Future Framing for BHL
Life & Literature Future Framing for BHLLife & Literature Future Framing for BHL
Life & Literature Future Framing for BHL
 
Approaches to preserving digitized taxonomic data
Approaches to preserving digitized taxonomic dataApproaches to preserving digitized taxonomic data
Approaches to preserving digitized taxonomic data
 
Scribbles & Scraps: Darwin’s Library & Annotated Literature
Scribbles & Scraps: Darwin’s Library & Annotated LiteratureScribbles & Scraps: Darwin’s Library & Annotated Literature
Scribbles & Scraps: Darwin’s Library & Annotated Literature
 
Plant Name Services Using Tropicos
Plant Name Services Using TropicosPlant Name Services Using Tropicos
Plant Name Services Using Tropicos
 
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...
Biodiversity Heritage Library (BHL): A Global Resource for Open Access Scient...
 
Global Strategy for Plant Conservation Target 1
Global Strategy for Plant Conservation Target 1Global Strategy for Plant Conservation Target 1
Global Strategy for Plant Conservation Target 1
 
DPLA Technologies: Foundations for Growth & Sustainability
DPLA Technologies: Foundations for Growth & Sustainability DPLA Technologies: Foundations for Growth & Sustainability
DPLA Technologies: Foundations for Growth & Sustainability
 
Digitized Public Domain Literature
Digitized Public Domain LiteratureDigitized Public Domain Literature
Digitized Public Domain Literature
 

Dernier

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Dernier (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

BHL: Big Data, Big Challenges

  • 1. BHL: Big Data, Big Challenges Chris Freeland @chrisfreeland Founding Technical Director, BHL Sr. Director, University Academic Computing, Washington University
  • 2. >100,000 books, > 39 million pages, > 70TB data
  • 3. BHL Architecture: Window Seat Ed. Access Data APIs UI Exports Logic Geocoding Firewall Data Transform Name BHL DB Utilities Finding Storage Images (JP2) Internet Archive PDF Coordinate-based OCR XML metadata
  • 4. BHL Content: Structured Data • Metadata from Library catalogues at title/volume level <titleInfo> <title>CaroliLinnaei ... Species plantarum :exhibentesplantas rite cognitas, ad genera relatas, cum differentiisspecificis, nominibustrivialibus, synonymisselectis, locisnatalibus, secundumsystemasexualedigestas...</title> </titleInfo> <titleInfo type="abbreviated"><title>Sp. Pl.</title></titleInfo> <name type="personal"> <namePart>Linné, Carl von,</namePart> <namePart type="date">1707-1778</namePart> </name> <typeOfResource>text</typeOfResource> <genre authority="marcgt">book</genre> <originInfo> <place><placeTerm type="text">Holmiae :</placeTerm></place> <publisher>ImpensisLaurentiiSalvii,</publisher><dateIssued>1753.</dateIssued>
  • 5. BHL Content: Unstructured Data More than 39 million pages! ….of uncorrected OCR 
  • 6. Abbild ungen und Beschreibungen der Fische Syriens, nebst einer neuen Classification und Characteristik sämmtlicher Gattungen der i JOH. JAKOB HECKEL, Inipectoi am k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd. STUTTGART. E. Schweizerbart' sehe Verlagshandlung, 1843.
  • 7. *E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X a�m cv(f b1air�'o�et ert oiensr �; �', :�hlrfc�c wa ff�4am.diug bist a 6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl: bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas waIwutr Ober �ci ti 1V Ces ' wt gbtiemwwajfu tpctt, afferain 9 c: b�titbfof �r f eran m rs bra wlg auig4;f aer�m *mc vrt blatcabtfm wfru an'deg~m rt blas Iaum bwWt� run f ncmai b14ianf tJobrrfan ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W�e�&mcyfbq4 Mabtt mmw rc a iiu bc Jcn ncI.end.*, blat s. a u:�rprd3 rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i
  • 8. How to connect/consume http://biodivlib.wikispaces.com/Developer+Tools+and+API Data APIs UI Exports BHL DB Internet Archive Images (JP2) PDF Coordinate-based OCR XML metadata
  • 9. BHL Data Challenge: Name Finding pre-2007
  • 10. BHL Data Challenge: Name Finding • TaxonFinder algorithm in production since 2008 – More than 100 million candidate name strings – More than 1.5 million unique, verified names – Available through UI, APIs, Data Exports & Internet Archive • New collaboration with Global Names – Improved algorithm, better precision & recall – More data!
  • 11. New Data Challenges http://biodivlib.wikispaces.com/BHL+and+Gaming ^Challenges framed as games • Correcting OCR • Rekeying Tables of Contents • Researching candidate Scientific Names • Image identification & extraction – http://biodivlib.wikispaces.com/Art+of+Life – Currently funded by NEH